It sounds like a simple question. You have to load several data sets, perform some data cleansing, match against third-party data, compute several aggregates, develop rankings, group across several dimensions, benchmark against another data set, analyze for trends, and then normalize the data for multiple data visualizations.
In all likelihood, the algorithms that perform these functions will be implemented by different people, in different technologies, and perhaps at different stages of the analysis. End to end, they represent a complex data flow from data sources through computation and analysis to delivery.
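To make that flow concrete, here is a minimal sketch in pandas. The file names, column names, and join key are all hypothetical, and each stage would be far more involved in practice; the point is only how the stages chain together:

```python
import pandas as pd

# Hypothetical inputs: an internal extract and a third-party reference file.
sales = pd.read_csv("sales.csv")
reference = pd.read_csv("third_party_reference.csv")

# Cleansing: drop exact duplicates and rows missing the join key.
sales = sales.drop_duplicates().dropna(subset=["customer_id"])

# Matching: join to the third-party data on a shared key.
matched = sales.merge(reference, on="customer_id", how="left")

# Aggregation: total amount per region and product line.
agg = (matched.groupby(["region", "product_line"], as_index=False)
              .agg(revenue=("amount", "sum")))

# Ranking: order product lines within each region by revenue.
agg["rank"] = agg.groupby("region")["revenue"].rank(ascending=False)

# Normalization for visualization: scale revenue to a 0-1 range per region.
agg["revenue_norm"] = (agg.groupby("region")["revenue"]
                          .transform(lambda s: s / s.max()))
```

Even in this toy version, it is easy to imagine the cleansing, matching, and ranking steps being owned by different people and eventually different tools.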
Key Data Architecture Considerations
So my question is: where are you implementing these data processing functions? Where are your algorithms stored? How are they documented? How do you answer questions like, “Where should I do this data processing?” What is your big data culture: are you more likely to let data scientists choose the tool for each need, or are you centralizing these data architecture decisions?
Once implemented, how do you review which parts of your data processing need to be refactored? Maybe a step isn’t performing well? Maybe a data visualization required some last-mile data cleansing that should be moved upstream to benefit other analyses? Perhaps some algorithm fails the “KT” (knowledge transfer) test and is so complex it will be impossible to maintain?
Or maybe you’ve implemented something in a big data tool that has just released a major upgrade requiring substantial changes to your implementation? Or worse, perhaps the tool you selected is in decline, never having achieved critical mass, and now you have to explore alternatives and weigh the switching costs.
The reverse question is equally important. Perhaps you’re bundling some activity into the wrong tool and should consider expanding your technical architecture? Perhaps you’re spending too many cycles getting SQL to perform and should consider a NoSQL store? Maybe the Python scripts you developed for data integration are becoming unmanageable and an ETL tool is needed?
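On that last point, here is one hedged sketch of what “more manageable” might look like even before adopting an ETL tool: breaking a monolithic integration script into small, named steps run by a single loop. The function names, columns, and file paths are hypothetical; the value is that each step becomes testable and easier to relocate when you do switch tools.

```python
import pandas as pd

# Hypothetical refactor: each integration step is a small, named function that
# takes a DataFrame and returns one, rather than one long monolithic script.
def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop_duplicates().dropna(subset=["id"])

def enrich(df: pd.DataFrame, ref: pd.DataFrame) -> pd.DataFrame:
    return df.merge(ref, on="id", how="left")

def run_pipeline(src_path: str, ref_path: str) -> pd.DataFrame:
    ref = pd.read_csv(ref_path)
    df = pd.read_csv(src_path)
    # An ordered list of steps; easy to log, time, test, or move upstream later.
    steps = [cleanse, lambda d: enrich(d, ref)]
    for step in steps:
        df = step(df)
    return df
```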
Managing the Evolving Big Data Landscape and Growing Business Needs
Several practices can help you stay ahead of these questions:
- Invest in basic version control so that you can track changing implementations across platforms and practices.
- Evolve a data governance practice that starts with basic data dictionaries and documentation of algorithms.
- Build an agile data practice to make sure participants focus on the problems of highest business value and demo their results.
- Develop operational KPIs covering development cost, implementation complexity, and system performance to sense when an implementation shows signs of becoming a pain point (see the sketch after this list).
- Capture technical debt, data quality barriers, and other things that need improvement.
And most important:
- Invest time/resources to perform R&D and experiment.
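As a sketch of the KPI idea above, a few lines of Python can flag pipeline steps that are drifting into pain-point territory. The metric names, thresholds, and sample numbers are all illustrative; real baselines would come from your own run history:

```python
from dataclasses import dataclass

# Hypothetical operational record for one pipeline step.
@dataclass
class StepStats:
    name: str
    avg_runtime_sec: float
    failure_rate: float    # fraction of runs that failed
    lines_of_code: int     # rough proxy for implementation complexity

# Illustrative thresholds; tune these against your own baselines.
def flag_pain_points(stats: list[StepStats]) -> list[str]:
    flags = []
    for s in stats:
        if s.avg_runtime_sec > 600:
            flags.append(f"{s.name}: slow (avg {s.avg_runtime_sec:.0f}s per run)")
        if s.failure_rate > 0.05:
            flags.append(f"{s.name}: flaky ({s.failure_rate:.0%} of runs fail)")
        if s.lines_of_code > 1000:
            flags.append(f"{s.name}: complex ({s.lines_of_code} lines)")
    return flags

# Example with fabricated run statistics.
print(flag_pain_points([
    StepStats("cleanse", avg_runtime_sec=42.0, failure_rate=0.01, lines_of_code=180),
    StepStats("match_third_party", avg_runtime_sec=900.0, failure_rate=0.08, lines_of_code=1400),
]))
```

Even a crude scorecard like this turns the “when should we refactor?” question from a debate into a review of numbers.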
Thanks to Matt Turck: Is Big Data Still a Thing?