If you only read the bombastic headlines, you might be forgiven for thinking that Big Data is the name of a real-life superhero: fighting crime, busting traffic jams and even curing diseases. But when you work with data for a living, you quickly find out that underneath the shiny facade, ‘doing big data’ is also a major pain.
Welcome to the Data Grind
In a world that thinks of data in terms of Excel spreadsheets and Tableau-style dashboards, it’s hard to convey how difficult it is to work with actual big data - high-volume, high-velocity data streams generating billions of records every single day (think IoT sensors or behavioral analytics). At this scale, things get pretty ugly.
For example, let’s say you want to answer a question with big data. It doesn’t have to be as complex as autonomous driving; it can be as simple as “how are visitors engaging with my mobile app?” Sounds easy enough, right? All you need to do is:
- Build a data lake - a raw data repository, where you will have to store the data according to a gazillion best practices around compression, partitioning and even naming conventions. Ignore those best practices, and things will start breaking further down the road.
- Write code to understand what your data looks like in terms of schema, fields, etc. That’s right - you need to write code just to understand what data you’re bringing in.
- Write additional long and complicated code for ETL jobs, which you need to perform ‘blind’ as there is no visual representation of the data. Also, running each job takes hours, so better make sure you get it right the first time!
- Allocate a developer to maintain an orchestration system like Apache Airflow or Apache NiFi to run the ETL jobs efficiently and consistently.
- If you’re unlucky enough to need stateful ETLs, you’ll also need to spin up a NoSQL database to manage state.
- Integrate and maintain an analytics database such as Amazon Redshift to run SQL queries against.
- Finally answer the business question, 4-8 months and thousands of developer hours after you first asked it.
And this shockingly difficult and complex process is not a one-off affair. Answering the next question, adding new data sources or enabling another use case will all take weeks or months of development. It’s a nightmare of operating a dozen code-intensive moving parts in tandem, to the tune of hundreds of thousands of dollars in software, storage and manpower costs.
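To give a sense of what “writing code just to understand your data” means in practice, here is a minimal Python sketch of the schema discovery step. It assumes the raw events arrive as newline-delimited JSON, and the field names (user_id, screen, duration_ms) and the profile_events function are made up for illustration. At real scale you would run something like this on a distributed engine over a sample of the lake, not in a single process.

```python
import json
from collections import Counter, defaultdict


def profile_events(lines):
    """Infer a rough schema from newline-delimited JSON events.

    Hypothetical helper for illustration only: it reports, per top-level
    field, the observed types, the share of events containing the field,
    and the number of distinct scalar values.
    """
    total = 0
    seen_in = Counter()          # field -> number of events containing it
    types = defaultdict(set)     # field -> observed Python type names
    distinct = defaultdict(set)  # field -> distinct scalar values

    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        total += 1
        for field, value in event.items():
            seen_in[field] += 1
            types[field].add(type(value).__name__)
            # Nested objects and arrays are not counted as distinct values.
            if isinstance(value, (str, int, float, bool)):
                distinct[field].add(value)

    return {
        field: {
            "types": sorted(types[field]),
            "occurrence": seen_in[field] / total,
            "distinct_values": len(distinct[field]),
        }
        for field in seen_in
    }


# Made-up sample events; note the inconsistent type of duration_ms.
sample = [
    '{"user_id": "u1", "screen": "home", "duration_ms": 1200}',
    '{"user_id": "u2", "screen": "cart"}',
    '{"user_id": "u1", "screen": "home", "duration_ms": "n/a"}',
]

profile = profile_events(sample)
for field, stats in profile.items():
    print(field, stats)
```

Even on three toy events, the profile shows a field that is missing from a third of the records and arrives as both a number and a string. That is exactly the kind of surprise you would rather find before a multi-hour ETL job does.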
Catching Up with Small Data
All of this is especially frustrating when you compare it with a similar scenario for someone who’s working with ‘small’ data (e.g., ERP and financial data). For that you would get a database, slap some SQL queries and a dashboard on top of it and bam! – you already have something useful.
When it comes to small data, code-intensive processes are often replaced by GUI-based tools; cumbersome, patchwork architectures are less common; and end users are relatively self-sufficient. One person with basic SQL knowledge can access and use business data to a reasonable extent.
It’s unlikely that big data is going to be this simple anytime soon, but there are some things that can be done to close the gap.
In developing the Upsolver Data Lake Platform, we attacked the problem of simplifying big data from multiple angles. Our goal was to reduce, as much as possible, the amount of time and resources it takes to transform raw data streams into usable data. We identified three major areas where this can be done:
- Combining building blocks. There’s no real reason you need to use three separate open-source frameworks for data cataloging, integration and serving. By engineering a system that’s built to address common big and streaming data analytics use cases, we’ve cut down the number of systems you need to operate to get your data into a workable state.
- Visual data management. Instead of writing code to pull in a sample of the data and working with only a vague notion of what the actual schema is, we’ve developed a visual catalog that instantly shows the data structure and important statistics such as distinct values, value distribution and occurrence in ingested events.
- Automating code-intensive processes. A lot of the work in big data is just maintaining best practices around storage, partitioning, working with SQL engines, etc. We’ve written thousands of lines of code to include these best practices as built-in features of our system, ensuring highly optimized performance at minimal cost.
Has this made big data as simple as Excel? No, not really. You still need to have a good grasp of your data architecture. But you don’t need massive data engineering teams running six-month projects just to run some analytical queries. You can launch data science projects faster. Your DBA spends less time on infrastructure.
In other words, big data sucks a little bit less - and that’s a win in my book!