The new year is upon us, as is a delightfully symmetrical new decade. And while we’re just as exhausted as you are with the endless ‘summary of the decade’ lists, tweets and articles, there’s still some fun to be had in trying to predict what’s coming in the next year or ten. Here is our humble attempt to guess what the future has in store for our particular domain of big data infrastructure - data lakes, stream processing and ETL.
A Tectonic Shift from Infrastructure to Value
In 2020, big data and data lakes will continue to follow in the footsteps of business intelligence and databases - moving from lengthy IT-centric deployments to agile operations focused on delivering business value.
We believe that the main trend defining big data in 2020 and beyond is the ongoing shift towards self-service, which is now becoming viable in this space. Organizations want to spend their engineering resources on the places that contribute most to their bottom line - and ‘data pipeline maintenance’ is rarely that place (see Big Data Infrastructure: When to Build, When to Buy).
The current state of big data tools and data lake implementations is similar to that of database and analytics deployments in the early 2000s: never-ending, IT-centric projects that rely on a patchwork of open-source and proprietary building blocks, locked behind the closed gate of rare domain expertise.
However, in the two decades that have passed since then, analytics has become much simpler and more business-driven, with a wide variety of BI, ETL and analytics tools that are built for business users rather than engineering experts.
We predict that a similar shift will happen in big data, driven by tools that make data lakes more accessible and remove the engineering roadblocks that currently make it excruciatingly difficult to get value from big and streaming data. And yes, our own platform is definitely in this category.
This key trend will drive the rest of the developments we are likely to see this year, including:
More Productized Offerings
2020 will see the introduction of new product-based solutions to data engineering problems that were traditionally solved with code.
We expect to see more product-based offerings from software vendors for big data ingestion, processing and analytics - problems which used to be solved using open-source frameworks such as Apache Spark, Hadoop, Flink and Flume.
This is both a driver and a consequence of the growing demand for self-service in the data lake space: the availability of self-service tools leads the business to expect more value from data lake implementations and increases the appetite for investment in this area, which in turn leads to more tools being developed and released. We expect to see dozens of such tools being released in the coming years.
The World Moves Beyond Spark
Spark has ruled the big data landscape for the past decade, but it is ill-suited for the needs of modern data-driven organizations.
Apache Spark is a versatile and powerful framework and has become the de facto standard for big data ETL. It is so ubiquitous that even the commercial products in this space are typically managed Spark deployments (including Databricks and AWS Glue).
However, while Spark is still an excellent solution for many scenarios, there are also many cases where it can be a major hindrance (see Spark Alternatives by Use Case). Spark pipelines are complex and difficult to maintain, and typically require significant data engineering effort in an age when data engineers are notoriously difficult to find. Spark is also ill-suited to low-latency use cases due to its reliance on batch processing.
We believe that the demand for self-service infrastructure will challenge the current dominance of Apache Spark, and lead more organizations to explore alternative solutions that focus on ease of use and SQL-based data transformation, rather than lengthy coding in Scala or Python.
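To make the contrast concrete, here is a minimal sketch of what SQL-based transformation looks like in practice. It uses Python's built-in sqlite3 module purely as a stand-in for a SQL engine; the `events` table, its columns and the sample rows are all hypothetical, and this is not how any particular data lake product works under the hood.

```python
import sqlite3

# Hypothetical click events: (user_id, page, duration_seconds).
events = [
    ("u1", "/home", 12),
    ("u1", "/pricing", 30),
    ("u2", "/home", 5),
    ("u2", "/docs", 45),
    ("u2", "/docs", 15),
]

# An in-memory SQLite database stands in for the query engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id TEXT, page TEXT, duration_seconds INTEGER)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)

# The whole transformation is a single declarative statement:
# the author says what result is wanted, not how to compute it.
query = """
    SELECT user_id,
           COUNT(*)              AS page_views,
           SUM(duration_seconds) AS total_seconds
    FROM events
    GROUP BY user_id
    ORDER BY user_id
"""
per_user = conn.execute(query).fetchall()
print(per_user)  # [('u1', 2, 42), ('u2', 3, 65)]
```

The point of the sketch is the shape of the work: the aggregation logic lives in a query that an analyst can read and change, rather than in application code that has to be built, deployed and maintained.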
Logs and Semi-structured Data Go Mainstream
The growth of mobile, web and IoT data means that almost every company has a log analysis project somewhere in the works.
Log analysis used to be the sole domain of DevOps engineers monitoring server downtime, or security researchers investigating suspicious anomalies in network traffic. However, this reality is quickly changing as a growing number of organizations undertake projects that generate significant amounts of log data: almost any mid-sized (or larger) company is likely to have an ongoing initiative around mobile applications, advanced web analytics or IoT devices.
These types of projects generate massive amounts of log data that will challenge existing data warehouse deployments, and lead more organizations to explore a data lake approach to log analysis.
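One reason this kind of data strains a fixed-schema warehouse is that every source logs a different set of fields. The sketch below illustrates the idea with three hypothetical JSON log lines (the field names and values are invented for the example) and a small helper that flattens nested fields and discovers the column set at read time, in the spirit of schema-on-read.

```python
import json

# Hypothetical log lines from three different sources; no two share a schema.
raw_lines = [
    '{"ts": "2020-01-01T00:00:00Z", "source": "mobile", "event": "app_open", "device": {"os": "ios"}}',
    '{"ts": "2020-01-01T00:00:05Z", "source": "web", "event": "page_view", "url": "/pricing"}',
    '{"ts": "2020-01-01T00:00:09Z", "source": "iot", "event": "reading", "sensor": {"temp_c": 21.5}}',
]


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names, e.g. device.os."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


rows = [flatten(json.loads(line)) for line in raw_lines]

# The schema is discovered from the data instead of being declared up front.
columns = sorted({col for row in rows for col in row})
print(columns)
# ['device.os', 'event', 'sensor.temp_c', 'source', 'ts', 'url']
```

Only three of the six columns appear in every record. With real log volumes and dozens of event types, the set of sparse columns keeps growing, which is where storing the raw events and applying structure at query time starts to look attractive.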
Growing Interest in Streaming Architecture
Machine learning and real-time analytics create a need for fresher data, driving increased interest in streaming architecture.
In addition to dealing with streaming sources such as web activity and sensor data, organizations are increasingly gravitating towards advanced analytical use cases - AI, machine learning and real-time analytics.
Many such initiatives are expected to go from experimental to operational in the coming years, and streaming data infrastructure is needed in order to enable real-time and near real-time access to data.
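As a rough illustration of what "streaming" means at the code level, the sketch below counts events per fixed-size (tumbling) time window and emits each window as soon as it closes, instead of waiting for a nightly batch. The event stream, the 10-second window and the function name are assumptions made for the example; a production system would also need to handle out-of-order and late events, which this sketch ignores.

```python
from collections import Counter


def tumbling_counts(events, window_seconds):
    """Yield (window_start, counts) each time a window closes.

    Assumes events arrive as (timestamp_seconds, key) in time order.
    Windows that contain no events are not emitted.
    """
    current_start = None
    counts = Counter()
    for ts, key in events:
        window_start = ts - (ts % window_seconds)
        if current_start is not None and window_start != current_start:
            # The previous window is complete: emit it and start a new one.
            yield current_start, dict(counts)
            counts = Counter()
        current_start = window_start
        counts[key] += 1
    if current_start is not None:
        # Emit the final, still-open window when the stream ends.
        yield current_start, dict(counts)


# Hypothetical stream of (timestamp_seconds, event_type) pairs.
stream = [
    (0, "page_view"), (3, "click"), (7, "page_view"),
    (12, "click"), (14, "click"),
    (31, "page_view"),
]

results = list(tumbling_counts(stream, window_seconds=10))
for window_start, counts in results:
    print(window_start, counts)
```

Because results are produced window by window, a downstream model or dashboard sees data that is at most one window old, which is the freshness that these use cases depend on.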
Simplify Your Data Architecture in 2020
Ready to jump on the self-service bandwagon? Upsolver can get you there faster. Schedule a demo with one of our solution architects to learn how you can introduce flexibility, agility and self-service to your big data infrastructure; or check out our technical whitepaper to understand how the magic happens.