Recent surveys have shown that the data lake market is expected to grow to $20.1 billion by 2024, with a growing number of organizations looking to deploy a data lake in the coming years. However, despite growing interest in big data initiatives, many organizations run into the same roadblock: building a data lake is a complex, manual process that requires hiring skilled personnel who are in short supply.
This raises the question: why are data lakes still so difficult in the age of everything-as-a-service? Why are they still dependent on a select group of specialists skilled in arcane programming languages and frameworks? In other words – what’s stopping the data lake from becoming ‘productized’?
Most organizations didn’t really do big data (until recently)
Data lakes are, almost by definition, meant for big data. You build a data lake when the volume, velocity or variety of the data you’re dealing with makes a database unviable. Which types of organizations actually have big data and, consequently, a need for data lake storage?
While “big data” as a buzzword has been around for decades, the number of companies engaged in bona fide big data initiatives in production was never significant. AI, predictive analytics, unstructured data analysis – these types of projects were, and still are, more aspirational than operational for most organizations.
Outside of behemoths such as Google and Facebook, or specific niches such as predictive advertising and algo-trading, most data initiatives could be termed “more BI than AI” – solving challenges around access to structured data, reporting and SQL-based analytics. Very few companies actually had database-breaking levels of data.
The companies that did work with big data were either engineering-focused organizations or massive enterprises that could afford to spend heavily on Hadoop-based initiatives and required tailor-made solutions (see our article on Building versus Buying Data Infrastructure). The mass market never really needed data lakes, and thus no mass-market solution was born.
However, there is reason to believe that this is no longer the case.
What’s changed in 2020?
While big data has not gone completely mainstream – a mom-and-pop retail company is probably more concerned with inventory management than neural networks – there is a definite growth in data volumes and in appetite for advanced analytics. Several factors are driving this growth:
- As more and more aspects of life continue to move online, more data is being generated (think shopping, dating, networking, etc.). Even non-tech companies are now heavily invested in software development through their websites and mobile applications. A retail company whose website gets a million monthly visitors suddenly needs to start thinking in big data terms.
- Analytics and reporting are somewhat of a ‘solved’ problem, with swathes of self-service BI and database tools that make it (relatively) easy to generate dashboards and reports. This creates interest in more experimental use cases focused on data science and machine learning.
- Growing interest in mobile, industrial IoT, wearables, and AI applications – all of which generate massive volumes of data.
The growth in big and unstructured data means that more companies need data lakes, and more users within the organization want access to the data. Data lake adoption can therefore no longer be limited to data engineers: DBAs, citizen engineers, analysts and scientists are all users who don’t understand big data engineering but want to use a data lake. They need an easier method than code.
Why today’s data lakes cannot be considered self-service
Infrastructure as a service is widely available and used, with Amazon Web Services being the largest provider, while Microsoft’s Azure and Google Cloud are both picking up pace. Within these cloud offerings, one can purchase various components of data infrastructure as managed services – wherein the cloud provider deals with all things hardware and resource provisioning, while the end user can focus on developing applications.
This is not yet a reality for data lakes. Elastic storage is available with solutions such as Amazon S3; but data management, schema discovery and ETL still revolve around Spark/Hadoop – coding platforms which are built for big data engineers (read more about the limitations of Apache Spark).
When the only companies with data lakes in production were companies with vast data engineering knowledge, there was no need for self-service alternatives. Various solutions emerged around managed Spark, or wrapping Spark pipelines with a visual UI; however, the underlying reliance on code means these services are still built for domain experts rather than data analysts or even ‘regular’ developers.
Data lakes are still built by and for engineers. They are defined by technical rather than business requirements; operations are imperative rather than declarative. If the idea behind software-as-a-service is that businesses can focus on features rather than infrastructure – with data lakes this is rarely the case.
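To make the contrast with declarative operations concrete, here is a minimal sketch in Python. It computes the same per-page totals twice: once with hand-written, step-by-step code, and once by stating the desired result in SQL and letting an engine work out the steps. The event data, column names and function names are all hypothetical, and the in-memory SQLite database merely stands in for a real query engine; none of this represents any particular data lake product.

```python
import sqlite3
from collections import defaultdict

# Hypothetical click events: (user_id, page, seconds_on_page)
EVENTS = [
    ("u1", "home", 12),
    ("u2", "home", 30),
    ("u1", "cart", 45),
    ("u3", "cart", 15),
    ("u2", "home", 8),
]


def stepwise_totals(events):
    """Step-by-step style: the author spells out how to loop, group and sum."""
    totals = defaultdict(int)
    for _user, page, seconds in events:
        totals[page] += seconds
    return dict(totals)


def declarative_totals(events):
    """Declarative style: describe the result in SQL; the engine plans the work."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real query engine
    try:
        conn.execute(
            "CREATE TABLE events (user_id TEXT, page TEXT, seconds INTEGER)"
        )
        conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
        rows = conn.execute(
            "SELECT page, SUM(seconds) FROM events GROUP BY page"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)


if __name__ == "__main__":
    print(stepwise_totals(EVENTS))
    print(declarative_totals(EVENTS))
```

Both functions return the same totals. The difference is who owns the "how": in the first, the author does; in the second, the engine does, which is what makes the logic readable to anyone who knows SQL.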
Building a GUI for cloud data lakes
We’ve seen that there is a growing demand for a simpler, more accessible data lake, and that current solutions do not fit the bill. That’s where our own product comes into the picture – you can read a bit more below, or watch this webinar instead.
One of the ideas behind Upsolver is to introduce self-service capabilities to cloud data lakes. Upsolver’s data lake ETL platform is built to provide end-to-end self-service for building and operationalizing data lakes – stream ingestion, data transformation, and data preparation for analytics are all ‘productized’ behind a visual interface and SQL.
In order to give users an experience that’s as close to data-lake-as-a-service as possible, we relied on these principles:
- A platform for every developer rather than domain experts: For the data lake to be accessible, it needs to be managed with the tools that most developers already know rather than requiring specialized knowledge. That means replacing the reliance on Spark/Hadoop, Scala and Python with the one language that every data-savvy user knows – good old ANSI SQL.
- No coding required: Upsolver allows you to build and manage a data lake declaratively, defining your business logic and tables using a visual interface and SQL rather than code.
- Runs on fully managed infrastructure: By storing all data on S3 and automatically provisioning compute resources as needed, Upsolver removes all infrastructure management from the equation (no need to deploy or configure clusters for ETL jobs).
The specific features of the product – such as automatic data partitioning and compaction, integration with Hive metastores, and schema discovery – are built around these principles. To get a deeper glimpse into how the product works, you can schedule a free demonstration here.
More productized offerings to come
To the best of our knowledge, Upsolver is the only product currently on the market that offers this level of self-service for cloud data lakes. However, with growing interest in data lakes we expect to see more competition and more productized offerings for data ingestion, serverless querying and more. Some of these services will compete with our own product, others will complement it; either way, we’re looking forward to seeing what lies ahead!