Last update: September 16, 2022
This article was originally written before the current wave of buzz surrounding data lakehouse and data mesh architectures. However, we believe that, if anything, recent developments in the big data market have validated the original thesis presented here: that the market is moving towards self-service solutions for data lakes. We also believe that much of the existing technology still relies on a deep understanding of Spark/Hadoop, and is thus limited to a small cohort of experts despite vendors’ claims otherwise. Do you agree?
Recent surveys have shown that the data lake market is expected to grow to $20.1 billion by 2024, with a growing number of organizations looking to deploy a data lake in coming years. However, despite growing interest in data initiatives, a roadblock many organizations run into is the complex, manual nature of building a data lake, which requires hiring skilled personnel who are in short supply.
This raises the question: why are data lakes still so difficult in the age of everything-as-a-service? Why are they still dependent on a select group of specialists skilled in arcane programming languages and frameworks? In other words – what’s stopping the data lake from becoming ‘productized’?
Why data lakes were traditionally more of a niche interest
Data lakes are, almost by definition, meant for big data. You build a data lake when the volume, velocity or variety of the data you’re dealing with makes a database unviable. Which types of organizations actually have a need to deal with petabytes of structured and unstructured data – and, consequently, a need for data lake storage?
While “big data” as a buzzword has been around for decades, the number of companies engaged in bona fide big data initiatives in production was never significant. AI, predictive analytics, unstructured data analysis – these types of projects were, and still are, more aspirational than operational for most organizations.
Outside of behemoths such as Google and Facebook, or specific niches such as predictive advertising and algo-trading, most data initiatives could be termed “more BI than AI” – solving challenges around access to structured data, reporting and SQL-based analytics. Very few companies actually had database-breaking levels of raw data.
The types of companies that worked with big data were either engineering-focused organizations or massive enterprises, which could afford to pour near-endless resources into Hadoop-based initiatives. They also required tailor-made solutions (see our article on Building versus Buying Data Infrastructure), making GUI-based solutions less attractive. The mass market never really needed data lakes to analyze data, and thus no mass market solution was born.
However, there is reason to believe that this is no longer the case.
The emerging need for data lake services
While big data has not gone completely mainstream – a mom-and-pop retail company is probably more concerned with inventory management than neural networks – there is a definite growth in data volumes and in appetite for advanced analytics. Several factors are driving this growth:
- Raw data volumes. As more and more aspects of life continue to move online (think shopping, dating, networking, etc.), more data is being generated. Even non-tech companies are now heavily invested in software development through their websites and mobile applications. A retail company whose website gets a million monthly visitors suddenly needs to start thinking in big data terms.
- The move from data analytics to data science. Analytics and reporting is somewhat of a ‘solved’ problem, with swathes of self-service BI and database tools that can be used to generate dashboards and reports in a way that’s (relatively) easy. This creates interest in more experimental use cases focused on data science and machine learning.
- Streaming sources. There is growing interest in mobile, industrial IoT, wearables, and AI applications, all of which generate massive volumes of data.
The growth in big and unstructured data means that more companies need data lakes, and more users within the organization want access to the data. This means that data lake adoption can no longer be limited to data engineers: DBAs, citizen engineers, analysts and data scientists are all users who don’t understand big data engineering but want to use a data lake. They need an easier method than code, and an experience that’s more similar to the data warehouses they’re used to working with. This is why the differences between solutions such as Databricks and Snowflake are getting smaller – there is a rush to offer SQL-based, managed data lake services.
Why today’s data lakes cannot be considered self-service
Infrastructure as a service is widely available and used, with Amazon Web Services being the largest provider, while Microsoft’s Azure and Google Cloud are both picking up pace. Within these cloud offerings, one can purchase various components of data infrastructure as managed services, wherein the cloud provider deals with all things hardware and resource provisioning, while the end user can focus on developing applications.
This is not yet a reality for data lakes – there is no lake-as-a-service. Elastic storage is available with solutions such as Amazon S3; but data management, schema discovery and ETL still revolve around Spark/Hadoop – coding platforms which are built for big data engineers (read more about the limitations of Apache Spark). Features such as data security and permissions for sensitive data must also often be built on top of other infrastructure.
When the only companies with data lakes in production were companies with vast data engineering knowledge, there was no need for self-service alternatives. Various solutions emerged around managed Spark, or wrapping Spark pipelines with a visual UI; however, the underlying reliance on code means these services are still built for domain experts rather than data analysts or even ‘regular’ developers.
Data lakes are still built by and for engineers. They are defined by technical rather than business requirements; operations are imperative rather than declarative. If the idea behind software-as-a-service is that businesses can focus on features rather than infrastructure, this is rarely the case with data lakes.
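To make the contrast between step-by-step code and a declarative definition concrete, here is a minimal sketch in Python. It answers the same question (clicks per user) twice: once by spelling out how to compute it, and once by stating what is wanted in SQL. The sample events, table name and column names are hypothetical, and the in-memory SQLite database merely stands in for whatever query engine sits on top of a real data lake.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample events (user, action); names are illustrative only.
EVENTS = [
    ("alice", "click"),
    ("bob", "view"),
    ("alice", "view"),
    ("alice", "click"),
]


def clicks_per_user_imperative(events):
    """Spell out HOW to compute the result: loop, filter, accumulate."""
    counts = defaultdict(int)
    for user, action in events:
        if action == "click":
            counts[user] += 1
    return dict(counts)


def clicks_per_user_declarative(events):
    """State WHAT result is wanted and let the SQL engine plan the work."""
    conn = sqlite3.connect(":memory:")  # stand-in for a lake query engine
    conn.execute("CREATE TABLE events (user_id TEXT, action TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", events)
    rows = conn.execute(
        "SELECT user_id, COUNT(*) FROM events "
        "WHERE action = 'click' GROUP BY user_id"
    ).fetchall()
    conn.close()
    return dict(rows)
```

Both functions return the same mapping for the sample data. The difference is in who owns the execution details: in the first version the author of the code does, while in the second the engine does, which is what makes the approach accessible to anyone who can write a query.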
Building a GUI for cloud data lakes – the data lake as a service
We’ve seen that there is a growing demand for a simpler, more accessible data lake architecture, and that current solutions do not fit the bill. That’s where our own product comes into the picture – you can read a bit more below, or watch this webinar instead.
One of the ideas behind Upsolver is to introduce self-service capabilities to cloud data lakes. Upsolver’s data lake analytics platform is built to provide end-to-end self-service for building and operationalizing data lakes – stream ingestion, data transformation, and data preparation for analytics are all ‘productized’ behind a visual interface and SQL.
In order to give users an experience that’s as close to data-lake-as-a-service as possible, we relied on these principles:
- A platform for every developer rather than domain experts: For the data lake to be accessible, it needs to be managed with the tools that most developers already know rather than requiring specialized knowledge. That means replacing the reliance on Spark/Hadoop, Scala and Python with the one language that every data-savvy user knows – good old ANSI SQL.
- No coding required: Upsolver allows you to build and manage a data lake declaratively, defining your business logic and tables using a visual interface and SQL rather than code.
- Runs on fully managed infrastructure: By storing all data on S3 and automatically provisioning compute resources as needed, Upsolver removes all infrastructure management from the equation (no need to deploy or configure clusters for ETL jobs).
The specific features of the product, such as automatic data partitioning and compaction, integration with Hive metastores, and schema discovery, are built around these principles. To get a deeper glimpse into how the product works, you can schedule a free demonstration here.
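For readers unfamiliar with the terms, the following Python sketch illustrates what partitioning and compaction mean in general. It is not Upsolver’s implementation: the function names, the `ts` field, the Hive-style `date=YYYY-MM-DD` key and the record-count size limit are all assumptions made for illustration, and "files" are modeled as plain in-memory lists of records.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical input: three tiny "files", each a list of event records.
SMALL_FILES = [
    [{"ts": 1663286400, "user": "a"}],
    [{"ts": 1663290000, "user": "b"}],
    [{"ts": 1663372800, "user": "a"}],
]


def partition_key(record):
    """Derive a Hive-style date partition key from an epoch timestamp (UTC)."""
    day = datetime.fromtimestamp(record["ts"], tz=timezone.utc).date()
    return f"date={day.isoformat()}"


def partition_files(small_files):
    """Group the records of many small files by the partition they belong to."""
    partitions = defaultdict(list)
    for records in small_files:
        for record in records:
            partitions[partition_key(record)].append(record)
    return dict(partitions)


def compact(records, target_size):
    """Rewrite a partition into few files of at most target_size records each."""
    return [
        records[i:i + target_size]
        for i in range(0, len(records), target_size)
    ]
```

With the sample data, the first two records land in the same daily partition and can be merged into one larger file, while the third goes to the next day’s partition. Query engines generally benefit from this layout because they can skip irrelevant partitions and open fewer, larger files.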
More productized offerings to come
To the best of our knowledge, Upsolver is the only product currently on the market that offers this level of self-service for cloud data lakes. However, with growing interest in data lake technology, we expect to see more competition and more productized offerings for data ingestion, serverless querying, data security and more. Some of these services will compete with our own product, others will complement it; either way, we’re looking forward to seeing what lies ahead!
Try SQLake for free (early access)
SQLake is Upsolver’s newest offering. It lets you build and run reliable data pipelines on streaming and batch data via an all-SQL experience. Try it for free. No credit card required.