Is it time to move your data lake to the cloud? As with any infrastructure choice, there are advantages and trade-offs to deploying in the cloud vs on-premises, and the decision needs to be made case by case, based on considerations such as scale, cost, and available technical resources.
There are many data lake cloud services that offer a compelling alternative to traditional on-premise infrastructure. This post will walk you through the basics of cloud-based data lakes and explain the data lake offerings of the big three cloud providers, so you can make an informed decision as you transition your data lake to the cloud.
In this article you will learn:
- What is a data lake
- Challenges of on-prem data lakes and advantages of moving to the cloud
- Cloud data lake offerings by AWS, Microsoft and Google
- Drawbacks of cloud data lakes
- How to set up data lake analytics instantly in the cloud
What is a Data Lake?
A data lake is a scalable, centralized repository that can store raw data. Data lakes differ from data warehouses as they can store both structured and unstructured data, which you can process and analyze later. This removes much of the overhead associated with traditional database architectures, which would typically involve lengthy ETL and data modeling when ingesting the data (to impose schema-on-write).
The schema-on-read data model, on the other hand, allows you to structure data when you retrieve it from storage. This provides a higher level of flexibility in data analysis and exploration while enabling organizations to easily store massive volumes of data.
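To make the idea concrete, here is a minimal sketch of schema-on-read in plain Python. The sample events, the `PURCHASE_SCHEMA` mapping and the `read_with_schema` helper are all hypothetical names invented for illustration. In a real lake the raw lines would sit in object storage and the projection would typically be done by a query engine.

```python
import json

# Raw events are stored exactly as produced; nothing is validated on write.
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.90", "ts": "2024-01-05T10:00:00"}',
    '{"user": "bo", "amount": 5, "country": "DE"}',
    '{"user": "cy"}',
]

# Hypothetical schema chosen by the reader at query time: field -> (cast, default).
PURCHASE_SCHEMA = {
    "user": (str, ""),
    "amount": (float, 0.0),
}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and project each record onto the given schema."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field)
            # Missing fields fall back to the default instead of failing.
            row[field] = cast(value) if value is not None else default
        rows.append(row)
    return rows


rows = read_with_schema(RAW_EVENTS, PURCHASE_SCHEMA)
total_amount = sum(row["amount"] for row in rows)
```

The same raw lines could later be read with a different schema (for example one that keeps `country`) without re-ingesting anything, which is the flexibility described above.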
Still confused? Check out these 4 examples of data lake architectures.
Data lakes offer organizations a solution for collecting big data, which can then be manipulated and mined for insights by data scientists, analysts, and developers. However, data stored in a data lake is raw, often unstructured, and difficult to use as is. To make the data useful to data consumers, you need to process and prepare it for analysis, which is often challenging for organizations that lack extensive big data engineering resources.
Cloud Data Lake or On-Premises Data Lake?
While early data lakes were built on HDFS clusters on-premises, organizations are moving their data lakes to the cloud as infrastructure-as-a-service offerings grow increasingly popular. What are the problems faced by organizations when setting up on-premise infrastructure, and do cloud providers offer a complete solution?
Challenges of On-Premise Data Lakes
- Complexity of building data pipelines: when you build your own on-premise infrastructure, you need to manage both the hardware side (spinning up servers, orchestrating batch ETL jobs, and dealing with outages and downtime) and the software side, which requires data engineers to integrate a wide range of tools used to ingest, organize, pre-process and query the data stored in the lake.
- Maintenance costs: aside from the upfront investment needed to purchase servers and storage equipment, an on-premise data lake carries ongoing management and operating costs, mostly in the form of IT and engineering time.
- Scalability: if you want to scale up your data lake to support more users or bigger data, you’ll need to manually add and configure servers. You need to keep a close eye on resource utilization, and any additional servers create additional maintenance and operating costs.
Advantages of Moving Your Data Lake to the Cloud
- Focus on business value, not infrastructure: storing big data in the cloud eliminates the need to build and maintain infrastructure, so you can use engineering resources to develop new functionality that is tied to business value.
- Lower engineering costs: you can build data pipelines more efficiently with cloud-based tools. The data pipeline is often pre-integrated, so you can get a working solution without investing hundreds of hours in data engineering.
- Use managed services to scale up: the cloud provider can manage scaling for you. Some data lake cloud services, such as Amazon S3 and Athena, scale transparently, so you don’t need to add machines or manage clusters.
- Agile infrastructure: cloud services are flexible and offer on-demand infrastructure. If new use cases come up for your data lake, you can re-think, re-engineer and re-architect your data lake more easily.
- Up-to-date technologies: cloud-based data lake services are updated by the provider, which makes the latest technology available. You can also add new cloud services as they become available, without changing your architecture.
- Reliability and availability: cloud providers work to prevent service interruptions by storing redundant copies of data on different servers, with availability spanning several data centers. Amazon S3, for example, is designed for “11 nines” (99.999999999%) of durability for your data.
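To put the “11 nines” figure in perspective, the arithmetic can be sketched in a few lines of Python. This assumes the figure is read as an average annual per-object durability, so the expected loss rate is 1e-11 per object per year. The object count below is just an example.

```python
# 1 - 0.99999999999, written directly to avoid floating-point rounding noise.
ANNUAL_LOSS_RATE = 1e-11


def expected_annual_losses(object_count, loss_rate=ANNUAL_LOSS_RATE):
    """Average number of objects expected to be lost per year."""
    return object_count * loss_rate


# Example: a lake holding ten million objects.
losses_per_year = expected_annual_losses(10_000_000)
years_between_losses = 1 / losses_per_year
```

Under that reading, ten million stored objects work out to roughly one lost object every 10,000 years on average. Durability is a separate measure from availability: it describes whether data survives, not whether it is reachable at any given moment.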
Cloud Data Lake Architectures: the Big Three
To make things more concrete, let’s look at data lake offerings provided by each of the three leading infrastructure-as-a-service providers (according to Gartner).
AWS Data Lake
Amazon Web Services offers a number of data lake solutions, including Amazon Simple Storage Service (Amazon S3) and DynamoDB, a low-latency NoSQL database used for some high-end data lake scenarios. Data ingestion tools like Kinesis Streams, Kinesis Firehose, and Direct Connect enable you to transfer large amounts of data to S3.
The AWS suite of tools also includes a database migration service to facilitate the transfer of on-premise data to the cloud, as well as a data lake reference implementation. Elasticsearch is provided as a managed service, offering a simplified process for querying log data, and Athena provides serverless interactive queries. You can customize these tools using AWS CloudFormation scripts.
Another way to enhance a data lake on AWS is by using AWS Lambda to inject metadata into S3 data as it is being loaded (see Amazon’s reference architecture).
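As a rough illustration of the Lambda idea, the sketch below reacts to an S3 event notification and attaches user metadata to the new object by copying it onto itself. It is not Amazon’s reference implementation. The metadata fields and the `build_metadata` and `tag_object` helpers are hypothetical, and the handler assumes `boto3` is available, as it is in the Lambda Python runtime.

```python
from urllib.parse import unquote_plus


def build_metadata(record):
    """Derive simple, illustrative metadata from one S3 event record."""
    s3_info = record["s3"]
    key = unquote_plus(s3_info["object"]["key"])  # keys arrive URL-encoded
    return {
        "bucket": s3_info["bucket"]["name"],
        "key": key,
        "metadata": {
            "source-prefix": key.split("/", 1)[0],
            "size-bytes": str(s3_info["object"].get("size", 0)),
            "ingested-at": record.get("eventTime", ""),
        },
    }


def tag_object(s3_client, record):
    """Replace the object's user metadata by copying it onto itself."""
    info = build_metadata(record)
    s3_client.copy_object(
        Bucket=info["bucket"],
        Key=info["key"],
        CopySource={"Bucket": info["bucket"], "Key": info["key"]},
        Metadata=info["metadata"],
        MetadataDirective="REPLACE",
    )
    return info


def lambda_handler(event, context):
    # Imported here so the helpers above can be exercised without AWS access.
    import boto3

    s3_client = boto3.client("s3")
    return [tag_object(s3_client, r) for r in event.get("Records", [])]
```

Because the copy itself creates a new object version, the trigger should be limited to upload events (not copy events) so the function does not invoke itself in a loop. Many setups write the metadata to a separate index instead of rewriting the object.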
Azure Data Lake
Microsoft Azure provides a data lake architecture that consists of two layers, one for storage and one for analysis. The storage layer, called Azure Data Lake Store (ADLS), has unlimited storage capacity and can store data in almost any format. It is compatible with HDFS, which makes it easier to migrate existing Hadoop data.
The analytics layer comprises Azure Data Lake Analytics and HDInsight, which is a cloud-based analytics service. You can write your own code to customize analysis and data transformation tasks. You can also use tools like Microsoft's Analytics Platform System to query datasets.
Google Data Lake
The Google Cloud Platform (GCP) provides its own data lake offering. Google Cloud Storage is a general purpose storage service that provides lower cost options, which are suitable for data lake scenarios. On top of this storage layer, you can use GCP tools like Cloud Pub/Sub, Dataflow, Storage Transfer Service and the Transfer Appliance to ingest data into your data lake.
On the analytics side, the GCP offering is less mature than those of the other providers. GCP offers a managed Hive service as part of Cloud Dataproc, and also lets you use Google BigQuery to run high performance queries against large data volumes. For data mining and exploration, Google suggests using Cloud Datalab, which includes a managed Jupyter Notebook service.
Drawbacks of Cloud Data Lakes
Watch Your Storage Costs
The primary downside of moving your data lake to the cloud is storage costs. In the cloud, you pay for storage continuously, based on how much data you keep and for how long. Providers like Amazon offer multiple storage options at different price points, so it’s possible to optimize, but the fact remains that storage will become an ongoing and growing expense, given expanding data volumes.
In terms of the “sticker price” of storage alone, it will generally be more cost effective to buy local storage once and store your data there (though this will often not be the case once you consider the total cost of ownership, including engineering and IT costs). Many organizations managing huge data volumes are exploring hybrid cloud strategies, which let them keep some storage on-premises while keeping other data, typically the data that requires more frequent analysis, in the cloud.
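The “ongoing and growing expense” point can be sketched with a small calculation. Every number used with these functions is a hypothetical placeholder rather than a real price, and the model looks at sticker price only: it ignores hardware refresh, engineering and IT costs.

```python
def cumulative_cloud_cost(initial_tb, monthly_growth, price_per_tb_month, months):
    """Total storage spend over `months`, billed monthly on the volume held."""
    total = 0.0
    volume = initial_tb
    for _ in range(months):
        total += volume * price_per_tb_month
        volume *= 1.0 + monthly_growth  # data volume grows every month
    return total


def months_until_exceeds(one_time_cost, initial_tb, monthly_growth,
                         price_per_tb_month, max_months=600):
    """First month in which cumulative cloud spend passes a one-time purchase."""
    total = 0.0
    volume = initial_tb
    for month in range(1, max_months + 1):
        total += volume * price_per_tb_month
        if total > one_time_cost:
            return month
        volume *= 1.0 + monthly_growth
    return None  # not reached within max_months


# Hypothetical inputs: 100 TB to start, 20 per TB-month, 3 years.
flat_cost = cumulative_cloud_cost(100, 0.0, 20.0, 36)
growing_cost = cumulative_cloud_cost(100, 0.03, 20.0, 36)
break_even_month = months_until_exceeds(10_000, 100, 0.0, 20.0)
```

With growth set to zero the bill is simply volume times price times months. Any positive growth rate makes the cumulative figure larger and pulls the break-even month earlier, which is why the comparison needs revisiting as data volumes expand.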
Self-Service Analytics: A Missing Piece in Cloud Data Lake Offerings
Analytics is the primary reason most organizations set up a data lake. And while data lakes in the cloud are easier to set up and maintain, connecting the dots from data ingested to a data lake, to a complete analytics solution, remains a challenge.
The cloud simplifies many aspects of data infrastructure and provides convenient managed services, but simply moving all your data to the cloud will not magically remove the complexity associated with analytics. This holds true whether you choose a database or data lake approach.
Running your data lake in the cloud allows you to rely on secure and robust storage by providers such as AWS and Azure, which removes the need to constantly fiddle with on-prem Hadoop clusters. However, none of the cloud providers currently offer a way to ‘operationalize’ the data stored in your lake. They don’t provide self-service options for:
- Running analytical queries on the data lake
- Building data-driven apps on top of the data lake
- Organizing or mining data in the data lake
These tasks remain complex and will still require you to stitch together code-intensive components, such as Spark, MapReduce, and Apache NiFi.
However, as the big data ecosystem matures, a new breed of self-service tools is emerging. Upsolver’s data lake platform falls into this category. These tools provide an actual self-service experience when analyzing data stored in cloud data lakes.
We believe more organizations will seek self-service analytics solutions, as data lakes are used in a broader range of organizations and use cases. To learn more, you can read Upsolver CTO Yoni Iny’s in-depth technical whitepaper: A Roadmap to Self Service Data Lake in the Cloud.
Move to the Cloud and Set Up Analytics Instantly with Upsolver
Modern cloud-based data lake architectures provide managed infrastructure. However, you will still have to invest significant effort, and rely on specialized data engineers, to derive insights and set up analytics for data lake data.
Upsolver is an end-to-end platform for ingesting data into a data lake and enabling standard, SQL-based analytics, including real-time analytics. Specially designed for streaming data, Upsolver helps you organize data in your data lake in a way that facilitates flexible, high performance analysis with tools like Amazon Athena. It’s the only way to get analytics set up in your data lake within minutes.