The following is an excerpt from our recent white paper The Roadmap to Self-service Data Lakes in the Cloud. You can get the full guide right now for free, if you’re so inclined.
Databases are great. Organizations have been using them since the 1970s, there are dozens of open-source and proprietary products to choose from, and there’s a fairly mature ecosystem of BI, ETL and data visualization tools that allow users of various skill levels to access and understand the data stored in them using SQL or visual interfaces.
Because databases are so ingrained into the way most organizations think about data, when it comes time to build a streaming data architecture, the natural tendency is to assume this is just a matter of selecting the right database to handle the increased workloads that come with stream processing.
Let’s look at a few of the challenges you’re likely to encounter when using a database for streaming data, and how a data lake approach can help address them.
Why Streaming Data is Inherently Different from Non-streaming Data
Streaming data is not merely big data; it is fundamentally different data, produced by different sources and requiring its own set of tools. Accordingly, the solution is rarely as simple as getting a bigger database.
Traditional organizational data originates in operational systems of record such as ERP, CRM, finance and HR applications; streaming data is typically produced by sources such as industrial sensors, clickstreams, servers and user app activity. This creates several core differences between streaming and traditional data, each of which poses challenges when trying to work with this data in a traditional relational database model.
1. Non-static data
A traditional dataset typically captures a clearly delineated time and space, e.g. the number of items currently in stock, or the new employees hired in the past year. Streaming data, by contrast, is generated constantly and delivered as a continuous stream of small files capturing event-based interactions on a second-by-second basis, e.g. servers relaying their current state of operation, or a log of user activity on a mobile app. Querying this data in batch ETL processes means imposing arbitrary start and end points on the stream, which creates challenges around data freshness, missing records and synchronization.
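To make the windowing problem concrete, here is a minimal Python sketch. The events, timestamps and batch cutoff are illustrative inventions, not taken from the paper; the point is only to show how a batch job keyed to arrival time can silently miss a record that occurred inside its window but arrived late:

```python
from datetime import datetime

# Illustrative events: each carries the time it occurred (event time)
# and the time it reached the pipeline (arrival time).
events = [
    {"id": 1,
     "event_time": datetime(2023, 1, 1, 0, 59, 50),
     "arrival_time": datetime(2023, 1, 1, 0, 59, 55)},
    {"id": 2,
     "event_time": datetime(2023, 1, 1, 0, 59, 58),
     "arrival_time": datetime(2023, 1, 1, 1, 0, 10)},  # arrives late
]

# A batch ETL job that runs at 01:00 and processes "everything that has
# arrived so far" misses event 2, even though it occurred before 01:00.
batch_cutoff = datetime(2023, 1, 1, 1, 0, 0)
processed = [e["id"] for e in events if e["arrival_time"] < batch_cutoff]
missed = [e["id"] for e in events
          if e["event_time"] < batch_cutoff and e["arrival_time"] >= batch_cutoff]

print(processed)  # [1]
print(missed)     # [2]  -- a fresh record the batch silently dropped
```

The missed record either disappears or lands in the next batch, which is exactly the freshness and synchronization problem described above.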
2. Unstructured versus tabular
Because of the high velocity at which event-based data is produced, and its highly disparate nature, it is stored as objects (e.g. JSON files) rather than tables. The relational databases that power almost every enterprise and every application are built on tabular storage; using them to store unstructured data requires lengthy cleansing and transformation processes, creating engineering bottlenecks on ingest.
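A short sketch of the mismatch, using two invented clickstream events (the field names are hypothetical): the records carry different fields and nested structure, so loading them into a fixed relational schema forces you to pick columns up front and flatten or null out whatever doesn’t fit:

```python
import json

# Two illustrative events; the fields differ between them and "context"
# is nested, which has no direct tabular equivalent.
raw_events = [
    '{"user": "a1", "action": "click", "context": {"page": "/home", "referrer": "ad"}}',
    '{"user": "b2", "action": "purchase", "amount": 19.99, "context": {"page": "/checkout"}}',
]

# Fitting these into a fixed table means choosing the columns in advance
# and flattening (or discarding) everything else.
columns = ["user", "action", "amount", "context_page"]
rows = []
for line in raw_events:
    e = json.loads(line)
    rows.append((e.get("user"),
                 e.get("action"),
                 e.get("amount"),              # missing -> NULL
                 e.get("context", {}).get("page")))

print(rows[0])  # ('a1', 'click', None, '/home')
print(rows[1])  # ('b2', 'purchase', 19.99, '/checkout')
```

Every new event shape means revisiting this mapping, which is the ingest bottleneck the paragraph describes.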
3. Experimental versus standardized use cases
Common data sources lend themselves to tried-and-tested forms of analysis and reporting; the potential of streaming data is unlocked through exploratory techniques, predictive modeling and machine learning. Applying these analyses requires broader and more flexible access to data, and can be hindered by the need to structure data according to existing enterprise data warehouse specifications.
4. Storage is a major cost factor
Databases can usually get away with using expensive storage with multiple redundancies, since the size of the data is relatively small and the compute and licensing costs outweigh the cost of storage. When you’re storing big data in a database where storage and compute are closely coupled (such as Amazon Redshift), storage costs can easily dwarf all other project costs.
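As a rough back-of-the-envelope illustration of the coupling effect, the per-GB prices below are placeholders, not quotes for any specific service; the shape of the comparison, not the exact numbers, is the point:

```python
# Illustrative monthly storage prices per GB; real pricing varies widely
# by vendor, tier and region.
coupled_db_storage_per_gb = 0.25   # storage bundled with compute in a warehouse
object_storage_per_gb = 0.023      # commodity cloud object storage

data_gb = 100_000  # 100 TB of accumulated event data

coupled_cost = coupled_db_storage_per_gb * data_gb
object_cost = object_storage_per_gb * data_gb

print(f"coupled storage: ${coupled_cost:,.0f}/month")
print(f"object storage:  ${object_cost:,.0f}/month")
```

At these assumed rates the coupled option costs roughly an order of magnitude more per month, and the gap widens as the stream accumulates.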
Why Data Lakes are a Popular Alternative to Databases for Event Streaming
The unique characteristics of streaming data have led organizations to accept the limitations of relational databases for storing and analyzing streams, and to turn instead to data lake architectures, which have grown in prominence in recent years.
In this framework, the organization forgoes the attempt to ‘tame’ streaming data as it is being ingested, adopting instead the principle of store now, analyze later: records are stored in their raw, unstructured and schema-less form in a distributed file system such as HDFS, or in cloud object storage such as Amazon S3. The complicated process of transforming the data into a workable form is postponed until it is actually needed, whether to power a new business application or to answer an analytical query.
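The store-now-analyze-later principle can be sketched in a few lines of Python. This is a toy illustration, with a local temp directory standing in for HDFS or S3 and invented field names; ingest just appends raw records, and a schema is applied only at read time, when a question actually arises:

```python
import json
import os
import tempfile
from datetime import date

# A local temp directory stands in for HDFS or cloud object storage.
LAKE_ROOT = tempfile.mkdtemp()

def store_raw(event: dict) -> str:
    """'Store now': append the raw event, untouched, to a date-partitioned path."""
    partition = os.path.join(LAKE_ROOT, f"dt={date.today().isoformat()}")
    os.makedirs(partition, exist_ok=True)
    path = os.path.join(partition, "part-0000.jsonl")
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
    return path

def count_page_views(path: str, page: str) -> int:
    """'Analyze later': parse the raw records and project out only the
    fields this particular question needs (schema-on-read)."""
    with open(path) as f:
        return sum(1 for line in f if json.loads(line).get("page") == page)

path = store_raw({"user": "a1", "page": "/home"})
store_raw({"user": "b2", "page": "/pricing"})
print(count_page_views(path, "/home"))  # prints 1
```

Note that no schema decision was made at ingest; a different future question would simply parse the same raw files and project different fields.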
Data lakes present several advantages for streaming data architectures:
- Easy to ingest data: the data lake approach removes the need for the complex and expensive ETL processes that are a prerequisite for storing streaming data in a relational database; instead, a relevant subset of the data can be transformed and wrangled as needed for the particular use case being developed. This removes much of the upfront cost of working with big data.
- Ability to support a large range of use cases: Data lakes provide vastly greater flexibility for developing new data-driven applications, especially for innovative and exploratory analyses that rely on access to data in its original form. Transformations into a rigid schema invariably discard some of the connections within the data, so exploratory analysis often hits a wall when the information a specific question requires was never retained; the data lake approach sidesteps this issue by storing all the raw data in its original form, making it easier to answer new questions from historical data.
- Reduced storage costs: In data lakes, storage is decoupled from compute and can rely on inexpensive object storage in the cloud or on-premises. Since compute resources are dramatically more costly than storage, data lakes can significantly reduce infrastructure costs when working with large amounts of data compared to storing the same amount of data in a database. This comes in addition to the savings associated with removing the need for data transformation and wrangling upon ingest.
Why Most Data Lake Projects Fail, and What to Do About It
Nevertheless, data lakes are not a silver bullet; while they serve to reduce complexity in storing data, they also introduce new challenges around managing, accessing and analyzing data, often becoming massive engineering undertakings without clear results for the business. Many companies can circumvent these obstacles with a self-service streaming data platform.
We cover this topic and others in the full version of this article, which spans over 30 pages of in-depth analysis. We will be publishing the next chapter in two weeks’ time; but if you don’t feel like waiting, you can get the full copy right now by following this link.