MySQL CDC and Database Replication for the Data Lake Age

Learn how to create queryable tables from your operational databases with Upsolver Change Data Capture (CDC).

Data replication has been around almost since the birth of databases. Early uses included integration between heterogeneous systems and data centralization (via the enterprise data warehouse). 

Change Data Capture (CDC) is not a recent invention, either. It was devised to identify and capture changes happening in a database. Much has already been said about it, and probably a lot more will be, but the main idea remains the same: CDC is an efficient way to unsilo and democratize an organization’s data, so the enterprise can derive business value from it rather than just hoarding it or keeping a copy to please the auditors. Heck – in a lot of cases CDC has even brought enterprise teams closer together: instead of fighting over the scope of responsibilities, people change their mindset and start collaborating to maximize the value of their shared data.

What is fresh is the paradigm shift from a locked and heavily guarded data warehouse (DW) to a data lake (DL), or the two combined in a data lakehouse architecture. Moving from batch-oriented data distribution to streaming data pipelines opens up even more ways of leveraging this data, with both change propagation and event-based architectures built on top of the same stream. A CDC capability thus becomes essential to any modern data platform.

In this post we discuss the different CDC approaches and patterns, and cover key Upsolver CDC capabilities that make it an indispensable tool in high-performance data pipelines. 

Why replicate data?

One of the most compelling and frequent use cases for data replication is, of course, streaming changes from OLTP systems into a single source of truth for downstream manipulation or distribution, while protecting OLTP system SLAs by shielding those systems from intrusive analytics queries. This was the primary use case during the good old DW days, and it remains relevant in the age of the data lake / lakehouse. The approach appeals to mature organizations with legacy core systems because the data, methodology, and expected outcomes are all usually well known in advance, which helps organizations plan ahead and significantly shorten the time-to-market of data-related projects. Financial organizations are good examples: they tend to have multiple siloed data sources, yet must provide data-driven insights to numerous lines of business while maintaining a secure and fraud-free environment.

Companies born in the digital age face a different challenge. Software applications are released in rapid development cycles, and more and more data is generated every day as customers use the apps. Development teams instantiate new databases and create new tables daily, and users immediately adopt new features, but the business may lag behind in analyzing these new and rapidly multiplying data points. Frequently, companies must replicate the contents of entire OLTP databases before they find an actual use for the data, so that data scientists can mine it while minimizing the impact on the source systems.

For example, in the gaming industry, improving the game experience is an iterative process; each successive iteration generates more data, potentially dictating a different approach to data for the next iteration.

Comparing database replication approaches

CDC is one of multiple data replication methods, and there are several flavors of CDC. They are all oriented to a single purpose – to determine and track what has changed so action can be taken using the changed data. We can of course react based only on the changes, but to obtain a complete view of how the data evolves over time we need a data baseline (“snapshot”) and then to apply to it all the changes that occurred within a given timeframe.

But for now let’s put the data baseline alignment aside and compare different change data replication approaches:

Delta by comparison

Extract the changes by unloading the current snapshot of a source table and comparing it to an earlier snapshot.

Pros:

Gets the job done eventually.

Cons:

  • Slow
  • Compute-intensive
  • Labor-intensive
  • High impact on the source system
  • Schema is not provided within the data, but rather as a separate external artifact, making it difficult to parse the data automatically
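
As a rough illustration, here is what a delta-by-comparison job might run against two full snapshots of a hypothetical orders table (table and column names are made up for the example):

    -- New or changed rows: present in today's snapshot but missing or different yesterday.
    -- In practice, NULL-safe comparison (<=>) is needed for nullable columns.
    SELECT t.*
    FROM   orders_snapshot_today t
    LEFT   JOIN orders_snapshot_yesterday y ON y.order_id = t.order_id
    WHERE  y.order_id IS NULL
       OR  (t.status, t.amount) <> (y.status, y.amount);

    -- Deleted rows: present yesterday but gone today.
    SELECT y.*
    FROM   orders_snapshot_yesterday y
    LEFT   JOIN orders_snapshot_today t ON t.order_id = y.order_id
    WHERE  t.order_id IS NULL;

Running such full-table comparisons at scale is exactly where the slowness and compute cost listed above come from.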

Timestamp-based CDC

Select all records whose “last modified” timestamp is greater than the high-water mark passed to the query.

Pros:

  • Faster and less labor-intensive than the “delta” approach
  • Easy to extract the changes

Cons:

  • Compute-intensive
  • Does not support deletes
  • Requires an extra timestamp column and an associated index for optimal performance
  • Schema must be obtained via manual DB queries
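
A minimal sketch of the extraction query, assuming the source table carries a last_modified column and the pipeline persists the high-water mark between runs:

    -- ? is bound to the highest last_modified value captured by the previous run.
    SELECT *
    FROM   orders
    WHERE  last_modified > ?
    ORDER  BY last_modified;

    -- The extra index that keeps this scan cheap (the cost noted in the cons above).
    CREATE INDEX idx_orders_last_modified ON orders (last_modified);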

Trigger-based CDC

Database triggers record every insert, update, and delete into a dedicated change-tracking table, which the replication process then reads.

Pros:

  • Easy to extract the changes
  • Supports D/U/I

Cons:

  • Compute-intensive – each DML statement on the source table pays an extra penalty for the corresponding change-tracking table insert, plus trigger execution overhead
  • Labor-intensive to maintain the setup
  • Very high impact on the source system
  • Schema must be obtained via manual DB queries
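
For completeness, here is a minimal MySQL sketch of the trigger setup described above; table and column names are illustrative only:

    -- Change-tracking table the triggers populate.
    CREATE TABLE orders_changes (
      change_id  BIGINT AUTO_INCREMENT PRIMARY KEY,
      op         CHAR(1)   NOT NULL,                           -- 'I', 'U' or 'D'
      order_id   BIGINT    NOT NULL,
      changed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- One trigger per DML type; every write to the source table pays for an extra insert.
    CREATE TRIGGER trg_orders_ins AFTER INSERT ON orders
    FOR EACH ROW INSERT INTO orders_changes (op, order_id) VALUES ('I', NEW.order_id);

    CREATE TRIGGER trg_orders_upd AFTER UPDATE ON orders
    FOR EACH ROW INSERT INTO orders_changes (op, order_id) VALUES ('U', NEW.order_id);

    CREATE TRIGGER trg_orders_del AFTER DELETE ON orders
    FOR EACH ROW INSERT INTO orders_changes (op, order_id) VALUES ('D', OLD.order_id);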

Transaction Log-based CDC

The database’s transaction log API returns a list of table data changes that happened after a given base timestamp (or LSN/RBA/SCN/… – the relative offset of a log record within the log).

Pros:

  • Easy to extract the changes
  • Supports D/U/I
  • Contains schema within the data
  • Non-intrusive

Cons:

  • Requires access to DB Transaction Log
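
On MySQL the transaction log is the binary log. A quick, non-intrusive check that the server is configured the way log-based CDC tools generally expect looks like this:

    -- Binary logging must be enabled (log_bin is set at server startup, e.g. in my.cnf).
    SHOW VARIABLES LIKE 'log_bin';

    -- Row-based format captures the full before/after image of each changed row.
    SHOW VARIABLES LIKE 'binlog_format';   -- expect 'ROW' for CDC

    -- List the binary log files a CDC reader would consume.
    SHOW BINARY LOGS;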

Architecture

Now that we’ve reviewed CDC methods, let’s look at the Upsolver approach. Upsolver is a low-code/no-code data lake automation platform. It enables you to ingest data into a data lake to create a single, immutable source of truth, which can:

  • serve as a final destination;
  • become a central data hub for direct access by query engines; or
  • serve purpose-built data stores with pre-processed data optimized for query performance (both in size and in format). 

You can perform complex transformations on the incoming data, including aggregations, joins, nested queries, and the like. All of this is accessible through a visual SQL IDE that lets you design pipelines using interactive operations or an assisted SQL code editor. The Upsolver platform automatically handles data lake performance optimizations such as compaction, partitioning, columnarization, and orchestration.
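
As a flavor of the kind of transformation such a pipeline might apply, here is a generic SQL sketch (illustrative only, not Upsolver’s exact syntax) that aggregates raw order events and joins in reference data before writing to a query-optimized target:

    -- Illustrative only: daily revenue per region computed from the raw CDC zone.
    INSERT INTO analytics.daily_revenue
    SELECT o.order_date,
           c.region,
           SUM(o.amount) AS total_amount,
           COUNT(*)      AS order_count
    FROM   raw_zone.orders o
    JOIN   raw_zone.customers c ON c.customer_id = o.customer_id
    GROUP  BY o.order_date, c.region;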

When you use Upsolver for CDC, the system takes a complete snapshot of the source data and stores it in the data lake. Once the snapshot is complete, a new CDC stream starts replicating changes to the source data into the data lake’s raw zone. From there you can transform the data and write it to the desired data store. Transformations range from simple data cleansing and normalization to complex operations that ensure consistent Operational Data Store (ODS)- or Historical Data Store (HDS)-like results on the target system. The data lake itself can also be the target, thanks to Upsolver’s ability to perform upserts and deletes directly in the data lake.

Replication Patterns

Now that we’ve covered the capture methods and architecture, let’s explore common data replication patterns. Using the right tool for the job greatly improves the experience, and data replication patterns are those tools:

  • Unidirectional replication
  • Bi-directional replication
  • Multi-directional replication
  • Consolidation
  • Broadcast
  • Cascading data propagation

Bi-directional and multi-directional replication are patterns implemented by specialty data integration tools, used mostly for database mirroring where clustering is unavailable or impractical given the technology stack. The other patterns are used in more generic scenarios to distribute data 1-to-1, 1-to-N, or N-to-1.

Upsolver supports the following patterns:

1. Unidirectional

Description: Read changes from a single source system and write them to a single target system

Usage: Propagate source changes to a single target for further processing (HDS, ODS, event-based architecture)

2. Broadcast / Cascading

Description: Read changes once and apply them to multiple targets

Usage: Propagate source changes to multiple targets for further processing to enable multiple usages in parallel.

3. Consolidation

Description: Replicate changes from identical or similar schemas on different sources to a single schema on a single target

Usage: Consolidate changes from multiple hosted systems into a single set of tables to facilitate simpler analysis.
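
A common trick when consolidating identical schemas from many hosted databases is to tag each row with its origin so downstream queries can still tell the sources apart. A hypothetical sketch:

    -- Illustrative only: union per-tenant change streams into one consolidated table.
    INSERT INTO lake.orders_consolidated
    SELECT 'tenant_a' AS source_db, o.* FROM tenant_a_changes o
    UNION ALL
    SELECT 'tenant_b' AS source_db, o.* FROM tenant_b_changes o;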

How Upsolver makes log-based CDC simple and powerful

This section explores the differentiating features of Upsolver that are so highly valued by our customers: 

Dynamically Split Tables on Key Field

Upsolver’s specialty is data stream processing. We’ve leveraged this to create a unique, simple, and powerful CDC feature: Upsolver can dynamically split a single data stream into multiple tables based on the values of a key field, which can either be a field in the original data stream or one added as a function of inbound data/metadata. Upsolver customers in data-intensive industries such as gaming and digital ads apply this capability in non-CDC use cases, but it applies equally well to CDC streams.
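
Conceptually, splitting on a key field is equivalent to fanning one change stream out into one filtered output per key value. A hypothetical sketch (not Upsolver syntax), where source_table is the key field carried on each change record:

    -- Illustrative only: route one CDC stream into per-table outputs.
    INSERT INTO lake.orders    SELECT * FROM cdc_stream WHERE source_table = 'orders';
    INSERT INTO lake.customers SELECT * FROM cdc_stream WHERE source_table = 'customers';
    INSERT INTO lake.payments  SELECT * FROM cdc_stream WHERE source_table = 'payments';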

UPSERTs on the Data Lake

Traditional CDC tools can land database changes as raw data in a data lake, but they cannot maintain table consistency in the lake unless you build a separate process. Merging updates and deletes is challenging because the data in the lake isn’t indexed – it is only partitioned by time. Upsolver automates this process.
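
On table formats that support it, applying a CDC change set to a lake table is typically expressed as a MERGE keyed on the primary key; Upsolver automates the equivalent work. A hedged sketch (exact clause support varies by engine, and the names are illustrative):

    -- Illustrative only: apply the latest change per key to the lake table.
    MERGE INTO lake.orders AS t
    USING staged_changes  AS s
    ON    t.order_id = s.order_id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED                THEN UPDATE SET t.status = s.status, t.amount = s.amount
    WHEN NOT MATCHED AND s.op <> 'D' THEN
      INSERT (order_id, status, amount) VALUES (s.order_id, s.status, s.amount);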

Rewind-Replay 

With Upsolver, raw data ingested from the source system can be stored indefinitely or retained for a predetermined duration. When using data replication, Upsolver keeps the raw data – both the initial baseline snapshot as well as all changes since that snapshot. This is another paradigm shift when compared to other data integration products on the market.  It enables Upsolver (or any other solution accessing the democratized S3 storage) to “Rewind-Replay” all of the changes from the initial “snapshot” as many times as necessary when updating the data pipeline logic or when directing the data into an alternative output/target for different use cases. And Upsolver accomplishes this without accessing the data on its original database source. 

Schema Evolution

No modern CDC discussion is complete without mentioning schema evolution. Upsolver CDC detects and replicates schema changes in the source system. With a simple SELECT * statement (yes – contrary to conventional best practices, Upsolver’s SELECT * is both recommended and required to implement schema evolution successfully), any columns created while the CDC process is running are automatically reflected in the data lake replica. This is possible thanks to Upsolver’s schema-on-read capability. If a column is dropped in the source database, it is retained on the target and its value is nullified for all new records.
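
To illustrate why SELECT * matters here (a simplified sketch, not exact Upsolver syntax): an explicit column list pins the schema, while SELECT * lets new source columns flow through automatically.

    -- Pins the schema: a column added to the source later never reaches the lake.
    SELECT order_id, status, amount FROM orders_stream;

    -- Schema-on-read: a new source column (say, "discount") appears in the
    -- data lake replica with no pipeline change.
    SELECT * FROM orders_stream;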

Summary

In this article we’ve discussed high-level concepts related to CDC and drilled into Upsolver’s unique capabilities for CDC-based database replication. In the next blog post dedicated to CDC we will walk through setting up CDC in Upsolver and show how easy it is to unsilo and democratize your organization’s data while optimizing the overall costs of data operations.

Next steps

Want to learn more about Upsolver CDC?

Start for free with the Upsolver Community Edition.

Build working solutions for stream and batch processing on your data lake in minutes.

Get Started Now