Change Data Capture is not a new concept, and has been a part of database and data warehouse management for nearly as long as they have been around. The basic concept is that software processes are used to track and identify changes to data from a defined point in time, and this changed data is replicated to another system, such as a data warehouse, allowing the enterprise to act upon those changes. Rather than acting upon an entire dataset, the database or data warehouse can limit resource expenditure to a subset of data – the data that has changed. By focusing efforts on the changed data, efficiency is increased, both in terms of system resources for obtaining the data, and in terms of obtaining actionable data quickly.
Whether capturing streaming data, using an ETL (extract, transform, load) solution, or any high velocity data management solution, Change Data Capture (CDC) can be an excellent fit to achieve low-latency, reliable, and scalable data replication. By capturing a baseline snapshot, monitoring data delivery via logs, and replicating changes as they occur, system resources are not wasted on rewriting or transferring unnecessary data. CDC can improve efficiency for cloud migration, data warehousing, database queries, and more.
Besides efficiency, CDC is also a way to unsilo and democratize an organization’s data, so that business value can be derived, rather than simply gathering the data for retention purposes. By focusing on data changes, data validation and transformation efforts can be more focused, as well as bringing teams together on a shared definition of what constitutes data value.
In the simplest form, one computer system obtains or has data changed from a previously defined point of time, and another system (or sub-system) must take action based on this change in data. In other words, there is a source and a target, even if technically it can be physically within the same system. Change Data Capture is typically associated with SQL or MySQL Server workloads, but can theoretically be applied to any type of data.
The method by which changes are logged can be a key difference in Change Data Capture methodology. Timestamps, table differencing, event programming/ triggers, and log scanners are just some of the ways that data changes can be detected.
This method compares a snapshot of table data with a previous snapshot, row by row, column by column, to determine system changes. This is an effective methodology, but time consuming and resource intensive.
CDC for SQL or other tabular data often have columns that serve as a timestamp, such as Last_Updated or Date_Modified. For a Change Data Capture SQL server this can serve as a marker for changed data – ie, data received or changed beyond this timestamp can be replicated and/or acted upon by the target system.
Changes to an application are often triggered, and such triggers can also serve as identifiers for CDC. Using such triggers typically involves programming, and can provide more accurate and desirable results. Triggers can be things like ‘only after COMMIT’ or only after certain columns are changed to certain specified values, whatever the target system is programmed to look for. For SQL based CDC this might require changing the schema to add a datetime field to indicate when the record was created or updated, a version number, or a boolean status indicator.
Transaction Log based changes often utilize a time-stamp as a basis for CDC, or a relative offset of a log record within the log. This method is unobtrusive, and makes it simple to extract changes, and typically includes the schema within the data, making it simpler to reformat the data for later use. However, this would typically require access to the transactional log, as well as an API to retrieve the data into a usable format.
Upsolver favors a transactional log approach, for the reasons mentioned above: low impact, simple extraction, and that the schema is included, making reformatting data all the simpler. While the destination can be nearly anything, Upsolver is optimized for data lakes, allowing you to create a single, immutable source of truth for your organization. Data lakes ingest data in any number of formats, retaining the structure of the data as it is received.
Upsolver takes a full scale snapshot of the data source, and stores it in the data lake. Once complete, a CDC stream is begun, replicating changed data from the source to the data lake’s raw zone.