Change data capture (CDC) is the process of capturing changes made at the data source and applying them throughout the enterprise. Because CDC deals only with data that has changed, it minimizes the resources required for ETL (extract, transform, load) processes. The goal of CDC is to keep downstream systems synchronized with the source.
There are four common methods for implementing CDC: timestamps, triggers, snapshot comparison, and log scraping.
A timestamp column in the source table records the date and time of the last change to each row, whether it’s a new entry or an update to an existing row. Each ETL run then extracts only the rows whose timestamp is later than the previous run.
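The timestamp approach can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production design: the customers table, its updated_at column, and the extract_changes helper are all hypothetical names, and timestamps are assumed to be stored as ISO-8601 strings so that string comparison matches chronological order.

```python
import sqlite3


def extract_changes(conn, last_sync):
    """Return rows whose updated_at is later than last_sync (ISO-8601 string)."""
    cur = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_sync,),
    )
    return cur.fetchall()


# Hypothetical source table with a last-modified timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Ada", "2024-01-01T09:00:00"),
        (2, "Grace", "2024-01-02T10:30:00"),
        (3, "Linus", "2024-01-03T08:15:00"),
    ],
)

# Only rows changed since the previous ETL run are extracted.
changed = extract_changes(conn, "2024-01-01T12:00:00")
```

After each run, the ETL job would store the latest timestamp it saw and pass it as last_sync the next time.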
Database triggers are added to the source tables so that all changes (inserts, updates, deletes) are replicated in a second set of tables used specifically for the CDC process. Only the “changed” records captured in the CDC tables are used to update the data warehouse during the ETL process.
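As a rough sketch of the trigger approach, the following uses SQLite through Python's sqlite3 module. The orders source table, the orders_cdc change table, and the single-letter operation codes are assumptions made for illustration. Other databases use different trigger syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical source table.
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);

    -- Hypothetical CDC table that receives one row per change.
    CREATE TABLE orders_cdc (
        change_id INTEGER PRIMARY KEY AUTOINCREMENT,
        op TEXT,
        order_id INTEGER,
        amount REAL
    );

    CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
        INSERT INTO orders_cdc (op, order_id, amount)
        VALUES ('I', NEW.id, NEW.amount);
    END;

    CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
        INSERT INTO orders_cdc (op, order_id, amount)
        VALUES ('U', NEW.id, NEW.amount);
    END;

    CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
        INSERT INTO orders_cdc (op, order_id, amount)
        VALUES ('D', OLD.id, OLD.amount);
    END;
    """
)

# Ordinary writes to the source table; the triggers record each one.
conn.execute("INSERT INTO orders VALUES (1, 10.0)")
conn.execute("UPDATE orders SET amount = 12.5 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")

# The ETL process reads only the captured changes, in order.
changes = conn.execute(
    "SELECT op, order_id, amount FROM orders_cdc ORDER BY change_id"
).fetchall()
```

Because the delete trigger reads the OLD row, deletions are captured along with inserts and updates.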
Snapshot comparison is a simple technique in which regularly scheduled table exports (“snapshots”) or staging tables are used to identify changed records. By calculating the difference between the current and previous snapshots, all new, updated, or deleted records can be captured and loaded into the data warehouse.
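The difference calculation can be sketched as follows, assuming each snapshot has been loaded into a dictionary keyed by primary key. The diff_snapshots function and the sample data are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots (dicts keyed by primary key).

    Returns three dicts: inserted, updated, and deleted records.
    """
    prev_keys = previous.keys()
    curr_keys = current.keys()

    # Keys only in the current snapshot are new records.
    inserted = {k: current[k] for k in curr_keys - prev_keys}
    # Keys only in the previous snapshot have been deleted.
    deleted = {k: previous[k] for k in prev_keys - curr_keys}
    # Keys in both snapshots with different values have been updated.
    updated = {
        k: current[k]
        for k in curr_keys & prev_keys
        if current[k] != previous[k]
    }
    return inserted, updated, deleted


# Illustrative snapshots: id -> (name, city).
yesterday = {1: ("Ada", "London"), 2: ("Grace", "New York"), 3: ("Linus", "Helsinki")}
today = {1: ("Ada", "London"), 2: ("Grace", "Arlington"), 4: ("Margaret", "Boston")}

inserted, updated, deleted = diff_snapshots(yesterday, today)
```

This sketch holds both snapshots in memory. For large tables, the same comparison would more likely be done inside the database or on sorted exports.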
Database applications can be configured to track all activity in log files. For CDC purposes, those log files can be scanned and parsed (“scraped”) to identify when changes occur and to capture the changed records.
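Log formats vary widely between database products, so the following is only a sketch against an invented line format ("timestamp operation table id=N"). LOG_PATTERN, scrape_changes, and the sample lines are all hypothetical. A real implementation would parse the specific product's log or use its log-reading API.

```python
import re

# Assumed log line format: "<timestamp> <OPERATION> <table> id=<n>".
LOG_PATTERN = re.compile(
    r"^(?P<ts>\S+) (?P<op>INSERT|UPDATE|DELETE) (?P<table>\w+) id=(?P<id>\d+)$"
)


def scrape_changes(lines):
    """Return (timestamp, operation, table, id) for each data-changing log line."""
    changes = []
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match:  # Lines that are not inserts, updates, or deletes are skipped.
            changes.append(
                (match["ts"], match["op"], match["table"], int(match["id"]))
            )
    return changes


sample_log = [
    "2024-01-03T08:15:00 INSERT customers id=3",
    "2024-01-03T08:15:02 SELECT customers id=3",
    "2024-01-03T08:16:10 UPDATE customers id=2",
    "2024-01-03T08:17:45 DELETE orders id=7",
]

scraped = scrape_changes(sample_log)
```

The read-only SELECT line is ignored, leaving only the three entries that represent data changes.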