
Change Data Capture (CDC)

Change Data Capture (CDC) is the practice of detecting every change in a source system and forwarding it to downstream systems. It keeps your lakehouse or data warehouse close to real time without repeatedly copying the full dataset.

What is Change Data Capture?

Change Data Capture, usually shortened to CDC, is the practice of detecting every change in a source system (insert, update, delete) and streaming it to other systems. Instead of reloading the full table every night, you send only what changed since the previous run. That scales better, puts less load on the source, and makes near real time synchronisation possible.

CDC shows up in nearly every modern data stack: replication between operational systems, feeding a lakehouse or data warehouse, syncing microservices, and real-time analytics on transactional data.

Think of CDC as a bookkeeper's daybook: instead of rewriting the whole ledger every evening, you note only the day's movements. Anyone who wants the current state reads the latest known version and applies the movements since.
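In code, the daybook idea is just a base snapshot plus an ordered list of changes. A minimal sketch (all names and the change format are hypothetical, for illustration only):

```python
def current_state(snapshot, changes):
    """Rebuild the latest state from a base snapshot plus recorded changes."""
    state = dict(snapshot)  # copy: the snapshot itself stays untouched
    for change in changes:  # changes must be applied in commit order
        if change["op"] == "delete":
            state.pop(change["key"], None)
        else:  # insert and update both overwrite the row
            state[change["key"]] = change["row"]
    return state

base = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
daybook = [
    {"op": "update", "key": 1, "row": {"name": "Alice B."}},
    {"op": "delete", "key": 2},
    {"op": "insert", "key": 3, "row": {"name": "Carol"}},
]
print(current_state(base, daybook))
# {1: {'name': 'Alice B.'}, 3: {'name': 'Carol'}}
```

Note that order matters: replaying the same movements in a different sequence can yield a different ledger, which is why CDC streams preserve commit order.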

Why CDC instead of a full load?

Full-load ETL works fine as long as tables stay small. Past a few hundred million rows, three problems start to bite.

Source pressure
A nightly full load can pin an OLTP database for minutes at a stretch while it scans every row. For 24/7 operations that is not an option.

Runtime
Full loads keep growing while the refresh window does not. The business asks for fresher data while the ingest only takes longer.

Cost
Cloud platforms charge by the minute for compute. An incremental load that touches only the changed rows is almost always cheaper than rescanning the full table.

CDC fixes all of this by shipping only the deltas.

CDC methods

Log-based CDC
The database writes every change to an internal transaction log (WAL in PostgreSQL, binlog in MySQL, CDC tables in SQL Server). CDC tools read that log and publish every change as an event. Big upside: minimal impact on the source and no application changes needed. Tools like Debezium, SQL Server CDC, and Fabric Mirroring all work this way.
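A log-based tool like Debezium publishes each log entry as a change event carrying the row's before- and after-image plus an op code (`c` create, `u` update, `d` delete, `r` read during the initial snapshot). A hedged sketch of interpreting such an envelope; the envelope fields follow Debezium's documented format, but the surrounding function is a hypothetical consumer, not Debezium's own API:

```python
import json

def interpret(event_json):
    """Turn a Debezium-style change event into an (action, row) pair."""
    event = json.loads(event_json)
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update: after-image wins
        return ("upsert", event["after"])
    if op == "d":              # delete: only the before-image exists
        return ("delete", event["before"])
    raise ValueError(f"unknown op {op!r}")

msg = '{"op": "u", "before": {"id": 7, "qty": 1}, "after": {"id": 7, "qty": 3}}'
print(interpret(msg))  # ('upsert', {'id': 7, 'qty': 3})
```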

Trigger-based CDC
You place database triggers that write every insert, update, or delete to a changelog table. Works in any database, but slows writes and needs maintenance on every table involved.

Timestamp-based CDC
Every table has a last_modified column. Your pipeline pulls rows changed since the previous run. Simple and cheap, but misses deletes and breaks if someone forgets to update the column.
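The pull boils down to "give me rows beyond my watermark, then advance the watermark". A minimal in-memory sketch, assuming a last_modified column with ISO-formatted timestamps (table and column names hypothetical):

```python
def pull_increment(rows, watermark):
    """Return rows changed since the watermark plus the new watermark.
    Equivalent to: SELECT * FROM t WHERE last_modified > :watermark."""
    changed = [r for r in rows if r["last_modified"] > watermark]
    new_watermark = max((r["last_modified"] for r in changed), default=watermark)
    return changed, new_watermark

table = [
    {"id": 1, "last_modified": "2026-04-20T08:00"},
    {"id": 2, "last_modified": "2026-04-22T09:30"},
    {"id": 3, "last_modified": "2026-04-23T11:15"},
]
delta, wm = pull_increment(table, "2026-04-21T00:00")
print([r["id"] for r in delta], wm)  # [2, 3] 2026-04-23T11:15
```

Note the weakness named above: a row deleted from the source simply never matches the filter, so the target never hears about it.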

Snapshot differencing
You compare a fresh snapshot with the previous one and derive the deltas. Always works, but slow and heavy. Sometimes the last resort when the source allows nothing else.
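The comparison itself is simple once both snapshots are keyed by primary key; the cost is that you must hold and scan both in full. A minimal sketch with hypothetical row data:

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key and derive the deltas."""
    inserts = [new[k] for k in new.keys() - old.keys()]
    deletes = [old[k] for k in old.keys() - new.keys()]
    updates = [new[k] for k in new.keys() & old.keys() if new[k] != old[k]]
    return inserts, updates, deletes

yesterday = {1: {"id": 1, "qty": 5}, 2: {"id": 2, "qty": 8}}
today = {1: {"id": 1, "qty": 6}, 3: {"id": 3, "qty": 1}}
ins, upd, dele = diff_snapshots(yesterday, today)
print(ins, upd, dele)
# [{'id': 3, 'qty': 1}] [{'id': 1, 'qty': 6}] [{'id': 2, 'qty': 8}]
```

Unlike the timestamp method, this does capture deletes, which is part of why it remains the fallback of last resort.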

How CDC fits a modern data stack

  1. Capture
    A CDC tool reads changes from the source. Mature connectors exist for SQL Server, Oracle, PostgreSQL, and MySQL.

  2. Transport
    Changes flow into a streaming layer like Apache Kafka, Azure Event Hubs, or Pulsar. That allows several consumers on the same stream.

  3. Sink
    Downstream systems subscribe: a lakehouse writes the changes as Delta tables, a search index updates its records, a microservice rebuilds a cache.

  4. Apply
    In the target, inserts are added, updates overwrite, and deletes are tombstoned or truly removed. Delta's MERGE INTO statement is a workhorse here.
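The four apply rules map onto a single Delta MERGE. A sketch of the statement shape, assuming a staging table of changes with an op column (table and column names are hypothetical; in a Spark notebook you would execute it via spark.sql(merge_sql)):

```python
# One MERGE covers the whole apply step: deletes remove, updates
# overwrite, inserts add. SET * / INSERT * copy all columns by name.
merge_sql = """
MERGE INTO customers AS t
USING changes AS s
  ON t.id = s.id
WHEN MATCHED AND s.op = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED AND s.op <> 'delete' THEN INSERT *
"""
```

Ordering of the WHEN MATCHED clauses matters: the delete condition must come first, or every matched row falls through to the unconditional update.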

CDC in Microsoft Fabric

Microsoft Fabric ships a dedicated feature called Mirroring that delivers CDC out of the box for Azure SQL Database, Cosmos DB, and Snowflake. Source changes appear as Delta tables in OneLake within seconds, without you building a pipeline. For other sources you can use the classic Data Factory connectors with an incremental load pattern, or Debezium via Event Streams.

When do you reach for CDC?

  • Feeding a warehouse or lakehouse with operational data without stressing the source database.

  • Real-time dashboards on transactional data, for example sales per minute.

  • Microservice synchronisation, where a service keeps its own cache current based on events from another service.

  • Fraud detection where seconds matter: a suspicious transaction demands an immediate response.

  • Migration to a new system without downtime, by doing a bulk load first and then letting CDC catch up the deltas until you can cut over.

Pitfalls

Schema changes break pipelines
Adding a column in the source does not flow through a CDC stream automatically unless your tool handles schema evolution. Plan data lineage checks in.
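A cheap guard is to compare the columns arriving in the stream against what the target knows, and fail fast on drift instead of silently dropping data. A minimal sketch (function and column names hypothetical):

```python
def schema_drift(source_cols, target_cols):
    """Flag columns the source sends that the target does not know,
    and columns the target expects that the source no longer sends."""
    return {
        "new_in_source": sorted(set(source_cols) - set(target_cols)),
        "missing_in_source": sorted(set(target_cols) - set(source_cols)),
    }

drift = schema_drift(["id", "name", "email"], ["id", "name"])
print(drift)  # {'new_in_source': ['email'], 'missing_in_source': []}
```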

Missing deletes
Some CDC methods (especially timestamp-based) never capture deletes. Your target table keeps rows that have long since been removed from the source.

Exactly-once versus at-least-once
Streaming systems sometimes deliver messages more than once. Make your apply logic idempotent, for example through a unique change key.
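Idempotency in miniature: derive a unique change key (for example log position plus table plus primary key) and skip anything you have already applied. A sketch with hypothetical names; a real pipeline would persist the seen-key set transactionally with the target:

```python
def apply_once(target, applied_keys, change):
    """Apply a change exactly once: redelivered messages are harmless."""
    if change["change_key"] in applied_keys:
        return False  # duplicate delivery, skip
    applied_keys.add(change["change_key"])
    target[change["pk"]] = change["row"]
    return True

target, seen = {}, set()
msg = {"change_key": "wal:001:orders:7", "pk": 7, "row": {"qty": 3}}
print(apply_once(target, seen, msg))  # True: first delivery applied
print(apply_once(target, seen, msg))  # False: redelivery ignored
```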

Watch replication lag
Without monitoring you only find out late that the stream is falling behind. Set an alert on the delay between source commit and sink apply.
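The metric itself is just the gap between source commit time and sink apply time, checked against a threshold. A minimal sketch (the five-minute threshold is an arbitrary example):

```python
from datetime import datetime, timedelta, timezone

def replication_lag(source_commit_ts, sink_apply_ts,
                    threshold=timedelta(minutes=5)):
    """Return the current lag and whether it crosses the alert threshold."""
    lag = sink_apply_ts - source_commit_ts
    return lag, lag > threshold

commit = datetime(2026, 4, 23, 12, 0, tzinfo=timezone.utc)
applied = datetime(2026, 4, 23, 12, 9, tzinfo=timezone.utc)
lag, alert = replication_lag(commit, applied)
print(lag, alert)  # 0:09:00 True
```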

Last Updated: April 23, 2026
Keywords
change data capture cdc etl elt data integration debezium sql server cdc fabric mirroring data engineering streaming kafka