Tutorial Data Engineering

Change Data Capture explained

SkyDeLake Jun 28, 2026 4 min read 0 views

Data Engineering Fundamentals

View full path →
  1. What data engineers actually do
  2. The anatomy of a data pipeline
  3. Warehouses, lakes, and lakehouses — what they are and when each wins
  4. Why Parquet? File formats explained
  5. Partitioning data at scale
  6. Predicate pushdown and column pruning — how query engines skip work
  7. Delta Lake, Iceberg, Hudi — open table formats explained
  8. Extracting data from anywhere — APIs, databases, and files
  9. Change Data Capture explained
  10. Batch vs streaming — choosing the right ingestion model
  11. SQL for data engineers
  12. dbt from first principles
  13. The medallion architecture — Bronze, Silver, Gold
  14. Idempotency — the most important property a data pipeline can have
  15. Orchestration fundamentals

Incremental extraction with a high-water mark works well — until it doesn't. If a row is updated without changing the updated_at timestamp, you miss it. If a row is hard-deleted, you never know it's gone. If the source database doesn't have a reliable timestamp column at all, the whole pattern breaks down.

Change Data Capture (CDC) is the answer to these problems. Rather than querying the source database, CDC reads the database's internal change log directly — capturing every insert, update, and delete as it happens, with perfect fidelity.

Why polling doesn't scale

Timestamp-based polling has three structural problems:

  1. Missed deletes. When a row is hard-deleted (DELETE FROM orders WHERE id = 123), it's gone. There's nothing to query. Your data warehouse still has the row; your pipeline has no way to know it should be removed.
  2. Missed updates with no timestamp. Not all source tables have an updated_at column. Even when they do, not all applications reliably update it.
  3. Load on the source. Regular polling queries — especially on large tables — put continuous read load on the production database, even when nothing has changed.

CDC solves all three.

What CDC actually reads: the database write-ahead log

Every major relational database maintains a Write-Ahead Log (WAL) — a sequential record of every change made to the database, written before the change is applied to the actual data. The WAL is how databases achieve crash recovery: if the server crashes mid-transaction, the log shows what needs to be replayed or rolled back on restart.

Log-based CDC reads this WAL as a stream of change events:

WAL events (logical replication format):
{op: "INSERT", table: "orders", after:  {id: 101, customer_id: 5, amount: 99.00, status: "pending"}}
{op: "UPDATE", table: "orders", before: {id: 101, status: "pending"}, after: {id: 101, status: "shipped"}}
{op: "DELETE", table: "orders", before: {id: 101, customer_id: 5, amount: 99.00}}

Every event has: the operation type (INSERT/UPDATE/DELETE), the table, and the full before and after state of the row. This is complete change information — nothing is missed.

How to enable CDC on PostgreSQL

PostgreSQL's built-in logical replication makes CDC straightforward to set up:

-- Enable logical replication (requires restart)
-- In postgresql.conf:
-- wal_level = logical

-- Create a replication slot
SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput');

-- Create a publication (which tables to capture)
CREATE PUBLICATION my_publication FOR TABLE orders, customers, products;

A replication slot is a cursor in the WAL — the database keeps WAL segments available until the slot has consumed them. This means the CDC consumer can fall behind and catch up without losing events.

Debezium: the standard CDC tool

Debezium is the most widely used open-source CDC connector. It connects to a database's replication stream, reads change events, and publishes them to a Kafka topic — one topic per source table.

{
  "name": "postgres-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "replicator",
    "database.password": "secret",
    "database.dbname": "production",
    "database.server.name": "prod",
    "table.include.list": "public.orders,public.customers",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.name": "dbz_publication"
  }
}

Once running, Debezium produces events to Kafka topics named prod.public.orders and prod.public.customers. Downstream consumers (a Flink job, a Python consumer, a Kafka Connect sink) read from these topics and apply the changes to the destination — updating rows on INSERT/UPDATE, deleting rows on DELETE.

Handling CDC in the destination

Applying CDC events to a data warehouse requires implementing the same semantics as the source database:

def apply_cdc_event(event: dict, dest_conn):
    op = event['op']

    if op == 'c':  # create / insert
        dest_conn.execute(
            "INSERT INTO orders VALUES (%s, %s, %s, %s)",
            event['after'].values()
        )
    elif op == 'u':  # update
        dest_conn.execute(
            "UPDATE orders SET amount=%s, status=%s WHERE id=%s",
            (event['after']['amount'], event['after']['status'], event['after']['id'])
        )
    elif op == 'd':  # delete
        dest_conn.execute(
            "DELETE FROM orders WHERE id=%s",
            (event['before']['id'],)
        )

In practice, deletes in data warehouses are handled carefully. Rather than physically deleting rows (which can violate audit requirements), many teams implement soft deletes in the destination: add a _deleted boolean and a _deleted_at timestamp, set them on DELETE events, and filter them out in the transformation layer.

When to use CDC vs polling

CDC is powerful but adds operational complexity — you need Kafka (or an alternative), a connector process, and replication slots managed on the source.

Use CDC when:

Use polling when:

Managed tools like Fivetran and Airbyte implement CDC behind the scenes for sources that support it — using it automatically when available and falling back to polling when not. This is often the right starting point before building custom CDC infrastructure.