Change Data Capture explained

Incremental extraction with a high-water mark works well — until it doesn't. If a row is updated without changing the updated_at timestamp, you miss it. If a row is hard-deleted, you never know it's gone. If the source database doesn't have a reliable timestamp column at all, the whole pattern breaks down.

Change Data Capture (CDC) is the answer to these problems. Rather than querying the source database, CDC reads the database's internal change log directly — capturing every insert, update, and delete as it happens, with perfect fidelity.

Why polling doesn't scale

Timestamp-based polling has three structural problems:

Missed deletes. When a row is hard-deleted (DELETE FROM orders WHERE id = 123), it's gone. There's nothing to query. Your data warehouse still has the row; your pipeline has no way to know it should be removed.
Missed updates with no timestamp. Not all source tables have an updated_at column. Even when they do, not all applications reliably update it.
Load on the source. Regular polling queries — especially on large tables — put continuous read load on the production database, even when nothing has changed.

CDC solves all three.

What CDC actually reads: the database write-ahead log

Every major relational database maintains a Write-Ahead Log (WAL) — a sequential record of every change made to the database, written before the change is applied to the actual data. The WAL is how databases achieve crash recovery: if the server crashes mid-transaction, the log shows what needs to be replayed or rolled back on restart.

Log-based CDC reads this WAL as a stream of change events:

WAL events (logical replication format):
{op: "INSERT", table: "orders", after:  {id: 101, customer_id: 5, amount: 99.00, status: "pending"}}
{op: "UPDATE", table: "orders", before: {id: 101, status: "pending"}, after: {id: 101, status: "shipped"}}
{op: "DELETE", table: "orders", before: {id: 101, customer_id: 5, amount: 99.00}}

Every event has: the operation type (INSERT/UPDATE/DELETE), the table, and the full before and after state of the row. This is complete change information — nothing is missed.

How to enable CDC on PostgreSQL

PostgreSQL's built-in logical replication makes CDC straightforward to set up:

-- Enable logical replication (requires restart)
-- In postgresql.conf:
-- wal_level = logical

-- Create a replication slot
SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput');

-- Create a publication (which tables to capture)
CREATE PUBLICATION my_publication FOR TABLE orders, customers, products;

A replication slot is a cursor in the WAL — the database keeps WAL segments available until the slot has consumed them. This means the CDC consumer can fall behind and catch up without losing events.

Debezium: the standard CDC tool

Debezium is the most widely used open-source CDC connector. It connects to a database's replication stream, reads change events, and publishes them to a Kafka topic — one topic per source table.

{
  "name": "postgres-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "replicator",
    "database.password": "secret",
    "database.dbname": "production",
    "database.server.name": "prod",
    "table.include.list": "public.orders,public.customers",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.name": "dbz_publication"
  }
}

Once running, Debezium produces events to Kafka topics named prod.public.orders and prod.public.customers. Downstream consumers (a Flink job, a Python consumer, a Kafka Connect sink) read from these topics and apply the changes to the destination — updating rows on INSERT/UPDATE, deleting rows on DELETE.

Handling CDC in the destination

Applying CDC events to a data warehouse requires implementing the same semantics as the source database:

def apply_cdc_event(event: dict, dest_conn):
    op = event['op']

    if op == 'c':  # create / insert
        dest_conn.execute(
            "INSERT INTO orders VALUES (%s, %s, %s, %s)",
            event['after'].values()
        )
    elif op == 'u':  # update
        dest_conn.execute(
            "UPDATE orders SET amount=%s, status=%s WHERE id=%s",
            (event['after']['amount'], event['after']['status'], event['after']['id'])
        )
    elif op == 'd':  # delete
        dest_conn.execute(
            "DELETE FROM orders WHERE id=%s",
            (event['before']['id'],)
        )

In practice, deletes in data warehouses are handled carefully. Rather than physically deleting rows (which can violate audit requirements), many teams implement soft deletes in the destination: add a _deleted boolean and a _deleted_at timestamp, set them on DELETE events, and filter them out in the transformation layer.

When to use CDC vs polling

CDC is powerful but adds operational complexity — you need Kafka (or an alternative), a connector process, and replication slots managed on the source.

Use CDC when:

You need to capture deletes
The source table doesn't have a reliable updated_at column
You need near-real-time data propagation (seconds, not minutes)
High change volumes make polling expensive

Use polling when:

Deletes don't matter or are handled by soft-delete patterns in the source
The source has a reliable timestamp column
15-minute or hourly freshness is acceptable
You want simplicity — a cron job that runs a SQL query is much easier to operate than a Kafka + Debezium stack

Managed tools like Fivetran and Airbyte implement CDC behind the scenes for sources that support it — using it automatically when available and falling back to polling when not. This is often the right starting point before building custom CDC infrastructure.