Tutorial Data Engineering

Idempotency — the most important property a data pipeline can have

SkyDeLake Jun 28, 2026 5 min read 1 views

Data Engineering Fundamentals

View full path →
  1. What data engineers actually do
  2. The anatomy of a data pipeline
  3. Warehouses, lakes, and lakehouses — what they are and when each wins
  4. Why Parquet? File formats explained
  5. Partitioning data at scale
  6. Predicate pushdown and column pruning — how query engines skip work
  7. Delta Lake, Iceberg, Hudi — open table formats explained
  8. Extracting data from anywhere — APIs, databases, and files
  9. Change Data Capture explained
  10. Batch vs streaming — choosing the right ingestion model
  11. SQL for data engineers
  12. dbt from first principles
  13. The medallion architecture — Bronze, Silver, Gold
  14. Idempotency — the most important property a data pipeline can have
  15. Orchestration fundamentals

Pipelines fail. Servers crash mid-run. Network connections drop after writing half the records. A job gets killed and restarted automatically by the scheduler. These are not edge cases — they are the normal operating conditions of any production data system.

The question is not whether your pipeline will run twice, or fail halfway through. The question is: when that happens, will the data be correct?

Idempotency is the property that answers yes. An idempotent operation produces the same result no matter how many times it is executed with the same inputs. Run it once: correct result. Run it twice: same correct result. Run it after a crash and restart: same correct result.

What non-idempotency looks like

A pipeline that inserts rows without checking for duplicates is not idempotent:

# Non-idempotent: run twice → double the rows
def load_orders(orders: list[dict]):
    for order in orders:
        db.execute(
            "INSERT INTO warehouse.orders (order_id, amount, status) VALUES (?, ?, ?)",
            (order['id'], order['amount'], order['status'])
        )

If this job runs successfully, then crashes and reruns, every row gets inserted twice. Your revenue numbers are now doubled. Detecting and fixing this in a production table with millions of rows is painful, time-consuming, and often incomplete.

The idempotent alternative: INSERT ... ON CONFLICT

The simplest fix: use upsert semantics. Insert the row if it doesn't exist; update it if it does.

-- PostgreSQL: ON CONFLICT DO UPDATE (upsert)
INSERT INTO warehouse.orders (order_id, amount, status)
VALUES (:order_id, :amount, :status)
ON CONFLICT (order_id)
DO UPDATE SET
    amount = EXCLUDED.amount,
    status = EXCLUDED.status;
-- Snowflake / most warehouses: MERGE
MERGE INTO warehouse.orders AS target
USING staging.orders_batch AS source
    ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET
    amount = source.amount,
    status = source.status
WHEN NOT MATCHED THEN INSERT (order_id, amount, status)
    VALUES (source.order_id, source.amount, source.status);

Now re-running the job produces the same result every time. The second run finds existing rows and updates them in place rather than inserting duplicates.

Partition-based idempotency

For large batch pipelines that process data by date partitions, a clean idempotency pattern is to delete the entire partition and rewrite it:

def load_daily_partition(date: str, records: list[dict]):
    # Delete the partition first — no matter what was there before
    db.execute(
        "DELETE FROM warehouse.events WHERE event_date = ?",
        (date,)
    )

    # Then insert clean
    db.executemany(
        "INSERT INTO warehouse.events (event_id, event_date, event_type, user_id) VALUES (?, ?, ?, ?)",
        [(r['id'], date, r['type'], r['user_id']) for r in records]
    )

Run it once: correct partition. Run it twice: same correct partition. Run it after a failure mid-insert: deletes whatever partial data was written, reinserts clean. The operation is always self-correcting.

This pattern is standard in dbt's incremental models (with the unique_key config) and in Spark batch jobs that write to partitioned Delta/Iceberg tables using insertOverwrite semantics.

Exactly-once vs at-least-once processing

In streaming systems, there are three processing guarantees:

The practical strategy for most teams: design for at-least-once processing (simpler, more broadly supported) and make all downstream writes idempotent. The combination gives you exactly-once semantics in the output without the full complexity of exactly-once guarantees in the processing engine.

Idempotency keys

For APIs and event-driven systems, idempotency keys are the standard pattern. Each operation includes a unique key; if the same key is submitted twice, the server recognises the duplicate and returns the original response rather than processing it twice.

import hashlib
import json

def publish_event_idempotent(event: dict):
    # Generate a deterministic key from the event content
    idempotency_key = hashlib.sha256(
        json.dumps(event, sort_keys=True).encode()
    ).hexdigest()

    # Check if we've already processed this event
    existing = db.execute(
        "SELECT id FROM processed_events WHERE idempotency_key = ?",
        (idempotency_key,)
    ).fetchone()

    if existing:
        return {'status': 'already_processed', 'id': existing[0]}

    # Process and record
    result = do_process(event)
    db.execute(
        "INSERT INTO processed_events (idempotency_key, result_id) VALUES (?, ?)",
        (idempotency_key, result['id'])
    )
    return result

Testing idempotency

Every data pipeline should have an explicit idempotency test: run the pipeline, verify the output, run the pipeline again with the same input, verify the output is identical.

def test_load_is_idempotent():
    sample_orders = [
        {'id': 1, 'amount': 100, 'status': 'completed'},
        {'id': 2, 'amount': 200, 'status': 'pending'},
    ]

    load_orders(sample_orders)
    count_after_first = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

    load_orders(sample_orders)  # run again with same input
    count_after_second = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

    assert count_after_first == count_after_second, (
        f"Idempotency failure: {count_after_first} rows after first run, "
        f"{count_after_second} rows after second run"
    )

Idempotency is not just a property — it's a design discipline

The biggest mistake in data engineering is treating idempotency as an afterthought — something to add if there are duplicate problems. It needs to be designed in from the start.

Every write operation in a production pipeline should have an explicit answer to the question: what happens if this runs twice? If the answer is "we get duplicates", the pipeline is not production-ready. Not because failures are common, but because they are inevitable, and the cost of cleaning up a non-idempotent failure at scale is orders of magnitude higher than the cost of designing idempotency in.