The anatomy of a data pipeline

Every data engineering project, at its core, is a pipeline. Before you can design one, optimise one, or debug one at 2am, you need a clear mental model of what a pipeline actually is — not the buzzword, but the physical thing: what stages it has, what happens at each one, and where things go wrong.

This post maps the anatomy. Everything else in this series builds on it.

The five stages

A data pipeline moves data from source systems to the place where it gets used. That journey has five distinct stages, each with its own job and its own failure modes.

Source systems
     ↓
Ingestion layer
     ↓
Raw storage (landing zone)
     ↓
Transformation layer
     ↓
Serving layer

Let's walk through each one.

Stage 1: Source systems

Source systems are where data originates. They were not built to share data — they were built to run the business. A production PostgreSQL database is optimised for fast transactional reads and writes, not for bulk analytical exports. A third-party SaaS tool exposes an API designed for application use, not for data engineering.

Common source types:

The data engineer's first problem is always: how do I get data out of this system without disrupting it, losing records, or duplicating them?

Stage 2: Ingestion

Ingestion is the act of extracting data from source systems and moving it into your infrastructure. This is harder than it sounds.

The two main ingestion patterns are:

Full extraction — pull everything, every time. Simple to implement, expensive at scale. Fine for small tables (under a few million rows) that don't change too fast.

Incremental extraction — pull only what changed since the last run. More efficient, but requires the source to expose a reliable way to identify changes — usually a timestamp column (updated_at) or a sequence number.

-- Incremental extraction using a high-water mark
SELECT *
FROM orders
WHERE updated_at > '2024-01-15 00:00:00'
ORDER BY updated_at ASC

The catch with incremental extraction: not all source systems have reliable updated_at columns, and hard deletes (where a row is physically removed) are invisible to timestamp-based extraction. This is why Change Data Capture — covered later in this series — exists.

Stage 3: Raw storage (the landing zone)

Before any transformation happens, data lands in raw storage exactly as it came from the source. No cleaning, no renaming, no business logic. This is often called the landing zone or Bronze layer.

Why keep raw data untouched?

Raw storage lives in object storage — Amazon S3, Google Cloud Storage, or Azure Data Lake Storage — in a structured folder hierarchy:

s3://my-data-lake/
  raw/
    source=postgres/
      table=orders/
        date=2024-01-15/
          orders_20240115_120000.parquet

Files are typically written in Parquet or JSON depending on the source. Storage is cheap. Keep everything.

Stage 4: Transformation

This is where raw data becomes useful data. The transformation layer takes the messy, inconsistent, sometimes wrong data from raw storage and produces clean, structured, analysis-ready tables.

Typical transformations include:

In modern stacks, this layer is almost always SQL, managed by dbt. dbt lets you write each transformation as a SQL SELECT statement, handles dependencies between models, runs tests on the outputs, and generates documentation automatically. More on this in Post 12.

-- A simple transformation: clean and standardise orders
SELECT
    id                                    AS order_id,
    customer_id,
    UPPER(TRIM(status))                   AS status,
    amount / 100.0                        AS amount_gbp,
    DATE(created_at)                      AS order_date
FROM raw.orders
WHERE id IS NOT NULL

Stage 5: Serving layer

The serving layer is where the data is finally consumed. Different consumers need different things:

The rule for serving layer design: optimise for how each consumer actually reads the data. A BI dashboard that always filters by date and country needs a table partitioned on those columns. A real-time application needs a serving database, not a warehouse query.

Where things go wrong

Understanding the anatomy also means knowing where failures concentrate:

Between source and ingestion: schema changes. A column gets renamed or a new nullable field appears. Ingestion jobs that assumed the old schema break or silently drop data.

In ingestion itself: partial failures. The job extracts 8 hours of data and crashes. Does re-running it produce duplicates? (This is why idempotency matters — Post 14.)

In transformation: business logic bugs. A revenue calculation uses the wrong currency conversion. The output looks valid. Nobody notices for three weeks.

In serving: stale data. The pipeline ran fine but the dashboard is showing yesterday's numbers because the serving table wasn't refreshed. Or worse: it was refreshed with partially loaded data.

Every stage has a failure mode. Good data engineering is mostly about building systems that fail loudly rather than silently — and that can recover cleanly when they do fail.

The map going forward

The rest of this series follows the pipeline from left to right. Storage and formats (Posts 4–7) cover what raw storage actually looks like and how data is physically structured. Ingestion (Posts 8–10) goes deep on extraction patterns. Transformation (Posts 11–13) covers SQL, dbt, and how to layer your models. Reliability (Posts 14–15) covers the engineering that separates a script from a production system.

Keep this anatomy in mind as a map. Every concept that follows lives somewhere on it.