Tutorial Data Engineering

The medallion architecture — Bronze, Silver, Gold

SkyDeLake Jun 28, 2026 5 min read 0 views

Data Engineering Fundamentals

View full path →
  1. What data engineers actually do
  2. The anatomy of a data pipeline
  3. Warehouses, lakes, and lakehouses — what they are and when each wins
  4. Why Parquet? File formats explained
  5. Partitioning data at scale
  6. Predicate pushdown and column pruning — how query engines skip work
  7. Delta Lake, Iceberg, Hudi — open table formats explained
  8. Extracting data from anywhere — APIs, databases, and files
  9. Change Data Capture explained
  10. Batch vs streaming — choosing the right ingestion model
  11. SQL for data engineers
  12. dbt from first principles
  13. The medallion architecture — Bronze, Silver, Gold
  14. Idempotency — the most important property a data pipeline can have
  15. Orchestration fundamentals

You can have good ingestion, good storage, and good transformation logic — and still build a data platform that nobody trusts. The most common reason: nobody can tell where data is in the pipeline, which version of a table is which, or what "clean" actually means.

The medallion architecture is the standard answer to this problem. It organises data into three named layers — Bronze, Silver, Gold — each representing a different level of quality and purpose. It's not a tool or a specific technology; it's an organisational pattern for thinking about what happens to data as it moves through a pipeline.

The three layers

Bronze — raw data, preserved exactly as received

Bronze is the landing zone. Data arrives from source systems and is written to the Bronze layer exactly as it came — no transformations, no type casting, no schema normalisation. The only additions are metadata columns that the ingestion system adds: when the record was loaded, where it came from, what the source file was.

bronze/
├── source=postgres_orders/
│   └── date=2024-01-15/
│       └── load_timestamp=2024-01-15T06:00:00Z/
│           └── part-0001.parquet
├── source=stripe_events/
│   └── date=2024-01-15/
│       └── part-0001.json

The rule: Bronze is immutable and append-only. You never rewrite Bronze data. If the source system sent you bad data, that's in Bronze. If your schema changed, both old and new schema versions are in Bronze. This makes Bronze the ground truth — if anything goes wrong downstream, you can always replay from Bronze.

Silver — cleaned, validated, conforming

Silver is where data becomes trustworthy. Transformations applied at the Silver layer:

Silver models are entity-centric: you have a silver_customers table, a silver_orders table, a silver_products table. They represent what these entities actually are — clean, validated, with a consistent schema — not what any particular source system happened to store.

-- Silver layer: clean orders
SELECT
    id                                           AS order_id,
    customer_id,
    UPPER(TRIM(status))                          AS status,
    CAST(amount AS DECIMAL(10,2)) / 100          AS amount_usd,
    CAST(created_at AS TIMESTAMP)                AS ordered_at,
    _source_loaded_at                            AS bronze_loaded_at
FROM bronze.orders
WHERE id IS NOT NULL
  AND amount > 0
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY _source_loaded_at DESC) = 1

Gold — business-ready aggregates and models

Gold is what the business uses. It contains fact tables, dimension tables, summary aggregations, and business metric calculations. Gold models are built for consumption — by BI tools, dashboards, analysts running ad-hoc queries, or data scientists pulling training features.

Gold doesn't store raw facts — it stores answers to business questions:

-- Gold layer: daily revenue fact table
SELECT
    DATE_TRUNC('day', o.ordered_at)  AS order_date,
    c.country,
    c.plan_tier,
    COUNT(DISTINCT o.order_id)        AS order_count,
    COUNT(DISTINCT o.customer_id)     AS unique_customers,
    SUM(o.amount_usd)                 AS total_revenue_usd
FROM silver.orders o
JOIN silver.customers c USING (customer_id)
WHERE o.status = 'COMPLETED'
GROUP BY 1, 2, 3

Why this pattern works

The medallion architecture works because each layer has a clear contract:

Bronze never lies about what the source sent. If there's a dispute about what data was received, Bronze settles it. If ingestion fails, you know exactly which records are missing — nothing was overwritten.

Silver never guesses about business rules. It cleans and validates, but does not apply business logic. "Total revenue" is not a Silver concept. "Amount in the currency the source sent, cast to decimal" is.

Gold never does cleaning. If a Gold model needs to apply a filter like WHERE amount > 0, that filter belongs in Silver. Gold tables should be able to trust that their inputs are already clean.

This separation means bugs have a layer. If a KPI is wrong, you check: is the raw data wrong (Bronze problem)? Is the cleaning logic wrong (Silver problem)? Is the business logic wrong (Gold problem)? You don't have to read a thousand-line SQL script to find out.

Reprocessing from Bronze

The most valuable property of the medallion architecture is reprocessability. When a Silver or Gold transformation has a bug, you don't re-extract from source systems. You re-run the transformation from Bronze:

Bronze (immutable) → rerun Silver transformation → rerun Gold models

This is fast (Bronze is already in your data lake), cheap (no API calls or database load on source systems), and complete (Bronze has everything, including schema versions from before the bug).

Implementing with dbt and Delta Lake

In practice, most teams implement this pattern with dbt for transformations and Delta Lake (or Iceberg) for storage:

models/
├── staging/     → Bronze-to-Silver transformations
└── marts/       → Silver-to-Gold transformations

Storage:
s3://data-lake/
├── bronze/      → raw Parquet/JSON files (written by ingestion, not dbt)
├── silver/      → Delta tables managed by dbt
└── gold/        → Delta tables managed by dbt

Bronze is typically written by ingestion tools (Airbyte, Fivetran, custom scripts) outside of dbt. Silver and Gold are dbt models — SQL transformations that run on each pipeline execution.

The naming is a red herring

Don't get attached to "Bronze / Silver / Gold" as the only valid names. Some teams call them "raw / trusted / refined", or "landing / curated / serving", or simply "layer 0 / layer 1 / layer 2". What matters is not the names but the discipline: each layer has a clear purpose, data only moves forward (never backward), and each layer can be fully reconstructed from the one before it.

The medallion architecture is the most widely adopted data organisation pattern precisely because it maps onto the natural structure of how data should flow: land it first, trust it second, use it third.