dbt from first principles

Before dbt, the transformation layer was a mess. SQL scripts lived in random folders, in Jupyter notebooks, in stored procedures that nobody dared touch. Nobody knew which queries ran in which order, nobody tested whether outputs were correct, and documentation was whatever someone had typed in a comment three years ago.

dbt (data build tool) solved this. It turned SQL transformation into proper software development — with version control, dependency management, testing, and documentation built in. It's the most important tool in the modern data stack, and understanding it from first principles makes everything else in data engineering clearer.

The core idea: every transformation is a SELECT

In dbt, you write each transformation as a SQL SELECT statement. No CREATE TABLE, no INSERT INTO — just the logic. dbt handles the materialisation.

-- models/staging/stg_orders.sql
SELECT
    id           AS order_id,
    customer_id,
    amount / 100 AS amount_usd,
    status,
    created_at   AS ordered_at
FROM {{ source('postgres', 'orders') }}

When you run dbt run, dbt reads this file and executes:

CREATE TABLE stg_orders AS
SELECT
    id           AS order_id,
    customer_id,
    amount / 100 AS amount_usd,
    status,
    created_at   AS ordered_at
FROM raw.orders

The {{ source() }} is a Jinja template function. dbt uses Jinja extensively to add dynamic behaviour to SQL — referencing other models, calling macros, conditionally including logic.

The ref() function and the dependency graph

The most important dbt function is ref(). When one model depends on another, you reference it with ref('model_name') instead of the raw table name.

-- models/marts/orders_enriched.sql
SELECT
    o.order_id,
    o.amount_usd,
    o.ordered_at,
    c.segment,
    c.country
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('dim_customers') }} c ON o.customer_id = c.customer_id

From these references, dbt builds a Directed Acyclic Graph (DAG) of all your models. This is how it knows what order to run things in. If orders_enriched references stg_orders, dbt ensures stg_orders runs first — automatically, without any manual orchestration configuration.

raw.orders ──→ stg_orders ──→ orders_enriched ──→ fct_revenue
                                    ↑
raw.customers → dim_customers ──────┘

You can visualize this graph with dbt docs generate && dbt docs serve — it produces an interactive DAG diagram of your entire transformation layer.

Materialisation types

By default, dbt creates tables (materialized='table'). But you can configure each model differently:

table: Drop and recreate the entire table on every run. Simple and correct. Appropriate for most models unless the table is very large.
view: Create a database view — the SELECT runs at query time, not at dbt run time. Fast to build, but can be slow to query if the upstream tables are large. Good for lightweight staging models.
incremental: On first run, build the full table. On subsequent runs, only process new or changed rows and append/merge them. The right choice for large fact tables with millions of rows.
ephemeral: Never create a physical object — inject the model's SQL as a CTE into the models that reference it. Useful for intermediate logic you want to encapsulate without creating a table.

-- models/marts/fct_orders.sql
{{
    config(
        materialized='incremental',
        unique_key='order_id'
    )
}}

SELECT
    order_id,
    customer_id,
    amount_usd,
    ordered_at
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
WHERE ordered_at > (SELECT MAX(ordered_at) FROM {{ this }})
{% endif %}

The {% if is_incremental() %} block only runs on subsequent runs (not the first build), adding a filter to process only new rows. {{ this }} is a special reference to the model's own table in the database.

Tests — the most underused feature

dbt has a built-in testing framework. Tests are YAML assertions that run against your model outputs after each run.

# models/staging/schema.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: amount_usd
        tests:
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'completed', 'cancelled', 'refunded']
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id

Run dbt test and each assertion runs as a SQL query. If any test fails, you get a clear error — and the downstream models that depend on this one haven't been built yet, so you catch data quality issues before they propagate.

The layered model structure

dbt projects that work well tend to follow a consistent three-layer structure:

models/
├── staging/     ← One model per source table. Rename columns, cast types, light cleaning.
│   ├── stg_orders.sql
│   ├── stg_customers.sql
│   └── schema.yml
├── intermediate/ ← Optional. Complex joins or business logic shared across marts.
│   └── int_customer_orders.sql
└── marts/       ← Business-facing output tables. Fact and dimension tables.
    ├── fct_orders.sql
    ├── dim_customers.sql
    └── schema.yml

The rule: each layer should only reference models in the same or earlier layers. Staging models pull from sources. Marts reference staging or intermediate models. No mart should reach into another mart's logic.

Sources and freshness

The {{ source() }} function does more than replace a table name. It registers raw source tables in dbt's metadata layer, enabling:

Source freshness checks: dbt source freshness checks whether source tables have been updated recently, alerting you when ingestion has failed without you needing to trace downstream symptoms.
Source-level documentation: describes what each raw table is and where it comes from.
DAG visibility: the lineage graph shows sources as entry points, making the full data flow visible from source to mart.

# models/staging/sources.yml
version: 2

sources:
  - name: postgres
    database: raw
    schema: public
    tables:
      - name: orders
        description: "Raw orders from the production PostgreSQL database"
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 24, period: hour}

Why dbt matters beyond the tool itself

dbt enforces a discipline: transformations are code, code lives in version control, outputs have tests, and every model is documented. These aren't features of dbt — they're engineering practices that dbt makes the path of least resistance.

Adopting dbt means your transformation layer is auditable, testable, and maintainable in a way that a collection of SQL scripts never could be. That's the real value — not the templating or the CLI, but the engineering culture it creates.