Tutorial Data Engineering

dbt from first principles

SkyDeLake Jun 28, 2026 5 min read 0 views

Data Engineering Fundamentals

View full path →
  1. What data engineers actually do
  2. The anatomy of a data pipeline
  3. Warehouses, lakes, and lakehouses — what they are and when each wins
  4. Why Parquet? File formats explained
  5. Partitioning data at scale
  6. Predicate pushdown and column pruning — how query engines skip work
  7. Delta Lake, Iceberg, Hudi — open table formats explained
  8. Extracting data from anywhere — APIs, databases, and files
  9. Change Data Capture explained
  10. Batch vs streaming — choosing the right ingestion model
  11. SQL for data engineers
  12. dbt from first principles
  13. The medallion architecture — Bronze, Silver, Gold
  14. Idempotency — the most important property a data pipeline can have
  15. Orchestration fundamentals

Before dbt, the transformation layer was a mess. SQL scripts lived in random folders, in Jupyter notebooks, in stored procedures that nobody dared touch. Nobody knew which queries ran in which order, nobody tested whether outputs were correct, and documentation was whatever someone had typed in a comment three years ago.

dbt (data build tool) solved this. It turned SQL transformation into proper software development — with version control, dependency management, testing, and documentation built in. It's the most important tool in the modern data stack, and understanding it from first principles makes everything else in data engineering clearer.

The core idea: every transformation is a SELECT

In dbt, you write each transformation as a SQL SELECT statement. No CREATE TABLE, no INSERT INTO — just the logic. dbt handles the materialisation.

-- models/staging/stg_orders.sql
SELECT
    id           AS order_id,
    customer_id,
    amount / 100 AS amount_usd,
    status,
    created_at   AS ordered_at
FROM {{ source('postgres', 'orders') }}

When you run dbt run, dbt reads this file and executes:

CREATE TABLE stg_orders AS
SELECT
    id           AS order_id,
    customer_id,
    amount / 100 AS amount_usd,
    status,
    created_at   AS ordered_at
FROM raw.orders

The {{ source() }} is a Jinja template function. dbt uses Jinja extensively to add dynamic behaviour to SQL — referencing other models, calling macros, conditionally including logic.

The ref() function and the dependency graph

The most important dbt function is ref(). When one model depends on another, you reference it with ref('model_name') instead of the raw table name.

-- models/marts/orders_enriched.sql
SELECT
    o.order_id,
    o.amount_usd,
    o.ordered_at,
    c.segment,
    c.country
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('dim_customers') }} c ON o.customer_id = c.customer_id

From these references, dbt builds a Directed Acyclic Graph (DAG) of all your models. This is how it knows what order to run things in. If orders_enriched references stg_orders, dbt ensures stg_orders runs first — automatically, without any manual orchestration configuration.

raw.orders ──→ stg_orders ──→ orders_enriched ──→ fct_revenue
                                    ↑
raw.customers → dim_customers ──────┘

You can visualize this graph with dbt docs generate && dbt docs serve — it produces an interactive DAG diagram of your entire transformation layer.

Materialisation types

By default, dbt creates tables (materialized='table'). But you can configure each model differently:

-- models/marts/fct_orders.sql
{{
    config(
        materialized='incremental',
        unique_key='order_id'
    )
}}

SELECT
    order_id,
    customer_id,
    amount_usd,
    ordered_at
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
WHERE ordered_at > (SELECT MAX(ordered_at) FROM {{ this }})
{% endif %}

The {% if is_incremental() %} block only runs on subsequent runs (not the first build), adding a filter to process only new rows. {{ this }} is a special reference to the model's own table in the database.

Tests — the most underused feature

dbt has a built-in testing framework. Tests are YAML assertions that run against your model outputs after each run.

# models/staging/schema.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: amount_usd
        tests:
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'completed', 'cancelled', 'refunded']
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id

Run dbt test and each assertion runs as a SQL query. If any test fails, you get a clear error — and the downstream models that depend on this one haven't been built yet, so you catch data quality issues before they propagate.

The layered model structure

dbt projects that work well tend to follow a consistent three-layer structure:

models/
├── staging/     ← One model per source table. Rename columns, cast types, light cleaning.
│   ├── stg_orders.sql
│   ├── stg_customers.sql
│   └── schema.yml
├── intermediate/ ← Optional. Complex joins or business logic shared across marts.
│   └── int_customer_orders.sql
└── marts/       ← Business-facing output tables. Fact and dimension tables.
    ├── fct_orders.sql
    ├── dim_customers.sql
    └── schema.yml

The rule: each layer should only reference models in the same or earlier layers. Staging models pull from sources. Marts reference staging or intermediate models. No mart should reach into another mart's logic.

Sources and freshness

The {{ source() }} function does more than replace a table name. It registers raw source tables in dbt's metadata layer, enabling:

# models/staging/sources.yml
version: 2

sources:
  - name: postgres
    database: raw
    schema: public
    tables:
      - name: orders
        description: "Raw orders from the production PostgreSQL database"
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 24, period: hour}

Why dbt matters beyond the tool itself

dbt enforces a discipline: transformations are code, code lives in version control, outputs have tests, and every model is documented. These aren't features of dbt — they're engineering practices that dbt makes the path of least resistance.

Adopting dbt means your transformation layer is auditable, testable, and maintainable in a way that a collection of SQL scripts never could be. That's the real value — not the templating or the CLI, but the engineering culture it creates.