SQL for data engineers

Data engineers work in SQL more than any other language. More than Python, more than Scala, more than anything else. Not because SQL is trendy, but because it is the most expressive language for data transformation — and virtually every data warehouse and lakehouse engine runs it natively.

This post is not SQL basics. It's the SQL that actually matters in a data engineering context: the patterns, functions, and approaches that appear constantly in real pipelines.

Window functions — the most important concept

If there's one SQL feature that separates analysts who write data pipelines from data engineers who build them, it's window functions. They perform calculations across a set of rows related to the current row, without collapsing those rows the way GROUP BY does.

The anatomy of a window function:

function_name() OVER (
    PARTITION BY column   -- group rows (like GROUP BY, but don't collapse)
    ORDER BY column       -- define order within the window
    ROWS/RANGE BETWEEN ... AND ...  -- optional: frame of rows to include
)

The most useful window functions in data engineering:

-- ROW_NUMBER: assign a sequential number within each group
-- Use case: deduplication — keep only the most recent record per customer
SELECT *
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
    FROM orders
) t
WHERE rn = 1;

-- LAG / LEAD: access the previous or next row
-- Use case: compute time between consecutive events
SELECT
    order_id,
    customer_id,
    created_at,
    LAG(created_at) OVER (PARTITION BY customer_id ORDER BY created_at) AS prev_order_at,
    DATEDIFF('day',
        LAG(created_at) OVER (PARTITION BY customer_id ORDER BY created_at),
        created_at
    ) AS days_since_prev_order
FROM orders;

-- SUM() OVER: running total
SELECT
    order_id,
    customer_id,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY created_at) AS cumulative_spend
FROM orders;

-- NTILE: divide rows into N equal buckets
-- Use case: segment customers into quartiles by spend
SELECT
    customer_id,
    total_spend,
    NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile
FROM customer_totals;

CTEs — readable pipelines in SQL

Common Table Expressions (CTEs) let you break a complex query into named steps. This is how data engineers write SQL that other people can actually read.

WITH
-- Step 1: clean the raw orders
cleaned_orders AS (
    SELECT
        id AS order_id,
        customer_id,
        UPPER(TRIM(status))  AS status,
        amount / 100.0       AS amount_usd,
        DATE(created_at)     AS order_date
    FROM raw.orders
    WHERE id IS NOT NULL
),

-- Step 2: compute per-customer aggregates
customer_stats AS (
    SELECT
        customer_id,
        COUNT(*)         AS order_count,
        SUM(amount_usd)  AS total_spend,
        MAX(order_date)  AS last_order_date
    FROM cleaned_orders
    WHERE status = 'COMPLETED'
    GROUP BY customer_id
),

-- Step 3: segment customers
customer_segments AS (
    SELECT
        customer_id,
        total_spend,
        order_count,
        last_order_date,
        CASE
            WHEN total_spend >= 1000 THEN 'high_value'
            WHEN total_spend >= 200  THEN 'mid_value'
            ELSE 'low_value'
        END AS segment
    FROM customer_stats
)

SELECT * FROM customer_segments
ORDER BY total_spend DESC;

Each CTE is a named, reusable step. Unlike subqueries, CTEs are defined once and can be referenced multiple times in the same query. In dbt, each CTE step typically becomes its own model file — making the logic even more modular and testable.

Slowly Changing Dimensions (SCDs)

One of the most common data modelling challenges: how do you track how a record has changed over time? A customer's country, plan tier, or email address can change — but you often need to know what it was at the time of each transaction.

The standard approach is SCD Type 2: rather than overwriting the old record, you close it (add an effective_to date) and insert a new record.

-- SCD Type 2 customer dimension
CREATE TABLE dim_customer (
    customer_key   INT PRIMARY KEY,  -- surrogate key
    customer_id    INT,              -- natural key from source
    email          VARCHAR(255),
    plan           VARCHAR(50),
    country        CHAR(2),
    effective_from DATE NOT NULL,
    effective_to   DATE,             -- NULL means current record
    is_current     BOOLEAN NOT NULL DEFAULT TRUE
);

-- To find a customer's plan at order time:
SELECT
    o.order_id,
    o.customer_id,
    o.amount,
    o.order_date,
    c.plan AS plan_at_order_time
FROM orders o
JOIN dim_customer c
    ON  o.customer_id  = c.customer_id
    AND o.order_date   BETWEEN c.effective_from AND COALESCE(c.effective_to, '9999-12-31')
    AND c.is_current = FALSE;  -- include historical rows

NULL handling

NULL behaviour in SQL is one of the most common sources of silent bugs. Three rules to know:

Any arithmetic with NULL returns NULL. 100 + NULL = NULL. Use COALESCE(amount, 0) to substitute a default.
NULL comparisons always return false. WHERE status = NULL matches nothing. Use WHERE status IS NULL.
NULL in GROUP BY is treated as a single group. Multiple NULLs in a grouping column are grouped together — this is often not what you want.

-- Pattern: handle NULLs explicitly in aggregations
SELECT
    customer_id,
    COUNT(*)                                    AS total_orders,
    COUNT(discount_code)                        AS orders_with_discount,  -- NULLs excluded automatically
    SUM(COALESCE(discount_amount, 0))           AS total_discount,
    AVG(NULLIF(delivery_days, 0))               AS avg_delivery_days  -- exclude 0-day deliveries (same-day)
FROM orders
GROUP BY customer_id;

Date and time manipulation

Data engineers spend an enormous amount of time working with dates. The functions vary by engine, but the patterns are consistent:

-- Truncate to a period (group by week, month, quarter)
DATE_TRUNC('month', order_date)          -- Snowflake, BigQuery, PostgreSQL
DATE_TRUNC(order_date, MONTH)            -- BigQuery alternative syntax

-- Extract a component
EXTRACT(DOW FROM order_date)             -- day of week (0=Sunday)
DATE_PART('hour', event_timestamp)       -- hour of day

-- Add/subtract intervals
order_date + INTERVAL '7 days'
DATEADD('day', 7, order_date)            -- Snowflake

-- Calculate differences
DATEDIFF('day', signup_date, first_order_date)

-- Convert timestamps to dates (strip time component)
DATE(event_timestamp)

-- Working with time zones
CONVERT_TIMEZONE('UTC', 'Europe/London', event_timestamp)  -- Snowflake
AT TIME ZONE 'Europe/London'                               -- PostgreSQL

Joins and their subtle behaviours

Two join patterns that catch data engineers regularly:

-- Anti-join: rows in A that have no match in B
-- Useful for: finding customers who never ordered, finding gaps
SELECT c.customer_id
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;

-- Fan-out on many-to-many joins: always check for row multiplication
-- If one customer has 3 addresses, joining on customer_id triples every row
SELECT
    o.order_id,
    o.amount,
    COUNT(*) OVER (PARTITION BY o.order_id) AS row_count  -- > 1 indicates fan-out
FROM orders o
JOIN customer_addresses a ON o.customer_id = a.customer_id;

Whenever writing a join in a pipeline, verify the grain of the result. If you expect one row per order and you're joining to a table that can have multiple rows per customer, you'll silently multiply your data.

Writing SQL for pipelines vs queries

A one-off analysis query can be clever and compact. SQL in a pipeline has to prioritise different things:

Explicit over implicit. Name every column instead of using SELECT *. Future schema changes will break the pipeline in a visible, controlled way rather than silently adding unexpected columns.
Deterministic. The same input should always produce the same output. Avoid CURRENT_TIMESTAMP or RAND() in transformation logic.
Testable. Write transformations so their logic can be unit-tested with small synthetic inputs. One assertion per rule: "all rows have a non-null order_id", "amount is always positive".
Readable. The engineer who debugs this at midnight will not be you. Use CTEs, meaningful aliases, and one column per line in SELECT lists.