Data engineers work in SQL more than any other language. More than Python, more than Scala, more than anything else. Not because SQL is trendy, but because it is the most expressive language for data transformation — and virtually every data warehouse and lakehouse engine runs it natively.
This post is not SQL basics. It's the SQL that actually matters in a data engineering context: the patterns, functions, and approaches that appear constantly in real pipelines.
Window functions — the most important concept
If there's one SQL feature that separates analysts who write data pipelines from data engineers who build them, it's window functions. They perform calculations across a set of rows related to the current row, without collapsing those rows the way GROUP BY does.
The anatomy of a window function:
function_name() OVER (
PARTITION BY column -- group rows (like GROUP BY, but don't collapse)
ORDER BY column -- define order within the window
ROWS/RANGE BETWEEN ... AND ... -- optional: frame of rows to include
)
The most useful window functions in data engineering:
-- ROW_NUMBER: assign a sequential number within each group
-- Use case: deduplication — keep only the most recent record per customer
SELECT *
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
FROM orders
) t
WHERE rn = 1;
-- LAG / LEAD: access the previous or next row
-- Use case: compute time between consecutive events
SELECT
order_id,
customer_id,
created_at,
LAG(created_at) OVER (PARTITION BY customer_id ORDER BY created_at) AS prev_order_at,
DATEDIFF('day',
LAG(created_at) OVER (PARTITION BY customer_id ORDER BY created_at),
created_at
) AS days_since_prev_order
FROM orders;
-- SUM() OVER: running total
SELECT
order_id,
customer_id,
amount,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY created_at) AS cumulative_spend
FROM orders;
-- NTILE: divide rows into N equal buckets
-- Use case: segment customers into quartiles by spend
SELECT
customer_id,
total_spend,
NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile
FROM customer_totals;
CTEs — readable pipelines in SQL
Common Table Expressions (CTEs) let you break a complex query into named steps. This is how data engineers write SQL that other people can actually read.
WITH
-- Step 1: clean the raw orders
cleaned_orders AS (
SELECT
id AS order_id,
customer_id,
UPPER(TRIM(status)) AS status,
amount / 100.0 AS amount_usd,
DATE(created_at) AS order_date
FROM raw.orders
WHERE id IS NOT NULL
),
-- Step 2: compute per-customer aggregates
customer_stats AS (
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(amount_usd) AS total_spend,
MAX(order_date) AS last_order_date
FROM cleaned_orders
WHERE status = 'COMPLETED'
GROUP BY customer_id
),
-- Step 3: segment customers
customer_segments AS (
SELECT
customer_id,
total_spend,
order_count,
last_order_date,
CASE
WHEN total_spend >= 1000 THEN 'high_value'
WHEN total_spend >= 200 THEN 'mid_value'
ELSE 'low_value'
END AS segment
FROM customer_stats
)
SELECT * FROM customer_segments
ORDER BY total_spend DESC;
Each CTE is a named, reusable step. Unlike subqueries, CTEs are defined once and can be referenced multiple times in the same query. In dbt, each CTE step typically becomes its own model file — making the logic even more modular and testable.
Slowly Changing Dimensions (SCDs)
One of the most common data modelling challenges: how do you track how a record has changed over time? A customer's country, plan tier, or email address can change — but you often need to know what it was at the time of each transaction.
The standard approach is SCD Type 2: rather than overwriting the old record, you close it (add an effective_to date) and insert a new record.
-- SCD Type 2 customer dimension
CREATE TABLE dim_customer (
customer_key INT PRIMARY KEY, -- surrogate key
customer_id INT, -- natural key from source
email VARCHAR(255),
plan VARCHAR(50),
country CHAR(2),
effective_from DATE NOT NULL,
effective_to DATE, -- NULL means current record
is_current BOOLEAN NOT NULL DEFAULT TRUE
);
-- To find a customer's plan at order time:
SELECT
o.order_id,
o.customer_id,
o.amount,
o.order_date,
c.plan AS plan_at_order_time
FROM orders o
JOIN dim_customer c
ON o.customer_id = c.customer_id
AND o.order_date BETWEEN c.effective_from AND COALESCE(c.effective_to, '9999-12-31')
AND c.is_current = FALSE; -- include historical rows
NULL handling
NULL behaviour in SQL is one of the most common sources of silent bugs. Three rules to know:
- Any arithmetic with NULL returns NULL.
100 + NULL = NULL. UseCOALESCE(amount, 0)to substitute a default. - NULL comparisons always return false.
WHERE status = NULLmatches nothing. UseWHERE status IS NULL. - NULL in GROUP BY is treated as a single group. Multiple NULLs in a grouping column are grouped together — this is often not what you want.
-- Pattern: handle NULLs explicitly in aggregations
SELECT
customer_id,
COUNT(*) AS total_orders,
COUNT(discount_code) AS orders_with_discount, -- NULLs excluded automatically
SUM(COALESCE(discount_amount, 0)) AS total_discount,
AVG(NULLIF(delivery_days, 0)) AS avg_delivery_days -- exclude 0-day deliveries (same-day)
FROM orders
GROUP BY customer_id;
Date and time manipulation
Data engineers spend an enormous amount of time working with dates. The functions vary by engine, but the patterns are consistent:
-- Truncate to a period (group by week, month, quarter)
DATE_TRUNC('month', order_date) -- Snowflake, BigQuery, PostgreSQL
DATE_TRUNC(order_date, MONTH) -- BigQuery alternative syntax
-- Extract a component
EXTRACT(DOW FROM order_date) -- day of week (0=Sunday)
DATE_PART('hour', event_timestamp) -- hour of day
-- Add/subtract intervals
order_date + INTERVAL '7 days'
DATEADD('day', 7, order_date) -- Snowflake
-- Calculate differences
DATEDIFF('day', signup_date, first_order_date)
-- Convert timestamps to dates (strip time component)
DATE(event_timestamp)
-- Working with time zones
CONVERT_TIMEZONE('UTC', 'Europe/London', event_timestamp) -- Snowflake
AT TIME ZONE 'Europe/London' -- PostgreSQL
Joins and their subtle behaviours
Two join patterns that catch data engineers regularly:
-- Anti-join: rows in A that have no match in B
-- Useful for: finding customers who never ordered, finding gaps
SELECT c.customer_id
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;
-- Fan-out on many-to-many joins: always check for row multiplication
-- If one customer has 3 addresses, joining on customer_id triples every row
SELECT
o.order_id,
o.amount,
COUNT(*) OVER (PARTITION BY o.order_id) AS row_count -- > 1 indicates fan-out
FROM orders o
JOIN customer_addresses a ON o.customer_id = a.customer_id;
Whenever writing a join in a pipeline, verify the grain of the result. If you expect one row per order and you're joining to a table that can have multiple rows per customer, you'll silently multiply your data.
Writing SQL for pipelines vs queries
A one-off analysis query can be clever and compact. SQL in a pipeline has to prioritise different things:
- Explicit over implicit. Name every column instead of using
SELECT *. Future schema changes will break the pipeline in a visible, controlled way rather than silently adding unexpected columns. - Deterministic. The same input should always produce the same output. Avoid
CURRENT_TIMESTAMPorRAND()in transformation logic. - Testable. Write transformations so their logic can be unit-tested with small synthetic inputs. One assertion per rule: "all rows have a non-null order_id", "amount is always positive".
- Readable. The engineer who debugs this at midnight will not be you. Use CTEs, meaningful aliases, and one column per line in SELECT lists.