The title sounds obvious — but ask five people at a tech company and you'll get five different answers. Here's the complete picture: what the role involves, how it differs from software engineering and data science, and why this discipline exists at all.
Sources, ingestion, raw storage, transformation, serving — every data pipeline has the same five stages. This post maps them, explains what happens at each one, and shows where things go wrong.
Data warehouses, data lakes, and lakehouses: three architectural approaches to storing and querying data at scale, each built to solve the problems of the one before it. Here is why the industry is converging on lakehouses.
CSV, JSON, Parquet, Avro, ORC — what each format is optimised for and when to use it. Row vs columnar storage from first principles, with a clear decision rule for data engineering use cases.
Partitioning organises files by column value so queries can skip irrelevant data entirely. How partition pruning works, how to choose the right partition column, and why over-partitioning causes its own problems.
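To make the idea of skipping data concrete, here is a minimal sketch of partition pruning over a Hive-style directory layout. The file paths, the `event_date` column, and the helper names `partition_values` and `prune` are all hypothetical, chosen only for illustration. Real query engines do this during planning, but the principle is the same: decide which files to open from the path alone.

```python
# Hypothetical Hive-style layout: one directory per value of the partition column.
FILES = [
    "events/event_date=2024-01-01/part-0.parquet",
    "events/event_date=2024-01-01/part-1.parquet",
    "events/event_date=2024-01-02/part-0.parquet",
    "events/event_date=2024-01-03/part-0.parquet",
]


def partition_values(path):
    """Parse key=value directory segments out of a file path."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


def prune(files, column, wanted):
    """Keep only files whose partition value for `column` equals `wanted`."""
    return [f for f in files if partition_values(f).get(column) == wanted]


# A filter on event_date needs one file out of four; the rest are never opened.
selected = prune(FILES, "event_date", "2024-01-02")
print(selected)
```

A filter on any column other than the partition column gets no benefit from this layout, which is why the choice of partition column matters.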
The techniques that make Parquet dramatically faster than CSV on the same hardware. Column pruning, row group statistics, Bloom filters, and why sorting data within files amplifies all of them.
Open table formats add ACID transactions, time travel, and schema enforcement on top of object storage. What Delta Lake, Iceberg, and Hudi each bring, how they differ, and how to choose.
Before data can be transformed or analysed it has to be extracted from wherever it lives. Full vs incremental extraction, handling API rate limits and pagination, database extraction patterns, and file ingestion.
Polling misses deletes entirely, and it misses updates on tables that have no modification timestamp. CDC reads the database write-ahead log directly, capturing every committed insert, update, and delete in the order it happened. How it works, how to set it up, and when you need it.
The choice between batch and streaming shapes every downstream decision. A clear-eyed look at the tradeoffs, when micro-batch is the right middle ground, and how to make the call for your use case.
Window functions, CTEs, slowly changing dimensions, NULL handling, date manipulation, and the join patterns that appear in every real pipeline. SQL through the lens of data engineering, not analytics.
How dbt turned SQL transformation into proper software development. Models, the ref() function, materialisation types, tests, sources, and the layered project structure that keeps transformations maintainable.
Bronze, Silver, Gold: the most widely adopted pattern for organising data in a lakehouse. What happens at each layer, why the separation of concerns matters, and how reprocessability from Bronze makes the whole pipeline recoverable.
Pipelines fail. The question is whether running them twice produces duplicates or correct results. Idempotency — what it is, how to design for it with upserts and partition-based rewrites, and how to test it.
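As a rough illustration of the upsert approach, the sketch below loads the same batch twice into an in-memory SQLite table and checks that the second run changes nothing. The `orders` table, its columns, and the `load` and `snapshot` helpers are assumptions made up for this example, and the `ON CONFLICT ... DO UPDATE` syntax assumes SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")

# Hypothetical batch produced by one pipeline run.
batch = [(1, 10.0), (2, 25.5), (3, 7.25)]


def load(conn, rows):
    """Upsert rows keyed on order_id, so a rerun overwrites instead of duplicating."""
    conn.executemany(
        "INSERT INTO orders (order_id, amount) VALUES (?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount",
        rows,
    )
    conn.commit()


def snapshot(conn):
    """Return the full table contents in a stable order for comparison."""
    return conn.execute(
        "SELECT order_id, amount FROM orders ORDER BY order_id"
    ).fetchall()


load(conn, batch)
first = snapshot(conn)

load(conn, batch)  # simulate a retry after a failure
second = snapshot(conn)

# The idempotency test: running twice leaves the table exactly as one run did.
assert first == second
print(second)
```

With a plain `INSERT` and no primary key, the second `load` would leave six rows instead of three, which is the duplicate problem the upsert avoids.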
What turns a collection of working scripts into a production data platform. DAGs, scheduling, dependency management, failure handling, and visibility — and how Airflow, Prefect, and Dagster each approach them.