Data Engineering Fundamentals

01

The title sounds obvious — but ask five people at a tech company and you'll get five different answers. Here's the complete picture: what the role involves, how it differs from software engineering and data science, and why this discipline exists at all.

02

The anatomy of a data pipeline

Sources, ingestion, raw storage, transformation, serving: most data pipelines pass through the same five stages. This post maps them, explains what happens at each one, and shows where things go wrong.

03

Warehouses, lakes, and lakehouses — what they are and when each wins

Three architectural approaches to storing and querying data at scale, each built to solve the problems of the one before it. Here is why lakehouses are where the industry is converging.

04

Why Parquet? File formats explained

CSV, JSON, Parquet, Avro, ORC — what each format is optimised for and when to use it. Row vs columnar storage from first principles, with a clear decision rule for data engineering use cases.

05

Partitioning data at scale

Partitioning organises files by column value so queries can skip irrelevant data entirely. How partition pruning works, how to choose the right partition column, and why over-partitioning causes its own problems.

06

Predicate pushdown and column pruning — how query engines skip work

The techniques that make Parquet dramatically faster than CSV on the same hardware. Column pruning, row group statistics, bloom filters, and why sorting data within files amplifies all of them.

07

Delta Lake, Iceberg, Hudi — open table formats explained

Open table formats add ACID transactions, time travel, and schema enforcement on top of object storage. What Delta Lake, Iceberg, and Hudi each bring, how they differ, and how to choose.

08

Extracting data from anywhere — APIs, databases, and files

Before data can be transformed or analysed it has to be extracted from wherever it lives. Full vs incremental extraction, handling API rate limits and pagination, database extraction patterns, and file ingestion.

09

Change Data Capture explained

Query-based polling misses deletes, and it misses updates on tables that lack a reliable last-modified timestamp. Log-based CDC reads the database's write-ahead log directly and captures every committed insert, update, and delete in order. How it works, how to set it up, and when you need it.

10

Batch vs streaming — choosing the right ingestion model

The choice between batch and streaming shapes every downstream decision. A clear-eyed look at the tradeoffs, when micro-batch is the right middle ground, and how to make the call for your use case.

11

SQL for data engineers

Window functions, CTEs, slowly changing dimensions, NULL handling, date manipulation, and the join patterns that appear in every real pipeline. SQL through the lens of data engineering, not analytics.

12

dbt from first principles

How dbt turned SQL transformation into proper software development. Models, the ref() function, materialisation types, tests, sources, and the layered project structure that keeps transformations maintainable.

13

The medallion architecture — Bronze, Silver, Gold

The most widely adopted pattern for organising data in a lakehouse. What happens at each layer, why the separation of concerns matters, and how reprocessability from Bronze makes the whole pipeline recoverable.

14

Idempotency — the most important property a data pipeline can have

Pipelines fail. The question is whether running them twice produces duplicates or correct results. Idempotency — what it is, how to design for it with upserts and partition-based rewrites, and how to test it.

15

Orchestration fundamentals

What turns a collection of working scripts into a production data platform. DAGs, scheduling, dependency management, failure handling, and visibility — and how Airflow, Prefect, and Dagster each approach them.