Tutorial Data Engineering

Partitioning data at scale

SkyDeLake Jun 28, 2026 4 min read 0 views

Data Engineering Fundamentals

View full path →
  1. What data engineers actually do
  2. The anatomy of a data pipeline
  3. Warehouses, lakes, and lakehouses — what they are and when each wins
  4. Why Parquet? File formats explained
  5. Partitioning data at scale
  6. Predicate pushdown and column pruning — how query engines skip work
  7. Delta Lake, Iceberg, Hudi — open table formats explained
  8. Extracting data from anywhere — APIs, databases, and files
  9. Change Data Capture explained
  10. Batch vs streaming — choosing the right ingestion model
  11. SQL for data engineers
  12. dbt from first principles
  13. The medallion architecture — Bronze, Silver, Gold
  14. Idempotency — the most important property a data pipeline can have
  15. Orchestration fundamentals

Most analytical data grows in one dimension: time. Every new row has a timestamp — when the order was placed, when the event happened, when the record was updated. And almost every analytical query filters by time: "last 30 days", "Q1 2024", "since we launched this feature".

Partitioning is the technique that makes this pattern efficient. It's one of the most impactful performance and cost optimisations in data engineering, and it's also one of the most misused.

What partitioning is

Partitioning means physically organising data files by the value of one or more columns, so that queries can skip entire files when they don't match a filter.

Without partitioning, a table is a flat collection of files. A query with WHERE order_date = '2024-01-15' has to scan every file to find matching rows.

With date partitioning, the same table is stored as a tree of directories, one per date:

s3://data-lake/orders/
├── order_date=2024-01-13/ → part-0001.parquet, part-0002.parquet
├── order_date=2024-01-14/ → part-0001.parquet
└── order_date=2024-01-15/ → part-0001.parquet, part-0002.parquet

A query with WHERE order_date = '2024-01-15' reads only the files in the order_date=2024-01-15/ directory. The files for Jan 13 and 14 are not touched at all. If your table has 3 years of data, the query reads 1 day out of 1095 — roughly 0.09% of the data.

Partition pruning

The act of skipping partitions that don't match the query filter is called partition pruning. Query engines (Spark, Trino, Athena, BigQuery) implement this automatically when they detect a filter on a partition column. You don't write any special SQL — you just filter on the partition column and the engine handles it.

-- This query scans only the 2024-01-15 partition
SELECT COUNT(*), SUM(amount)
FROM orders
WHERE order_date = '2024-01-15';

-- This query scans all partitions (no partition filter)
SELECT COUNT(*), SUM(amount)
FROM orders
WHERE customer_id = 12345;

The second query is a full table scan regardless of partitioning — partitioning only helps when you filter on the partition column.

Choosing the right partition column

The partition column should match how the data is most commonly queried. The most common choice is a date or timestamp column, because:

Secondary partition columns can be added for additional pruning. For example, partitioning by (order_date, region) means a query filtering on both date and region reads only the relevant date+region combinations. But adding partition columns multiplies the number of directories — over-partitioning creates its own problems.

The small files problem

Over-partitioning is the most common partitioning mistake. Every partition should contain files large enough for the query engine to work with efficiently — typically 128MB to 1GB per file. When partitions are too granular (partitioning by hour when you have low data volumes, for example), each partition might contain only a few small files.

Query engines handle large numbers of small files poorly:

A useful heuristic: if a partition regularly contains files smaller than 32MB, the partition granularity is too fine. Coarsen the partition (daily instead of hourly, monthly instead of daily for low-volume tables).

Compaction

When data is written to a table in small batches — streaming ingestion, frequent micro-batch jobs — many small files accumulate within each partition over time. Compaction is the process of periodically merging these small files into larger ones.

Before compaction (partition order_date=2024-01-15):
  part-00000001.parquet  2 MB
  part-00000002.parquet  3 MB
  part-00000003.parquet  1 MB
  ... (200 more files averaging 2 MB each)

After compaction:
  part-optimised-001.parquet  200 MB
  part-optimised-002.parquet  195 MB

Delta Lake and Apache Iceberg have built-in compaction commands (OPTIMIZE in Delta Lake, rewrite_data_files in Iceberg). Run them on a schedule — typically daily or weekly — for partitions that receive frequent small writes.

Partition evolution

One limitation of Hive-style partitioning (directory-based): changing your partition scheme requires rewriting all existing data. If you originally partitioned by day and now want to partition by month, you have to move every file.

Apache Iceberg solves this with hidden partitioning: partitioning is tracked in the table metadata, not the file paths. Changing the partition scheme (called partition evolution in Iceberg) is a metadata-only operation — existing files don't move, new files use the new scheme, and queries transparently handle both.

Partitioning in practice

Default recommendations for most data engineering use cases:

Partitioning is free to add and has no downside if designed correctly. The downside comes from poor partition choices — either too coarse (little pruning benefit) or too fine (small files problem).