Partitioning data at scale

Most analytical data grows in one dimension: time. Every new row has a timestamp — when the order was placed, when the event happened, when the record was updated. And almost every analytical query filters by time: "last 30 days", "Q1 2024", "since we launched this feature".

Partitioning is the technique that makes this pattern efficient. It's one of the most impactful performance and cost optimisations in data engineering, and it's also one of the most misused.

What partitioning is

Partitioning means physically organising data files by the value of one or more columns, so that queries can skip entire files when they don't match a filter.

Without partitioning, a table is a flat collection of files. A query with WHERE order_date = '2024-01-15' has to scan every file to find matching rows.

With date partitioning, the same table is stored as a tree of directories, one per date:

s3://data-lake/orders/
├── order_date=2024-01-13/ → part-0001.parquet, part-0002.parquet
├── order_date=2024-01-14/ → part-0001.parquet
└── order_date=2024-01-15/ → part-0001.parquet, part-0002.parquet

A query with WHERE order_date = '2024-01-15' reads only the files in the order_date=2024-01-15/ directory. The files for Jan 13 and 14 are not touched at all. If your table has 3 years of data, the query reads 1 day out of 1095 — roughly 0.09% of the data.

Partition pruning

The act of skipping partitions that don't match the query filter is called partition pruning. Query engines (Spark, Trino, Athena, BigQuery) implement this automatically when they detect a filter on a partition column. You don't write any special SQL — you just filter on the partition column and the engine handles it.

-- This query scans only the 2024-01-15 partition
SELECT COUNT(*), SUM(amount)
FROM orders
WHERE order_date = '2024-01-15';

-- This query scans all partitions (no partition filter)
SELECT COUNT(*), SUM(amount)
FROM orders
WHERE customer_id = 12345;

The second query is a full table scan regardless of partitioning — partitioning only helps when you filter on the partition column.

Choosing the right partition column

The partition column should match how the data is most commonly queried. The most common choice is a date or timestamp column, because:

Almost all analytical queries have a time dimension
New data arrives in roughly date order, so each partition is written once and rarely updated
The cardinality (number of distinct values) is controllable — daily partitions give you 365 partitions per year, which is manageable

Secondary partition columns can be added for additional pruning. For example, partitioning by (order_date, region) means a query filtering on both date and region reads only the relevant date+region combinations. But adding partition columns multiplies the number of directories — over-partitioning creates its own problems.

The small files problem

Over-partitioning is the most common partitioning mistake. Every partition should contain files large enough for the query engine to work with efficiently — typically 128MB to 1GB per file. When partitions are too granular (partitioning by hour when you have low data volumes, for example), each partition might contain only a few small files.

Query engines handle large numbers of small files poorly:

Each file requires a separate I/O operation
Object storage charges per API request — millions of file opens add up
Planning the query (listing all partitions, opening all files) takes longer than actually executing it

A useful heuristic: if a partition regularly contains files smaller than 32MB, the partition granularity is too fine. Coarsen the partition (daily instead of hourly, monthly instead of daily for low-volume tables).

Compaction

When data is written to a table in small batches — streaming ingestion, frequent micro-batch jobs — many small files accumulate within each partition over time. Compaction is the process of periodically merging these small files into larger ones.

Before compaction (partition order_date=2024-01-15):
  part-00000001.parquet  2 MB
  part-00000002.parquet  3 MB
  part-00000003.parquet  1 MB
  ... (200 more files averaging 2 MB each)

After compaction:
  part-optimised-001.parquet  200 MB
  part-optimised-002.parquet  195 MB

Delta Lake and Apache Iceberg have built-in compaction commands (OPTIMIZE in Delta Lake, rewrite_data_files in Iceberg). Run them on a schedule — typically daily or weekly — for partitions that receive frequent small writes.

Partition evolution

One limitation of Hive-style partitioning (directory-based): changing your partition scheme requires rewriting all existing data. If you originally partitioned by day and now want to partition by month, you have to move every file.

Apache Iceberg solves this with hidden partitioning: partitioning is tracked in the table metadata, not the file paths. Changing the partition scheme (called partition evolution in Iceberg) is a metadata-only operation — existing files don't move, new files use the new scheme, and queries transparently handle both.

Partitioning in practice

Default recommendations for most data engineering use cases:

Partition analytical tables by date (daily granularity for most tables)
Add a second partition dimension only if the vast majority of queries filter on it
Target 256MB–1GB per partition per file after compaction
Run compaction regularly for tables with streaming or micro-batch writes
Prefer Iceberg for new tables — hidden partitioning avoids the rewrite cost if you ever need to change partition strategy

Partitioning is free to add and has no downside if designed correctly. The downside comes from poor partition choices — either too coarse (little pruning benefit) or too fine (small files problem).