Predicate pushdown and column pruning — how query engines skip work

Partitioning cuts down the amount of data a query engine reads by eliminating entire files. But within each file, there's a second layer of skipping: predicate pushdown and column pruning let the engine read only the rows and columns it actually needs.

These two techniques — combined with statistics stored inside the file — are what make Parquet dramatically faster than CSV for analytical queries, even on the same hardware, with the same data, with the same query.

Column pruning

Column pruning is straightforward: if a query only uses three of fifty columns, the query engine reads only the data for those three columns. This works because Parquet stores each column separately (columnar layout). The engine opens the file, reads the footer to find where each column's data is stored, and reads only the relevant column chunks.

-- This query reads only 2 of potentially 50+ columns
SELECT order_id, amount
FROM orders
WHERE order_date = '2024-01-15'

With a row-oriented format like CSV, this query would read all 50+ columns for every row and discard 48 of them. With Parquet, only order_id and amount are read from disk. For wide tables, this can reduce I/O by 90%+.

Predicate pushdown

Predicate pushdown means moving filter conditions ("predicates") as close to the data source as possible — pushing the filter into the file reader rather than reading all data and filtering afterwards.

Parquet files store statistics for each column within each row group (a horizontal slice of the file, typically 128MB):

Minimum value
Maximum value
Null count
Value count

These statistics are stored in the file footer, which is tiny (kilobytes). Reading the footer is essentially free. The engine reads the footer, checks the statistics for each row group against the query filter, and skips any row group where the filter cannot possibly match.

Table: orders (1 billion rows, partitioned by order_date)
Partition: order_date=2024-01-15 (3 Parquet files, each with 10 row groups)

Query: WHERE amount > 1000

Row group statistics (from file footer):
  Row group 1: amount min=0.50,  max=99.99   → skip (max < 1000)
  Row group 2: amount min=0.10,  max=450.00  → skip (max < 1000)
  Row group 3: amount min=50.00, max=9999.00 → read (max >= 1000, may have matches)
  Row group 4: amount min=0.01,  max=75.00   → skip (max < 1000)
  ...

In this example, most row groups are skipped without reading their data at all — just the footer scan. Only row groups where the maximum amount is >= 1000 need to be read.

Bloom filters

Min/max statistics work well for range queries (amount > 1000, date BETWEEN x AND y). They don't help for equality filters on high-cardinality columns like user IDs or UUIDs — the min/max range typically spans the entire possible range and no row groups can be skipped.

Bloom filters solve this for equality predicates. A bloom filter is a probabilistic data structure that can tell you, with zero false negatives, whether a value is definitely not present in a set. For each column (optionally — you have to enable this), Parquet can store a bloom filter per row group.

-- With a bloom filter on user_id, the engine can skip row groups that
-- definitely don't contain user_id = 'abc123', even for high-cardinality columns
SELECT * FROM events WHERE user_id = 'abc123'

Bloom filters add size to the Parquet file (small, but nonzero) and are most valuable for point lookups on high-cardinality ID columns. Most major Parquet writers (Spark, Iceberg, Delta Lake) support writing bloom filters — enable them on columns you frequently filter on with equality conditions.

Sorting within partitions

The effectiveness of predicate pushdown depends on how well data is sorted within the file. If rows with amount > 1000 are randomly distributed across all row groups, the min/max range for every row group will include values above 1000 — no row groups can be skipped.

If the data is sorted by amount within each file, row groups with low amounts will have a high max that's below 1000, and they'll be skipped cleanly. This is called data clustering or Z-ordering (for multi-column clustering).

Delta Lake's OPTIMIZE ... ZORDER BY (column) and Iceberg's rewrite_data_files ... sort_order implement this — they compact small files into larger ones while sorting the data to maximise predicate pushdown effectiveness.

-- Delta Lake: compact and sort by frequently-filtered columns
OPTIMIZE orders ZORDER BY (customer_id, status);

Dictionary encoding

One more technique worth understanding: dictionary encoding. When a column has low cardinality (a small number of distinct values — like a status column with values "pending", "completed", "cancelled"), Parquet automatically encodes it as a dictionary: a lookup table of distinct values, plus a compact array of integer indices.

Before dictionary encoding (status column):
pending, completed, completed, pending, cancelled, completed, ...

After dictionary encoding:
Dictionary: {0: "pending", 1: "completed", 2: "cancelled"}
Data:       [0, 1, 1, 0, 2, 1, ...]

The integer array compresses extremely well. A status column with 3 distinct values across a million rows might take 6MB as raw strings and 250KB after dictionary encoding and compression. This is automatic in Parquet — no configuration needed.

Putting it together

These techniques compose. A query that filters on date (partition pruning), reads 3 of 50 columns (column pruning), and filters amount > 1000 (predicate pushdown against row group statistics) might read 0.01% of the data that a naive scan would read. This is why the format choice matters so much — the same query, same hardware, same data can be 100–1000x faster with Parquet than with CSV.

You don't need to implement any of this yourself. Parquet stores the statistics; Spark, Trino, Athena, and DuckDB use them automatically. The engineering task is making sure your data is in Parquet (not CSV), partitioned correctly, and periodically optimised with Z-ordering/compaction when it matters for query performance.