Why Parquet? File formats explained

Every piece of data engineering advice eventually says "use Parquet." But if you don't know why, you'll use it blindly — and you'll make poor format choices at the edges where it actually matters.

Data format is one of the highest-leverage decisions in a data pipeline. The wrong format can make a query ten times slower, a job consume ten times the memory, or a schema change break downstream systems. This post explains the tradeoffs from first principles.

The two storage models

Before comparing specific formats, you need to understand the fundamental divide: row-oriented vs column-oriented storage.

Row-oriented formats store each record's fields together. To read a record, you read one contiguous block. CSV and JSON are row-oriented.

Column-oriented formats store all values for each field together. To read a column across all records, you read one contiguous block. Parquet and ORC are column-oriented.

Dataset: 3 orders with 4 fields each

Row-oriented (CSV):
1,Alice,UK,100
2,Bob,US,200
3,Carol,DE,150

Column-oriented (Parquet, conceptually):
IDs:      [1, 2, 3]
Names:    [Alice, Bob, Carol]
Country:  [UK, US, DE]
Revenue:  [100, 200, 150]

The practical consequence: if your query only needs the Revenue column, a row-oriented format reads all four columns for every row and discards three of them. A column-oriented format reads only the Revenue column. At a billion rows across 50 columns, this difference is enormous.

CSV

CSV is ubiquitous and human-readable. It's also one of the worst formats for production data pipelines.

Problems with CSV:

No schema. Every consumer has to infer or assume column names and types. A column of dates might be 2024-01-15, 01/15/2024, or January 15, 2024 — all valid CSV, all different to parse.
No type enforcement. Everything is a string. null, NULL, None, , and 0 are all different representations of "missing value" that consumers handle differently.
Delimiter collisions. A CSV with comma-separated values breaks when a value contains a comma. Quoting handles this, but inconsistently across tools.
Row-oriented. No column pruning, no predicate pushdown, no compression of similar values.
No native compression. You can gzip a CSV file, but then you have to decompress the entire file to read any part of it.

When to use CSV: data exchange with external parties who can't handle binary formats, human-readability requirements, or very small datasets where the overhead doesn't matter.

JSON

JSON is the lingua franca of APIs. It supports nested structures and is self-describing (keys are included with every record). These properties make it good for event streaming and API responses.

The problem with JSON for analytical workloads: it's verbose. Storing the key with every record means that for a billion records, you store the string "customer_id" a billion times. JSON files can be 3–5x larger than equivalent Parquet files. Parsing JSON is also CPU-intensive compared to reading binary formats.

JSON Lines (one JSON object per line, no outer array) is more pipeline-friendly than standard JSON — it allows streaming reads and partial processing without loading the whole file into memory.

When to use JSON: raw landing zone (preserve the source format exactly), event streaming where records have variable schemas, API integrations.

Parquet

Parquet is the standard format for analytical workloads. It was designed specifically for the problem that CSV and JSON fail at: efficiently storing and querying large datasets with many columns.

Key properties:

Columnar storage. Each column is stored separately. Reading a query that touches 3 of 50 columns reads roughly 6% of the data. This is the primary performance advantage.

Efficient compression. Because similar values are stored together (all the country codes, all the status values), run-length encoding and dictionary encoding achieve very high compression ratios. A column of status values like [completed, completed, pending, completed, cancelled] can be encoded as a dictionary of 3 values plus a compact integer array — far smaller than storing the string five times.

Column statistics. Parquet files store min/max values for each column in each row group (a horizontal slice of the file). Query engines use these statistics for predicate pushdown: if a filter says WHERE date = '2024-01-15' and the row group's date range is 2024-02-01 to 2024-02-28, the engine skips that row group entirely without reading any of its data.

Schema embedded. The schema (column names and types) is stored in the file footer. No separate schema file needed. Readers always know the types.

Parquet file structure:
┌───────────────────────────────────┐
│  Row Group 1 (e.g. rows 1–100k)  │
│  ├── Column: order_id [int64]     │
│  │   ├── min: 1                   │
│  │   ├── max: 99982               │
│  │   └── data: [compressed pages] │
│  └── Column: amount [double]      │
│      ├── min: 1.50                │
│      ├── max: 9999.99             │
│      └── data: [compressed pages] │
├───────────────────────────────────┤
│  Row Group 2 (rows 100k–200k)    │
│  ...                              │
├───────────────────────────────────┤
│  Footer (schema + row group meta) │
└───────────────────────────────────┘

When to use Parquet: everything that's analytical in nature — data lake storage, data warehouse staging, intermediate pipeline outputs, feature stores. Default to Parquet unless you have a specific reason not to.

ORC

ORC (Optimised Row Columnar) is similar to Parquet — columnar, compressed, with embedded schema and statistics. ORC originated from the Hive ecosystem (Apache Hadoop). Parquet originated from the Spark/Impala ecosystem.

In practice, they perform similarly. The main reason to choose ORC over Parquet is if your primary query engine is Hive, which historically had better ORC support. For most modern stacks (Spark, Trino, Athena, BigQuery federated, DuckDB), Parquet support is equal or better.

Avro

Avro is a row-oriented binary format, which makes it unusual. It's designed for streaming: fast row-level serialisation and deserialisation, with schema stored separately in a schema registry. The schema registry lets producers and consumers evolve schemas independently — adding nullable columns, renaming fields — without breaking existing producers or consumers.

When to use Avro: Kafka topics (Avro is the standard serialisation format for Kafka with a schema registry), event streams where you need strong schema evolution guarantees. Not a good choice for analytical/OLAP queries.

The decision rule

Landing zone / raw storage: preserve source format (often JSON or CSV), or convert to Parquet immediately on landing
All analytical storage (data lake, warehouse staging, intermediate models): Parquet
Event streaming / Kafka: Avro with schema registry
External data exchange with non-technical partners: CSV (with documentation)
ORC: only if your primary engine requires it

If you're unsure, use Parquet. The default is the right call for the vast majority of data engineering work.