If you ask ten data engineers what database they use, you'll likely get ten different answers: Snowflake, BigQuery, Databricks, a data lake on S3, an Iceberg lakehouse. These aren't just different products — they represent fundamentally different architectural approaches to the same problem.
To understand why so many options exist, you have to understand the problems each one was built to solve. This post traces that history and lands on why the industry is converging on lakehouses.
Why your production database isn't enough
Every business starts with a transactional database — PostgreSQL, MySQL, SQL Server. These are built for OLTP: Online Transaction Processing. They're optimised to handle thousands of individual reads and writes per second. When a user places an order, a row is inserted. When they update their address, a row is changed. Every operation touches a small number of rows.
Analytical queries are the opposite. An analyst asks: "What was total revenue by country last month, broken down by product category?" That query needs to scan millions of rows, aggregate them, and join several tables. Running it on your production database competes with live user traffic. At small scale this is fine. At any meaningful scale, it degrades performance for everyone.
This is the original problem that data warehouses were built to solve.
The data warehouse
A data warehouse is a database built for analytical queries (OLAP: Online Analytical Processing). The key architectural difference from OLTP systems is columnar storage.
In a row-based database (OLTP), each row is stored contiguously on disk. Reading a single column means reading every row in full and discarding what you don't need. In a columnar database (OLAP), each column is stored contiguously. If your query only touches three columns, the engine reads only those three columns — skipping the rest entirely. For wide tables with dozens of columns, this dramatically reduces I/O.
Row store (OLTP):
[id=1, name=Alice, country=UK, revenue=100]
[id=2, name=Bob, country=US, revenue=200]
Column store (OLAP):
[id: 1, 2]
[name: Alice, Bob]
[country: UK, US]
[revenue: 100, 200]
Query: SELECT SUM(revenue) → reads only the revenue column
Data warehouses like Teradata (1979), then Redshift, Snowflake, and BigQuery brought this architecture to the cloud. Extract data from production systems, load it into the warehouse, run analytics there. Fast queries, no production impact.
The problem with warehouses: they're expensive at scale and lock data into proprietary formats. Storing petabytes in Snowflake costs significantly more than storing the same data in object storage. And the data is only accessible via the warehouse's query engine — you can't point Spark or a Python script directly at it.
The data lake
Around 2010, Hadoop popularised a different idea: store everything as files in a distributed filesystem (HDFS), query it with MapReduce or later Spark. When cloud object storage (S3, GCS) arrived and became cheaper than on-premise storage by an order of magnitude, this architecture became the data lake.
The data lake proposition: dump all your raw data into object storage as files (CSV, JSON, Parquet), and query it on-demand with distributed compute engines. Cheap storage, flexible — you can point any tool at the same files.
The problems with data lakes:
- No ACID transactions. If two jobs write to the same table simultaneously, you get corrupted or inconsistent results. Object storage doesn't have row-level locking.
- No schema enforcement. Anyone can write any schema to any path. Over time, data lakes become "data swamps" — enormous collections of files that nobody trusts.
- Slow updates and deletes. Object storage is designed for immutable files. Updating a single row means rewriting the entire Parquet file. Deleting records for GDPR compliance is painful.
- No time travel. If a job overwrites your data with bad values, the old version is gone.
The lakehouse
The lakehouse emerged as the answer: keep the cheap, open storage of the data lake, but add a metadata and transaction layer on top that gives you warehouse features.
The key innovation is the open table format: a metadata layer that sits above your Parquet files and tracks which files belong to which table version, enforces schemas, and manages ACID transactions. The three main formats — Delta Lake, Apache Iceberg, and Apache Hudi — all implement this idea differently (covered in depth in Post 7).
What a lakehouse gives you:
- ACID transactions — concurrent reads and writes are safe. No more corrupted tables from failed jobs.
- Time travel — query the table as it was at any point in history. Accidentally overwrote data? Roll back.
- Schema enforcement and evolution — the table has a schema. Adding columns is safe. Breaking changes are caught.
- Open formats — the data is still Parquet files in S3. Any engine that understands the table format can query it: Spark, Trino, DuckDB, BigQuery's federated query, Athena.
- Cheap storage — object storage prices, not warehouse prices.
Lakehouse architecture:
Object storage (S3 / GCS / ADLS)
├── _delta_log/ ← transaction log (Delta Lake)
│ ├── 000.json
│ └── 001.json
├── part-0001.parquet
└── part-0002.parquet
Any engine reads the log first, then the relevant Parquet files.
How to choose
In practice, the right choice depends on your scale and existing infrastructure:
Use a managed warehouse (Snowflake, BigQuery) if you want something that just works, your team is SQL-first, and storage costs at your current scale are not a significant concern. These products have excellent query performance, built-in governance features, and require less operational overhead.
Use a lakehouse (Databricks, open Iceberg on S3 + Trino/Spark) if you have large data volumes where storage costs matter, you need to run both SQL analytics and ML workloads on the same data, or you need to avoid vendor lock-in and want portability across query engines.
Use both — many mature data teams do. Raw and transformed data live in a lakehouse (cheap, flexible), but a curated subset is synced to a warehouse for BI tools and ad-hoc analyst queries (fast, convenient).
Where this series goes next
Now that you understand what these storage systems are, the next logical question is: how is data actually stored inside them? The file format — CSV, JSON, Parquet — shapes query performance more than almost any other decision. Post 4 covers why.