Tutorial Data Engineering

Warehouses, lakes, and lakehouses — what they are and when each wins

SkyDeLake Jun 28, 2026 5 min read 1 views

Data Engineering Fundamentals

View full path →
  1. What data engineers actually do
  2. The anatomy of a data pipeline
  3. Warehouses, lakes, and lakehouses — what they are and when each wins
  4. Why Parquet? File formats explained
  5. Partitioning data at scale
  6. Predicate pushdown and column pruning — how query engines skip work
  7. Delta Lake, Iceberg, Hudi — open table formats explained
  8. Extracting data from anywhere — APIs, databases, and files
  9. Change Data Capture explained
  10. Batch vs streaming — choosing the right ingestion model
  11. SQL for data engineers
  12. dbt from first principles
  13. The medallion architecture — Bronze, Silver, Gold
  14. Idempotency — the most important property a data pipeline can have
  15. Orchestration fundamentals

If you ask ten data engineers what database they use, you'll likely get ten different answers: Snowflake, BigQuery, Databricks, a data lake on S3, an Iceberg lakehouse. These aren't just different products — they represent fundamentally different architectural approaches to the same problem.

To understand why so many options exist, you have to understand the problems each one was built to solve. This post traces that history and lands on why the industry is converging on lakehouses.

Why your production database isn't enough

Every business starts with a transactional database — PostgreSQL, MySQL, SQL Server. These are built for OLTP: Online Transaction Processing. They're optimised to handle thousands of individual reads and writes per second. When a user places an order, a row is inserted. When they update their address, a row is changed. Every operation touches a small number of rows.

Analytical queries are the opposite. An analyst asks: "What was total revenue by country last month, broken down by product category?" That query needs to scan millions of rows, aggregate them, and join several tables. Running it on your production database competes with live user traffic. At small scale this is fine. At any meaningful scale, it degrades performance for everyone.

This is the original problem that data warehouses were built to solve.

The data warehouse

A data warehouse is a database built for analytical queries (OLAP: Online Analytical Processing). The key architectural difference from OLTP systems is columnar storage.

In a row-based database (OLTP), each row is stored contiguously on disk. Reading a single column means reading every row in full and discarding what you don't need. In a columnar database (OLAP), each column is stored contiguously. If your query only touches three columns, the engine reads only those three columns — skipping the rest entirely. For wide tables with dozens of columns, this dramatically reduces I/O.

Row store (OLTP):
[id=1, name=Alice, country=UK, revenue=100]
[id=2, name=Bob,   country=US, revenue=200]

Column store (OLAP):
[id: 1, 2]
[name: Alice, Bob]
[country: UK, US]
[revenue: 100, 200]

Query: SELECT SUM(revenue) → reads only the revenue column

Data warehouses like Teradata (1979), then Redshift, Snowflake, and BigQuery brought this architecture to the cloud. Extract data from production systems, load it into the warehouse, run analytics there. Fast queries, no production impact.

The problem with warehouses: they're expensive at scale and lock data into proprietary formats. Storing petabytes in Snowflake costs significantly more than storing the same data in object storage. And the data is only accessible via the warehouse's query engine — you can't point Spark or a Python script directly at it.

The data lake

Around 2010, Hadoop popularised a different idea: store everything as files in a distributed filesystem (HDFS), query it with MapReduce or later Spark. When cloud object storage (S3, GCS) arrived and became cheaper than on-premise storage by an order of magnitude, this architecture became the data lake.

The data lake proposition: dump all your raw data into object storage as files (CSV, JSON, Parquet), and query it on-demand with distributed compute engines. Cheap storage, flexible — you can point any tool at the same files.

The problems with data lakes:

The lakehouse

The lakehouse emerged as the answer: keep the cheap, open storage of the data lake, but add a metadata and transaction layer on top that gives you warehouse features.

The key innovation is the open table format: a metadata layer that sits above your Parquet files and tracks which files belong to which table version, enforces schemas, and manages ACID transactions. The three main formats — Delta Lake, Apache Iceberg, and Apache Hudi — all implement this idea differently (covered in depth in Post 7).

What a lakehouse gives you:

Lakehouse architecture:

Object storage (S3 / GCS / ADLS)
├── _delta_log/          ← transaction log (Delta Lake)
│   ├── 000.json
│   └── 001.json
├── part-0001.parquet
└── part-0002.parquet

Any engine reads the log first, then the relevant Parquet files.

How to choose

In practice, the right choice depends on your scale and existing infrastructure:

Use a managed warehouse (Snowflake, BigQuery) if you want something that just works, your team is SQL-first, and storage costs at your current scale are not a significant concern. These products have excellent query performance, built-in governance features, and require less operational overhead.

Use a lakehouse (Databricks, open Iceberg on S3 + Trino/Spark) if you have large data volumes where storage costs matter, you need to run both SQL analytics and ML workloads on the same data, or you need to avoid vendor lock-in and want portability across query engines.

Use both — many mature data teams do. Raw and transformed data live in a lakehouse (cheap, flexible), but a curated subset is synced to a warehouse for BI tools and ad-hoc analyst queries (fast, convenient).

Where this series goes next

Now that you understand what these storage systems are, the next logical question is: how is data actually stored inside them? The file format — CSV, JSON, Parquet — shapes query performance more than almost any other decision. Post 4 covers why.