
What data engineers actually do

The SkyDeLake Admin, Jun 28, 2026

Data Engineering Fundamentals

  1. What data engineers actually do
  2. The anatomy of a data pipeline
  3. Warehouses, lakes, and lakehouses — what they are and when each wins
  4. Why Parquet? File formats explained
  5. Partitioning data at scale
  6. Predicate pushdown and column pruning — how query engines skip work
  7. Delta Lake, Iceberg, Hudi — open table formats explained
  8. Extracting data from anywhere — APIs, databases, and files
  9. Change Data Capture explained
  10. Batch vs streaming — choosing the right ingestion model
  11. SQL for data engineers
  12. dbt from first principles
  13. The medallion architecture — Bronze, Silver, Gold
  14. Idempotency — the most important property a data pipeline can have
  15. Orchestration fundamentals

The title sounds like it should be obvious — data, engineering, done. But ask five people at a tech company what a data engineer does, and you'll likely get five different answers. Some will say "they build ETL pipelines." Others will say "they manage the data warehouse." A few will shrug and say "basically a backend engineer who works with Spark."

They're all partially right. This post gives you the complete picture — what the role actually involves, how it differs from adjacent disciplines, and why it exists at all.

The one-sentence version

A data engineer's job is to move data from where it is to where it needs to be — reliably, at scale, and on time.

Everything else — the tools, the frameworks, the architecture decisions — is in service of that one sentence. If the data doesn't arrive, arrives wrong, or arrives three days late, the data scientist can't model it, the analyst can't report on it, and the business decision doesn't get made. The data engineer is the person who makes sure none of that happens.

Why this role exists

Data engineering as a distinct discipline is relatively recent. A decade ago, most companies were small enough that a single backend engineer could write the SQL to power reports. Three things changed that:


Data engineering is the discipline that emerged to manage these consequences.

The three core problems

At root, a data engineer is always solving one of three problems.

1. Getting data out of one system and into another

Source systems — your application's database, your SaaS tools, your event streams — were not designed to share their data efficiently. They need to be queried, polled, or subscribed to, and the data needs to be extracted in a form that downstream systems can use.

This is harder than it sounds. APIs have rate limits. Large exports can hold locks on a production database or add load to it. Real-time event streams need careful handling to avoid losing or duplicating records. Getting data out cleanly and completely is its own engineering problem.
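To make the rate-limit problem concrete, here is a minimal sketch of pulling every record from a paginated, rate-limited API. The `fetch_page` callable, the `RateLimited` exception, and the cursor-based paging are all assumptions made for illustration; a real client library will expose its own equivalents.

```python
import time
from typing import Callable, Iterator, Optional


class RateLimited(Exception):
    """Hypothetical error raised when the API asks the caller to slow down."""

    def __init__(self, retry_after: float) -> None:
        super().__init__(f"rate limited; retry after {retry_after}s")
        self.retry_after = retry_after


# One page of results: (records, cursor for the next page or None when done).
Page = tuple[list[dict], Optional[str]]


def extract_all(
    fetch_page: Callable[[Optional[str]], Page],
    sleep: Callable[[float], None] = time.sleep,
    max_retries: int = 5,
) -> Iterator[dict]:
    """Yield every record, following cursors and waiting out rate limits."""
    cursor: Optional[str] = None
    retries = 0
    while True:
        try:
            records, next_cursor = fetch_page(cursor)
        except RateLimited as exc:
            retries += 1
            if retries > max_retries:
                raise  # give up loudly rather than loop forever
            sleep(exc.retry_after)  # wait as long as the API asked
            continue  # retry the same page, so no records are skipped
        retries = 0
        yield from records
        if next_cursor is None:
            return  # last page reached
        cursor = next_cursor
```

The cursor only advances after a page has been read successfully, so a rate-limited request is retried rather than skipped. That is what keeps the extract complete.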

2. Making raw data clean and useful

Raw data is almost never in the shape that analysts or models need. Columns are named inconsistently across systems. The same customer might appear under three different IDs in three different tables. Revenue figures might be stored in cents in one place and dollars in another. A timestamp column might be in UTC in one table and local time in another.

Transforming raw, messy source data into clean, consistent, analysis-ready tables is a large part of the day-to-day work — and where most of the business logic lives.
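As a small illustration of the unit and timezone mismatches described above, the sketch below normalises amounts to dollars and timestamps to UTC. The function names, the `unit` labels, and the fixed UTC offset are assumptions for the example; a fixed offset ignores daylight saving time, so a real pipeline would more likely use named time zones.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal


def to_dollars(amount: str, unit: str) -> Decimal:
    """Convert an amount recorded in 'cents' or 'dollars' to dollars."""
    value = Decimal(amount)  # Decimal avoids float rounding on money
    if unit == "cents":
        return value / 100
    if unit == "dollars":
        return value
    raise ValueError(f"unknown unit: {unit!r}")


def to_utc(timestamp: str, utc_offset_hours: int) -> datetime:
    """Interpret a naive ISO timestamp at the given offset and convert to UTC."""
    source_tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.fromisoformat(timestamp).replace(tzinfo=source_tz)
    return local.astimezone(timezone.utc)


# One source stores cents, another stores dollars; both end up comparable.
assert to_dollars("1999", "cents") == to_dollars("19.99", "dollars")

# 09:30 recorded at UTC-5 is 14:30 UTC.
print(to_utc("2026-06-28T09:30:00", -5).isoformat())
```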

3. Making both of the above reliable

A pipeline that works once is a script. A pipeline that works every day for three years, retries gracefully on failure, alerts when something goes wrong, and doesn't silently corrupt data — that's engineering.

Most of the craft in data engineering lives in this third problem. It's the difference between a prototype and a production system.
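One ingredient of that reliability, retrying on failure and alerting when retries run out, can be sketched in a few lines. This is an illustration under assumptions, not a production retry policy: `run_with_retries` and its parameters are hypothetical names, and `alert` stands in for whatever paging or messaging system a team actually uses.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    alert: Callable[[str], None] = print,  # placeholder for a real alert channel
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying with exponential backoff; alert and re-raise at the end."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts:
                alert(f"task failed after {attempt} attempts: {exc}")
                raise  # never swallow the final failure
            # Wait 1x, 2x, 4x, ... the base delay between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Retrying is only safe when the task can run twice without corrupting data, which is why idempotency gets its own post later in the series.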

How it differs from software engineering

Software engineers build systems that serve users. A backend engineer builds the API your application calls; a frontend engineer builds the interface users interact with. The "customer" of a software engineer's work is an end user doing something in a product.

A data engineer's customer is downstream data consumers — analysts, data scientists, machine learning models, and business intelligence tools. The output isn't a user-facing feature; it's a dataset, a pipeline, or a data platform that other people build on top of.

This creates a meaningfully different set of engineering priorities:


Data engineers do write a lot of code, and Python is now the dominant language alongside SQL, but the patterns are different. You're less often building an API and more often writing a job that reads a billion rows, transforms them, and writes them somewhere else.

How it differs from data science

The simplest way to draw this line: data scientists answer questions with data. Data engineers make sure the data is there to answer questions with.

A data scientist might build a model that predicts customer churn. To do that, they need:


Every one of those requirements is a data engineering problem. The data scientist doesn't want to spend their time building ETL pipelines — and they're usually not best placed to make pipelines reliable at scale. That's the data engineer's job.

Data scientists build on top of what data engineers build. When the foundation is solid, data science moves fast. When it isn't, data scientists spend half their time cleaning data instead of modelling it.

In practice, the boundary shifts from team to team. At a small company, one person might do both. At a larger company, there are dedicated teams for each, and sometimes a third role — the analytics engineer — who sits in between, owning the transformation layer where business logic lives.

What a data engineer actually builds

On any given project, a data engineer might be building one of these:

Ingestion pipelines — jobs that extract data from source systems (databases, APIs, event streams) and land it into a central storage layer. These need to handle partial failures, duplicates, late-arriving data, and upstream schema changes without manual intervention.
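One common way to survive duplicates and late-arriving data is to load with an upsert keyed on a primary key, keeping only the newest version of each row. The sketch below uses SQLite from the Python standard library purely for illustration; the `orders` table and its columns are hypothetical, and a warehouse would typically express the same idea with its own merge statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT, updated_at TEXT)"
)

# Insert new rows; overwrite an existing row only if the incoming one is newer.
# ISO-8601 strings in the same format compare correctly as plain text.
UPSERT = """
INSERT INTO orders (order_id, status, updated_at)
VALUES (:order_id, :status, :updated_at)
ON CONFLICT(order_id) DO UPDATE SET
    status = excluded.status,
    updated_at = excluded.updated_at
WHERE excluded.updated_at > orders.updated_at
"""


def load(rows: list[dict]) -> None:
    """Load a batch in one transaction; safe to call again with the same rows."""
    with conn:
        conn.executemany(UPSERT, rows)


batch = [
    {"order_id": "o1", "status": "placed", "updated_at": "2026-06-01T10:00:00"},
    {"order_id": "o1", "status": "shipped", "updated_at": "2026-06-02T08:00:00"},
    {"order_id": "o2", "status": "placed", "updated_at": "2026-06-01T11:00:00"},
]
load(batch)
load(batch)      # a re-run after a partial failure changes nothing
load(batch[:1])  # a late-arriving older record does not overwrite newer state

rows = conn.execute("SELECT order_id, status FROM orders ORDER BY order_id").fetchall()
print(rows)  # [('o1', 'shipped'), ('o2', 'placed')]
```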

Data warehouses and lakehouses — the central stores where raw and processed data lives. Deciding how to structure them, how to partition data for query performance, which file formats to use, and how to manage storage costs is a significant design problem in itself.

Transformation pipelines — jobs that take raw, messy source data and produce clean, structured, analysis-ready tables. dbt (data build tool) has become the standard tool for this layer in most modern stacks — it brings software engineering practices like version control, testing, and documentation to SQL-based transformations.

Orchestration — the scheduler that runs everything in the right order, at the right time, and re-runs failed steps automatically. Apache Airflow is the most common tool; newer alternatives like Prefect and Dagster have gained significant ground.
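At its core, "the right order" means a topological sort of a dependency graph. The sketch below shows only that idea, using the standard library's `graphlib`; it is not how Airflow, Prefect, or Dagster are configured, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each hypothetical task maps to the set of tasks that must finish first.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_orders": {"extract_orders", "extract_customers"},
    "quality_checks": {"transform_orders"},
    "publish_dashboard": {"quality_checks"},
}

# static_order() yields every task after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

The two extract tasks have no dependency on each other, so their relative order is not fixed. A real orchestrator uses that freedom to run them in parallel, and adds scheduling, retries, and state tracking on top.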

Data quality systems — tests and monitors that catch bad data before it reaches analysts or models. A table that loses 40% of its rows overnight, or a column that starts returning nulls, should trigger an alert — not be discovered by a surprised analyst three weeks later when a report looks wrong.
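The two failure modes mentioned above, a sudden drop in row count and a column that starts returning nulls, can each be checked with very little code. The function names and the 40% threshold are assumptions for this sketch; in practice these checks usually live in a testing or monitoring tool rather than hand-written functions.

```python
from typing import Any


def row_count_dropped(previous: int, current: int, max_drop: float = 0.4) -> bool:
    """Return True when the row count fell by at least max_drop since the last run."""
    if previous == 0:
        return False  # nothing to compare against
    return (previous - current) / previous >= max_drop


def null_rate(rows: list[dict[str, Any]], column: str) -> float:
    """Return the fraction of rows in which column is missing or None."""
    if not rows:
        return 0.0
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows)


# Hypothetical sample: half of the email values are null.
sample = [{"email": None}, {"email": "a@example.com"}, {"email": None}, {"email": "b@example.com"}]

print(row_count_dropped(previous=1000, current=500))  # True
print(row_count_dropped(previous=1000, current=900))  # False
print(null_rate(sample, "email"))                     # 0.5
```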

The modern data engineering stack

The tools you'll encounter most often in the field:


You won't use all of these simultaneously. A typical stack picks one tool from each category. The combination that has become a de facto standard at many companies — S3 or GCS for storage, Snowflake or BigQuery as the warehouse, dbt for transformation, and Airflow for orchestration — is often called the modern data stack.

This series covers most of these tools and the concepts behind them. But tools change; the underlying concepts don't. Understanding why a tool exists is more durable than knowing how to configure it.

A typical week

To make this concrete, here's roughly what a week might look like for a data engineer at a mid-sized company.

Monday: A pipeline that loads order data from the production database into the warehouse failed over the weekend. The source table added a new column and the schema wasn't updated downstream. Fix the schema, backfill the missed runs, add a schema contract test so this fails loudly next time instead of silently.
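A schema contract test of the kind Monday's fix adds can be as simple as comparing the columns a pipeline expects with the columns the source actually has. The column names and types below are hypothetical, and how the actual schema is fetched depends on the source database.

```python
def schema_violations(expected: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Describe every difference between the expected and actual column types."""
    problems = []
    for column in sorted(expected.keys() - actual.keys()):
        problems.append(f"missing column: {column}")
    for column in sorted(actual.keys() - expected.keys()):
        problems.append(f"unexpected column: {column}")
    for column in sorted(expected.keys() & actual.keys()):
        if expected[column] != actual[column]:
            problems.append(
                f"type changed: {column} {expected[column]} -> {actual[column]}"
            )
    return problems


# The contract the pipeline was built against (hypothetical columns).
expected = {"order_id": "TEXT", "amount_cents": "INTEGER", "created_at": "TEXT"}

# What the source table looks like after the upstream team added a column.
actual = {**expected, "coupon_code": "TEXT"}

print(schema_violations(expected, actual))  # ['unexpected column: coupon_code']
```

Running a check like this before the load step turns a schema change into an explicit, readable failure.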

Tuesday–Wednesday: Build a new ingestion pipeline for a third-party marketing tool the growth team just onboarded. Extract the data via API, handle pagination and rate limits, land it in the raw layer, write a basic transformation that makes it joinable with customer data.

Thursday: An analyst asks why revenue figures in one dashboard don't match another. Trace both queries back to their source tables, discover one is using event time and the other processing time. Document the correct definition in the data catalogue, update the model, and let the analyst know.
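The mismatch in Thursday's investigation is easy to reproduce with three made-up orders. In this sketch, `event_time` is when the order happened and `processed_time` is when it was loaded; both column names and all the values are invented for illustration.

```python
from collections import defaultdict

# The first order was placed just before midnight but loaded just after it.
orders = [
    {"amount": 100, "event_time": "2026-06-01T23:50:00", "processed_time": "2026-06-02T00:05:00"},
    {"amount": 50, "event_time": "2026-06-01T12:00:00", "processed_time": "2026-06-01T12:01:00"},
    {"amount": 70, "event_time": "2026-06-02T09:00:00", "processed_time": "2026-06-02T09:02:00"},
]


def daily_revenue(rows: list[dict], time_column: str) -> dict[str, int]:
    """Total the amounts per calendar day, using the chosen timestamp column."""
    totals: dict[str, int] = defaultdict(int)
    for row in rows:
        day = row[time_column][:10]  # the YYYY-MM-DD prefix of an ISO timestamp
        totals[day] += row["amount"]
    return dict(totals)


by_event = daily_revenue(orders, "event_time")
by_processed = daily_revenue(orders, "processed_time")

print(by_event)      # {'2026-06-01': 150, '2026-06-02': 70}
print(by_processed)  # {'2026-06-01': 50, '2026-06-02': 170}
```

Both dashboards add up to the same total, but the order placed just before midnight lands on a different day in each, so the daily figures disagree.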

Friday: Code review for a teammate's new transformation. The query scans 500 GB on every run when a partition filter would reduce that to 10 GB. Explain why, show the fix, and approve once it's pushed.
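The arithmetic behind Friday's review comment can be sketched with a toy model of a partitioned table. The layout of 50 daily partitions at 10 GB each is an assumption chosen to match the figures above; real engines decide what to skip from partition metadata rather than a Python dictionary.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical table: 50 daily partitions of 10 GB each, 500 GB in total.
first_day = date(2026, 5, 1)
partitions = {first_day + timedelta(days=i): 10 for i in range(50)}  # day -> GB


def gb_scanned(table: dict[date, int], day: Optional[date] = None) -> int:
    """Return how many GB a query reads, with or without a partition filter."""
    if day is None:
        return sum(table.values())  # no filter: every partition is read
    return table.get(day, 0)  # filter on the partition column: one partition


print(gb_scanned(partitions))                    # 500
print(gb_scanned(partitions, date(2026, 6, 1)))  # 10
```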

There's less glamour and more plumbing than the job title suggests: fixing broken schemas, chasing down mismatched row counts, debugging a pipeline that fails on the third Tuesday of every month for reasons that take two days to find. But the plumbing is what makes everything else work: the models, the dashboards, and the product decisions that depend on reliable data being in the right place.

Where this series goes from here

This series walks through data engineering from the ground up — the concepts, the architecture decisions, and the practical patterns behind each layer of the stack.

The next post builds the mental model that ties everything together: the anatomy of a data pipeline, from source system to serving layer, and what happens at each stage. It's the map you'll refer back to throughout the rest of the series.