Orchestration fundamentals

A data pipeline that runs once is a script. A data pipeline that runs reliably, on schedule, in the right order, with retries on failure, and with visibility into what's happening — that's an orchestrated system. Orchestration is what turns a collection of working code into a production data platform.

What an orchestrator does

An orchestrator is responsible for four things:

Scheduling — run this task at 06:00 UTC, every day
Dependency management — run Task B only after Task A succeeds
Failure handling — if Task A fails, retry it up to 3 times before alerting
Observability — provide a UI or API to see what ran, when, what failed, and why

Without an orchestrator, these responsibilities fall to cron jobs, shell scripts, and someone's tribal knowledge. That works at very small scale. At any meaningful scale it becomes unmanageable.

The DAG: the core abstraction

Every orchestration framework models work as a Directed Acyclic Graph (DAG). A DAG is a set of tasks (nodes) connected by dependency edges (arrows). "Directed" means edges have direction (A must run before B). "Acyclic" means there are no cycles (A cannot depend on B if B depends on A — that would be an infinite loop).

extract_orders ──→ validate_orders ──→ transform_orders ──→ load_to_warehouse ──→ update_dashboard
                                                 ↑
extract_customers ──→ validate_customers ─────────┘

The orchestrator reads this graph and figures out the execution order — which tasks can run in parallel, which must wait, what to retry if something fails. This is the core value: you define the logic (what depends on what), the orchestrator handles the execution (how and when).

Apache Airflow: the standard

Apache Airflow is the most widely used orchestrator in data engineering. It defines DAGs in Python, which makes them flexible and programmable.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

default_args = {
    'owner': 'data-team',
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['alerts@yourcompany.com'],
}

with DAG(
    dag_id='orders_pipeline',
    schedule_interval='0 6 * * *',  # daily at 06:00 UTC (cron syntax)
    start_date=datetime(2024, 1, 1),
    default_args=default_args,
    catchup=False,
) as dag:

    extract = PythonOperator(
        task_id='extract_orders',
        python_callable=extract_orders_from_postgres,
    )

    transform = PostgresOperator(
        task_id='transform_orders',
        postgres_conn_id='warehouse',
        sql='sql/transform_orders.sql',
    )

    load = PythonOperator(
        task_id='load_to_warehouse',
        python_callable=load_orders_to_warehouse,
    )

    # Define dependencies
    extract >> transform >> load

The >> operator sets the execution order: extract, then transform, then load. Airflow's UI shows a graph view of this DAG, the run history, logs for each task, and retry status.

Cron syntax — understanding schedules

Schedules in most orchestrators use cron syntax. It has five fields:

┌──────────── minute (0–59)
│  ┌─────────── hour (0–23)
│  │  ┌────────── day of month (1–31)
│  │  │  ┌───────── month (1–12)
│  │  │  │  ┌────── day of week (0–6, 0=Sunday)
│  │  │  │  │
*  *  *  *  *

Common schedules:
0 6 * * *     → daily at 06:00 UTC
0 */4 * * *   → every 4 hours
30 6 * * 1    → every Monday at 06:30 UTC
0 6 1 * *     → first of every month at 06:00 UTC

Beyond Airflow: Prefect, Dagster, and cloud-native options

Airflow is the incumbent, but it has real weaknesses: the UI is dated, local development is complex, and Python-defined DAGs can be cumbersome for simple use cases.

Prefect modernises the Airflow approach. Flows and tasks are plain Python functions decorated with @flow and @task. Dependencies are inferred from function calls rather than explicitly defined with operators.

from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def extract_orders():
    ...

@task
def transform_orders(orders):
    ...

@task
def load_orders(transformed):
    ...

@flow(name="orders-pipeline")
def orders_pipeline():
    orders = extract_orders()
    transformed = transform_orders(orders)
    load_orders(transformed)

Dagster introduces the concept of assets — instead of defining tasks, you define the data objects each task produces. This makes the orchestration layer aware of what data exists and lets you think about freshness ("is this asset up to date?") rather than just task scheduling.

Cloud-native options: if you're already on AWS, Azure, or GCP, their managed services (AWS Step Functions, Azure Data Factory, Cloud Composer) offer orchestration without the operational overhead of running your own Airflow cluster.

Sensors and cross-pipeline dependencies

Not all triggers are time-based. Sometimes Pipeline B should start when Pipeline A completes, or when a file lands in S3, or when a table has been updated. Orchestrators handle this with sensors:

from airflow.sensors.s3_key_sensor import S3KeySensor
from airflow.sensors.external_task import ExternalTaskSensor

# Wait for a file to land in S3
wait_for_file = S3KeySensor(
    task_id='wait_for_source_file',
    bucket_name='my-data-lake',
    bucket_key='incoming/orders_{{ ds }}.csv',
    poke_interval=300,   # check every 5 minutes
    timeout=3600,        # fail after 1 hour if no file
)

# Wait for another DAG to complete
wait_for_ingestion = ExternalTaskSensor(
    task_id='wait_for_ingestion_dag',
    external_dag_id='ingestion_pipeline',
    external_task_id='load_complete',
    poke_interval=60,
)

wait_for_file >> wait_for_ingestion >> transform

Observability: the other half of orchestration

An orchestrator is only as useful as the visibility it provides. You need to know, at a glance: what ran successfully, what failed, what is currently running, and how long things typically take.

The minimum observability setup for a production pipeline:

Failure alerts: any task failure sends an immediate notification (email, Slack, PagerDuty). Don't rely on people checking a dashboard — they won't.
SLA monitoring: if a job hasn't completed by a certain time (e.g., the dashboard must be ready by 08:00), alert when the SLA is at risk, not after it's missed.
Log accessibility: full task logs must be viewable from the UI. Debugging requires logs — if they're hard to get to, the tool isn't usable.
Run history: you need to see the last N runs, their duration, and their status. Spotting an anomaly (a job that normally takes 5 minutes is taking 45) requires history to compare against.

The orchestrator is not the pipeline

A common misconception: the orchestrator is not where the actual data processing happens. The orchestrator is the scheduler and coordinator. The processing happens in:

Python functions running on Airflow worker nodes
SQL queries running in the data warehouse
dbt models executing against BigQuery or Snowflake
Spark jobs running on EMR, Databricks, or a Kubernetes cluster

The orchestrator's job is to say "run this SQL now, on that system, and tell me when it's done". The heavy lifting happens elsewhere. This separation keeps the orchestrator lightweight and the compute scalable.

Where this series ends and the work begins

This post completes the Data Engineering Fundamentals series. We've covered what data engineers do, pipeline anatomy, storage architectures, file formats, open table formats, extraction patterns, CDC, batch vs streaming, SQL, dbt, the medallion architecture, idempotency, and orchestration.

These fifteen posts are the foundation. What comes next is the application: building systems that combine these concepts, making tradeoffs between them, and operating them reliably at scale. That's where data engineering moves from knowledge to craft — and that's what the rest of this site is here to help with.