Batch vs streaming — choosing the right ingestion model

SkyDeLake Jun 28, 2026 5 min read

Every data pipeline makes a choice that shapes every downstream decision: is the data processed in batches, or does it flow continuously? This is not just a technical preference — it's a fundamental architectural decision with real consequences for latency, cost, complexity, and what you can build on top.

The core tradeoff

Batch processing collects data over a time window (an hour, a day) and processes it all at once. Stream processing ingests and processes each event as it arrives, within milliseconds or seconds.

Batch:    [──── 1 hour of events ────] → process → output
Stream:   event → process → output (per event, continuously)
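
To make the difference concrete, here is a minimal Python sketch. The event shape and the function names are illustrative assumptions, not part of any library. The batch function sees a complete, bounded list and returns one answer; the streaming function emits an updated answer after every event.

```python
from typing import Iterable, Iterator

def process_batch(events: list[dict]) -> float:
    """Batch: the input is bounded, so one pass yields one final result."""
    return sum(e["amount"] for e in events)

def process_stream(events: Iterable[dict]) -> Iterator[float]:
    """Stream: the input may never end, so emit a running result per event."""
    total = 0.0
    for e in events:
        total += e["amount"]
        yield total

# Hypothetical sample events
events = [{"amount": 10.0}, {"amount": 5.0}, {"amount": 2.5}]

batch_total = process_batch(events)            # one result, after all events
running_totals = list(process_stream(events))  # one result per event
```

The batch result is only available once the window closes, while the stream always has a current (but never final) value.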

The tradeoff is fundamentally between simplicity and latency.

Batch is simpler. The data is bounded — you know exactly what you're processing. Jobs can be rerun easily. Failures don't cascade. The entire toolchain (SQL, dbt, Spark in batch mode) is built around it.

Streaming is lower latency. Events are processed as they happen. But the data is unbounded: you never have a "complete" window. Handling late-arriving events, out-of-order events, and exactly-once processing adds significant complexity.
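
A small sketch shows why late and out-of-order events are awkward. It counts events per one-minute event-time window and uses a simplified watermark (the latest event time seen, minus an allowed lateness) to decide when a window is closed. The window size, the lateness value, and the function name are assumptions chosen for illustration; real stream processors implement this with considerably more care.

```python
from collections import defaultdict

WINDOW_SECONDS = 60    # tumbling window size (assumed)
ALLOWED_LATENESS = 30  # how long to wait for stragglers (assumed)

def window_counts(event_times):
    """Count events per event-time window, dropping events that arrive too late.

    `event_times` holds event timestamps (in seconds) in arrival order.
    Returns (counts keyed by window start, list of dropped timestamps).
    """
    counts = defaultdict(int)
    dropped = []
    watermark = float("-inf")  # latest event time seen, minus allowed lateness

    for ts in event_times:
        watermark = max(watermark, ts - ALLOWED_LATENESS)
        window_start = ts - ts % WINDOW_SECONDS
        window_end = window_start + WINDOW_SECONDS
        if window_end <= watermark:
            # This window is already treated as complete: too late to include.
            dropped.append(ts)
        else:
            counts[window_start] += 1
    return dict(counts), dropped

# Arrival order: 50 is out of order but tolerated; 10 arrives far too late.
counts, dropped = window_counts([5, 61, 50, 130, 10])
```

A batch job never faces this decision, because it runs after the window is known to be complete.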

When batch is right

Most data engineering work is batch. The business question doesn't need a real-time answer, and the engineering tradeoff doesn't justify the complexity.

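Batch is the right choice when daily or hourly freshness is enough for the business, when the source only delivers data on a schedule (file drops, scheduled exports), or when the workload is analytical: reporting, dashboards, data science.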

When streaming is right

Streaming is justified when latency is a genuine business requirement — not just a preference, but something that meaningfully changes outcomes.

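Streaming is the right choice when latency directly affects user experience or business outcomes, as in fraud detection, real-time personalisation, and operational monitoring.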

Micro-batch: the pragmatic middle ground

Most teams don't need true millisecond-latency streaming. What they need is "fresher than hourly" — data that's current within 5 minutes rather than the next day. This is the space where micro-batch lives.

Micro-batch runs the same batch job on a much shorter schedule: every 5 minutes, or every minute. Each run processes a small window of new data. Worst-case latency is roughly the batch interval plus the job runtime (for example, 5 minutes plus 30 seconds). That is not milliseconds, but it is often good enough.
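
The latency arithmetic can be sketched as a small helper. It assumes events arrive uniformly over the interval and that every run starts on time; the function name is illustrative.

```python
def staleness_seconds(interval_s: float, runtime_s: float) -> tuple[float, float]:
    """Return (worst_case, average) staleness for a micro-batch schedule."""
    # Worst case: the event arrives just after a window closes, so it waits
    # a full interval for the next run, then for that run to finish.
    worst_case = interval_s + runtime_s
    # Average: under uniform arrivals an event waits half an interval.
    average = interval_s / 2 + runtime_s
    return worst_case, average

# A 5-minute schedule with a 30-second job
worst, avg = staleness_seconds(interval_s=300, runtime_s=30)
```

With these inputs the worst case is 330 seconds and the average is 180 seconds.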

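# A micro-batch ingestion job
# Run every 5 minutes via Airflow or a cron schedule
# extract_events, transform, and load_to_warehouse are placeholders
# for your own pipeline functions.

from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)

def run_micro_batch():
    # Use a timezone-aware UTC timestamp; datetime.utcnow() returns a
    # naive datetime and is deprecated since Python 3.12.
    now = datetime.now(timezone.utc)
    window_start = now - WINDOW

    # Extract events from the last 5 minutes
    events = extract_events(since=window_start, until=now)

    # Transform and load; skip the run entirely if the window is empty
    if events:
        transformed = transform(events)
        load_to_warehouse(transformed)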
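
One caveat with the job above: it derives its window from the wall clock at run time, so a run that starts late or is retried can overlap with, or leave a gap between, neighbouring windows. A common remedy is to derive the window from the scheduled time instead. The sketch below is one way to do that under the assumption of fixed 5-minute intervals aligned to the Unix epoch; `window_for` is a hypothetical helper, and orchestrators such as Airflow can pass the scheduled interval to the task directly.

```python
from datetime import datetime, timedelta, timezone

INTERVAL = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def window_for(scheduled_at: datetime) -> tuple[datetime, datetime]:
    """Return the half-open window [start, end) that closed at or before scheduled_at."""
    intervals_elapsed = (scheduled_at - EPOCH) // INTERVAL  # whole intervals since epoch
    end = EPOCH + intervals_elapsed * INTERVAL
    return end - INTERVAL, end

# A run that starts 40 seconds late gets the same window as an on-time run.
on_time = window_for(datetime(2026, 6, 28, 12, 5, 0, tzinfo=timezone.utc))
delayed = window_for(datetime(2026, 6, 28, 12, 5, 40, tzinfo=timezone.utc))
```

Because the window depends only on the schedule, rerunning a failed run processes exactly the same slice of data.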

Spark Structured Streaming implements micro-batch processing as its default mode — it runs a streaming job as a sequence of very short batch jobs, combining streaming semantics with batch reliability.

Apache Kafka: the ingestion backbone

Whether you're doing true streaming or micro-batch, Kafka often appears as the ingestion layer for high-volume event data. Understanding it at a conceptual level is important even if you're not building with it directly.

Kafka is a distributed event log. Producers write events to named topics. Each topic is divided into partitions, and events are ordered within a partition (not across the whole topic). Consumers read from partitions at their own pace, each tracking its own position (the offset, an event's index within its partition). Events are retained for a configurable period (days, weeks) rather than being consumed and deleted as in a traditional queue.

Topic: orders
  Partition 0: [e1, e2, e5, e7, e9, ...]
  Partition 1: [e3, e4, e6, e8, ...]

Consumer A (fraud detection): reading at offset 4 in P0 (e9), offset 3 in P1 (e8)
Consumer B (warehouse ingestion): reading at offset 2 in P0 (e5), offset 1 in P1 (e4)

The key property: multiple consumers can read the same topic independently, at different speeds. The fraud detection service and the warehouse ingestion job both consume the orders topic — at different latencies, with different processing logic, without interfering with each other.
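
The log-plus-offsets model can be illustrated with a toy in-memory version. This is not the Kafka client API; `ToyTopic` and its methods are hypothetical names that only mirror the concepts described above: partitions as append-only lists, and one offset per consumer per partition.

```python
from collections import defaultdict

class ToyTopic:
    """In-memory stand-in for a partitioned, append-only log (not the Kafka API)."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]
        # offsets[consumer][partition] = index of the next event to read
        self.offsets = defaultdict(lambda: [0] * num_partitions)

    def produce(self, partition: int, event: str) -> None:
        self.partitions[partition].append(event)  # append only, never delete

    def poll(self, consumer: str, partition: int, max_events: int) -> list[str]:
        start = self.offsets[consumer][partition]
        batch = self.partitions[partition][start:start + max_events]
        self.offsets[consumer][partition] = start + len(batch)  # advance this consumer only
        return batch

topic = ToyTopic(num_partitions=2)
for e in ["e1", "e2", "e5"]:
    topic.produce(0, e)
for e in ["e3", "e4"]:
    topic.produce(1, e)

fraud = topic.poll("fraud-detection", partition=0, max_events=10)
warehouse_first = topic.poll("warehouse-ingestion", partition=0, max_events=1)
warehouse_next = topic.poll("warehouse-ingestion", partition=0, max_events=10)
```

The fast consumer reads everything in partition 0 at once, the slow consumer catches up in two polls, and the events stay in the log for both.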

Choosing between batch, micro-batch, and streaming

A practical decision framework:

  1. What latency does the business need? If daily is fine, use daily batch. If hourly is fine, use hourly batch. Only move to lower latency when there's a specific requirement for it.
  2. What latency does the source support? If data is only available daily (file drops, scheduled exports), streaming doesn't help regardless of what you want.
  3. What is the operational cost? A streaming Kafka + Flink stack requires dedicated expertise to operate reliably. A dbt + Airflow batch stack is much simpler. Only take on the streaming complexity if the latency requirement justifies it.
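
The three questions can be encoded as a rough helper. The one-hour and one-minute thresholds, the function name, and the fallback to micro-batch are assumptions for illustration; treat it as a way to think through the framework rather than a rule.

```python
from datetime import timedelta

def choose_ingestion_model(
    required: timedelta,          # latency the business needs
    source: timedelta,            # how often the source makes new data available
    can_operate_streaming: bool,  # does the team have the expertise to run it?
) -> str:
    # You cannot be fresher than the source delivers.
    effective = max(required, source)
    if effective >= timedelta(hours=1):      # assumed threshold
        return "batch"
    if effective >= timedelta(minutes=1):    # assumed threshold
        return "micro-batch"
    # Sub-minute latency needs streaming, but only if you can operate it.
    return "streaming" if can_operate_streaming else "micro-batch"

daily_report = choose_ingestion_model(timedelta(days=1), timedelta(minutes=1), True)
file_drop = choose_ingestion_model(timedelta(seconds=1), timedelta(days=1), True)
fraud_check = choose_ingestion_model(timedelta(seconds=1), timedelta(seconds=1), True)
```

Note that the daily file drop resolves to batch even though the business asked for one-second latency: the source is the limiting factor.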

In practice: most analytical pipelines (reporting, dashboards, data science) are batch or micro-batch. Streaming is reserved for operational use cases where latency directly affects user experience or business outcomes — fraud detection, real-time personalisation, operational monitoring.

Start with batch. Move to micro-batch if the freshness requirement tightens. Move to streaming only when the business genuinely requires it and you have the engineering capacity to operate it.