Every data pipeline makes a choice that shapes every downstream decision: is the data processed in batches, or does it flow continuously? This is not just a technical preference — it's a fundamental architectural decision with real consequences for latency, cost, complexity, and what you can build on top.
The core tradeoff
Batch processing collects data over a time window (an hour, a day) and processes it all at once. Stream processing ingests and processes each event as it arrives, within milliseconds or seconds.
Batch: [──── 1 hour of events ────] → process → output
Stream: event → process → output (per event, continuously)
The tradeoff is fundamentally between simplicity and latency.
Batch is simpler. The data is bounded — you know exactly what you're processing. Jobs can be rerun easily. Failures don't cascade. The entire toolchain (SQL, dbt, Spark in batch mode) is built around it.
Streaming is lower latency. Events are processed as they happen. But the data is unbounded: you never have a "complete" window. Handling late-arriving events, out-of-order events, and exactly-once processing adds significant complexity.
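To make the bounded versus unbounded distinction concrete, here is a minimal sketch in plain Python with no streaming framework involved. The `Event` shape and the names `batch_totals` and `StreamingTotals` are assumptions made for illustration only. The batch function sees the whole input and returns a final answer; the streaming class only ever holds a running answer.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    user: str
    amount: float


def batch_totals(events):
    """Batch: the input is bounded, so a single pass yields the final totals."""
    totals = {}
    for e in events:
        totals[e.user] = totals.get(e.user, 0.0) + e.amount
    return totals


@dataclass
class StreamingTotals:
    """Stream: state is updated per event and is never known to be final."""
    totals: dict = field(default_factory=dict)

    def on_event(self, e):
        self.totals[e.user] = self.totals.get(e.user, 0.0) + e.amount
        return self.totals[e.user]  # running total so far for this user


events = [Event("ana", 10.0), Event("bo", 5.0), Event("ana", 2.5)]

stream = StreamingTotals()
running = [stream.on_event(e) for e in events]  # one output per event
final = batch_totals(events)                    # one output per batch
```

Both approaches agree once the same events have been seen. The difference is that the streaming version has to emit answers before it knows whether more events (possibly late or out of order) are still coming.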
When batch is right
Most data engineering work is batch. The business question doesn't need a real-time answer, and the latency gain from streaming doesn't justify the added complexity.
Batch is the right choice when:
- Data freshness of hours is acceptable. "Yesterday's sales report" is a batch workload. Daily, hourly, or even 15-minute schedules cover the vast majority of analytical use cases.
- The source itself is batched. Files dropped on an SFTP server, nightly database exports, daily API calls — if data only becomes available in batches, streaming doesn't help.
- The transformation is complex. Heavy aggregations, large joins, multi-step transformations are easier to implement, test, and debug in batch. SQL-based transformations in dbt are inherently batch.
- Cost is a concern. Streaming infrastructure (Kafka, Flink clusters, stream processing compute) runs continuously and is typically more expensive to operate than equivalent scheduled batch compute.
When streaming is right
Streaming is justified when latency is a genuine business requirement — not just a preference, but something that meaningfully changes outcomes.
Streaming is the right choice when:
- Seconds matter to the end user. Fraud detection must flag a transaction before it clears. A recommendation engine that updates as a user browses cannot wait for the next batch run.
- You're reacting to events. Sending a notification when an order ships, triggering a downstream process when a payment completes — event-driven architectures need event streams.
- Data volume is too high to batch efficiently. A billion events per day from IoT sensors or clickstreams may be cheaper to process continuously in a streaming engine than to buffer and batch.
Micro-batch: the pragmatic middle ground
Most teams don't need true millisecond-latency streaming. What they need is "fresher than hourly": data that's current within 5 minutes rather than up to an hour old. This is the space where micro-batch lives.
Micro-batch runs the same batch job on a much shorter schedule, such as every 5 minutes or every minute. Each run processes a small window of new data. Worst-case latency is roughly the batch interval (say, 5 minutes) plus the job runtime (say, 30 seconds): not milliseconds, but often good enough.
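# A micro-batch ingestion job
# Run every 5 minutes via Airflow or a cron schedule
from datetime import datetime, timedelta, timezone

def run_micro_batch():
    # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
    now = datetime.now(timezone.utc)
    window_start = now - timedelta(minutes=5)
    # Extract events from the last 5 minutes
    # (extract_events, transform and load_to_warehouse are pipeline-specific helpers defined elsewhere)
    events = extract_events(since=window_start, until=now)
    # Transform and load, skipping empty windows
    if events:
        transformed = transform(events)
        load_to_warehouse(transformed)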
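One caveat with a wall-clock window like the one above: if a run starts late or is skipped, events can be missed, and overlapping runs can load the same event twice. A common alternative is to store a high-water mark and process everything between the stored mark and the current run time. The sketch below is an illustration under stated assumptions: the in-memory `source`, `state`, and `sink` stand in for a real event store, a persisted checkpoint, and the warehouse, and `run_incremental_batch` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def run_incremental_batch(source, state, sink, now):
    """Process events with start <= ts < now, then advance the high-water mark."""
    start = state["high_water_mark"]
    batch = [e for e in source if start <= e["ts"] < now]
    sink.extend(batch)               # stand-in for transform + load
    state["high_water_mark"] = now   # advance only after the load succeeded
    return len(batch)


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
# One event per minute for 12 minutes.
source = [{"id": i, "ts": t0 + timedelta(minutes=i)} for i in range(12)]
state = {"high_water_mark": t0}
sink = []

# The first run fires on schedule; the second is delayed by two minutes.
first = run_incremental_batch(source, state, sink, t0 + timedelta(minutes=5))
second = run_incremental_batch(source, state, sink, t0 + timedelta(minutes=12))
```

Because each run picks up exactly where the previous one stopped, the delayed second run still covers the full gap, with no missed or duplicated events in this toy setup. Events that arrive in the source with a timestamp older than the mark would still need separate handling.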
Spark Structured Streaming implements micro-batch processing as its default mode — it runs a streaming job as a sequence of very short batch jobs, combining streaming semantics with batch reliability.
Apache Kafka: the ingestion backbone
Whether you're doing true streaming or micro-batch, Kafka often appears as the ingestion layer for high-volume event data. Understanding it at a conceptual level is important even if you're not building with it directly.
Kafka is a distributed event log. Producers write events to named topics. Each topic is divided into ordered partitions. Consumers read from partitions at their own pace, tracking their position (offset) independently. Events are retained for a configurable period (days, weeks) — not consumed and deleted like a traditional queue.
Topic: orders
Partition 0: [e1, e2, e5, e7, e9, ...]
Partition 1: [e3, e4, e6, e8, ...]
Consumer A (fraud detection): reading at offset 9 in P0, offset 8 in P1
Consumer B (warehouse ingestion): reading at offset 5 in P0, offset 4 in P1
The key property: multiple consumers can read the same topic independently, at different speeds. The fraud detection service and the warehouse ingestion job both consume the orders topic — at different latencies, with different processing logic, without interfering with each other.
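The log model can be sketched in a few lines of plain Python. This is a toy in-memory model for intuition only, not the Kafka client API: the `Topic` and `Consumer` classes, their methods, and the integer-key partitioning rule are all assumptions made for this example (real Kafka hashes the key bytes, and consumers commit offsets back to the cluster).

```python
from collections import defaultdict


class Topic:
    """Toy model of a partitioned, append-only log."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Simplified placement: same key always lands in the same partition.
        p = key % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)


class Consumer:
    """Each consumer tracks its own offset per partition; reads delete nothing."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)

    def poll(self, partition, max_events):
        log = self.topic.partitions[partition]
        start = self.offsets[partition]
        batch = log[start:start + max_events]
        self.offsets[partition] += len(batch)
        return batch


orders = Topic(num_partitions=2)
for i in range(6):
    orders.append(key=i % 2, value=f"e{i}")  # two customers, keys 0 and 1

fraud = Consumer(orders)      # fast consumer
warehouse = Consumer(orders)  # slow consumer

fraud_batch = fraud.poll(partition=0, max_events=10)
warehouse_batch = warehouse.poll(partition=0, max_events=1)
```

After these calls the two consumers sit at different offsets in partition 0, and the partition still holds every event, which is the property that lets new consumers be added later without touching the producers.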
Choosing between batch, micro-batch, and streaming
A practical decision framework:
- What latency does the business need? If daily is fine, use daily batch. If hourly is fine, use hourly batch. Only move to lower latency when there's a specific requirement for it.
- What latency does the source support? If data is only available daily (file drops, scheduled exports), streaming doesn't help regardless of what you want.
- What is the operational cost? A streaming Kafka + Flink stack requires dedicated expertise to operate reliably. A dbt + Airflow batch stack is much simpler. Only take on the streaming complexity if the latency requirement justifies it.
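The three questions above can be written down as a small function, which is sometimes useful for making the reasoning explicit in a design review. This is only a sketch: the function name `choose_mode`, its parameters, and especially the one-hour and one-minute thresholds are illustrative assumptions, not fixed rules.

```python
def choose_mode(required_latency_s, source_latency_s, can_operate_streaming):
    """Return 'batch', 'micro-batch' or 'streaming' for a pipeline.

    The thresholds below are illustrative assumptions only.
    """
    # A pipeline cannot be fresher than its source.
    effective = max(required_latency_s, source_latency_s)
    if effective >= 3600:   # hourly or slower is acceptable
        return "batch"
    if effective >= 60:     # minutes are acceptable
        return "micro-batch"
    # Sub-minute latency: streaming only if the team can operate it.
    return "streaming" if can_operate_streaming else "micro-batch"


daily_report = choose_mode(86400, 0, True)
fresh_dashboard = choose_mode(300, 0, True)
nightly_export = choose_mode(1, 86400, True)   # wants seconds, source is daily
fraud_check = choose_mode(1, 0, True)
```

Note how the nightly export case resolves to batch even though the stated requirement is one second: the source latency dominates, so lower-latency processing would add cost without adding freshness.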
In practice: most analytical pipelines (reporting, dashboards, data science) are batch or micro-batch. Streaming is reserved for operational use cases where latency directly affects user experience or business outcomes — fraud detection, real-time personalisation, operational monitoring.
Start with batch. Move to micro-batch if the freshness requirement tightens. Move to streaming only when the business genuinely requires it and you have the engineering capacity to operate it.