Stream and Batch Processing Frameworks

Why such frameworks?

  • Process high-throughput data at low latency.
  • Provide fault tolerance in distributed systems.
  • Offer a generic abstraction to serve volatile business requirements.
  • Handle both bounded data sets (batch processing) and unbounded data sets (stream processing).

Brief history of batch/stream processing

  1. Hadoop and MapReduce. Google made batch processing in a distributed system as simple as `result = pairs.map(pair => morePairs).reduce(somePairs => lessPairs)`.
  2. Apache Storm and DAG topologies. MapReduce does not express iterative algorithms efficiently, so Nathan Marz abstracted stream processing into a graph of spouts and bolts.
  3. Spark and in-memory computing. Reynold Xin said Spark sorted the same data 3X faster using 10X fewer machines than Hadoop.
  4. Google Dataflow, based on MillWheel and FlumeJava. Google supports both batch and streaming computation with a unified windowing API.
  5. Apache Flink. Flink has been adopted quickly because of:
       1. its fast adoption of the Google Dataflow/Beam programming model.
       2. its highly efficient implementation of Chandy-Lamport checkpointing (see the sketch below).
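To make the checkpointing point concrete, here is a minimal sketch using Flink's DataStream API. It only shows how a job turns on periodic Chandy-Lamport style snapshots; the class name, the 10-second interval, and the toy pipeline are illustrative assumptions, not recommendations.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot the whole dataflow every 10 seconds (arbitrary interval)
        // with exactly-once semantics.
        env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

        // A trivial pipeline so the job graph is not empty.
        env.fromElements(1, 2, 3)
           .map(x -> x * 2)
           .print();

        env.execute("checkpointing-sketch");
    }
}
```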

How?

Architectural Choices

To serve the requirements above with commodity machines, streaming frameworks build on distributed systems with one of these architectures:

  • master-slave (centralized): Apache Storm with ZooKeeper, Apache Samza with YARN.
  • P2P (decentralized): Apache S4.

Features

  1. DAG topology for iterative processing, e.g. GraphX in Spark, topologies in Apache Storm, the DataStream API in Flink.
  2. Delivery guarantees. How reliably is data delivered from node to node: at-least-once, at-most-once, or exactly-once?
  3. Fault tolerance, using cold/warm/hot standby, checkpointing, or active-active replication.
  4. Windowing API for unbounded data sets, e.g. stream windows in Apache Flink, Spark window functions, and windowing in Apache Beam (see the sketch after this list).
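As a rough illustration of the windowing idea, the sketch below keys a stream of (word, count) pairs and sums the counts over 10-second tumbling processing-time windows with Flink's DataStream API. The in-memory source, the field positions, and the window size are assumptions made to keep the example self-contained; a real job would read from an unbounded source such as Kafka.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A tiny in-memory collection stands in for an unbounded source.
        DataStream<Tuple2<String, Integer>> words = env.fromElements(
                Tuple2.of("stream", 1),
                Tuple2.of("batch", 1),
                Tuple2.of("stream", 1));

        // Group by word, chop the stream into 10-second tumbling windows,
        // and sum the counts within each window.
        words.keyBy(value -> value.f0)
             .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
             .sum(1)
             .print();

        env.execute("windowing-sketch");
    }
}
```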

Comparison

| Framework                   | Storm         | Storm Trident | Spark        | Flink        |
| --------------------------- | ------------- | ------------- | ------------ | ------------ |
| Model                       | native        | micro-batch   | micro-batch  | native       |
| Guarantees                  | at-least-once | exactly-once  | exactly-once | exactly-once |
| Fault tolerance             | record ack    | record ack    | checkpoint   | checkpoint   |
| Overhead of fault tolerance | high          | medium        | medium       | low          |
| Latency                     | very low      | high          | high         | low          |
| Throughput                  | low           | medium        | high         | high         |