Please enable JavaScript.
Coggle requires JavaScript to display documents.
DST Lecture 5 (Streaming Data Processing (Apache Apex) (structure…
DST Lecture 5
Streaming Data Processing (Apache Apex)
Batch vs. Stream vs. Interactive Analytics
centralised processing
only queries / pattern written
single processing unit (possibly multiple hosts)
distributed processing
interconnected processing units
code processing events
topology needs to be designed
structure
directed acyclic graph of operators
data input/output operators
compute operators
different data sources through different connectors
stream connects output port to input port of operators
stream data is sliced into windows
scheduling and execution environments
distributed tasks on multiple machines
each machine can run multiple tasks
YARN for execution tasks
HDFS for persistent state
fault tolerance
operator checkpoints: operator state is saved
recovery
at least once
downstream operators restarted
upstream operators replayed
at most once
assume data can be lost
restart operator
subscribe to new data from upstream
exactly once
scalability
locality configuration for deployment
thread local (intra-thread)
container local (intra-process)
node local (inter-process, same hadoop node)
rack local (inter-node)
affinity and anti-affinity rules
Advanced Workflows/Data Pipeline Processing
access and coordinate
different compute services
data sources
deployment services
complex business logics
Analytics-as-a-service
metrics, user activities analytics, testing
dynamic analytics of business activities
workflows
set of coordinated activities
generic workflows of different categories of tasks
data workflows -> data pipeline
set of processing elements
connected in series
output of one element is input of next one
carry out data processing job
Representation
programming languages
descriptive languages (e.g. BPEL)
Streaming Analysis (Apache Storm)
data streams (unbounded sequence of tuples)
spout
source of streams
read tuples from external source
feed tuples to the topology
can emit multiple streams
unreliable vs reliable
bolt
represents processing functions
can emit multiple streams
run by worker process
multiple executors (thread)
executors process tasks
stream grouping defines distribution of tuples among tasks
Large-scale data analytics
Analytics-as-a-service
understand monitoring information
information for optimizing business
Big data analytics
at rest
in motion
Apache Kafka
usecases
producers generate lots of events
producers and consumers have different processing speeds
rich and diverse types of events
consumers might be on and off (fault tolerance support)
design
cluster of brokers deliver messages
durable messages and ordered delivery via partitions
online/offline consumers
pull data
keep single pointer indicating position in a partition
control their own speed of processing
organised in groups
message is delivered to group
each partition is assigned to one consumer
only one consumer of group consumes message
different consumers in different groups can retrieve same data
allow queuing on top of publish/subscribe
heavy filesystem use (message storage and caching)
topics
consist of partitions
ordered, immutable sequences of records
records have unique id within partition
distributed among servers (fault tolerance)
allows for fault tolerance
leader handles read and write requests
followers replicate
follower automatically takes over in case of error
define parallelism (equal to number of partitions)
ordering guarantee holds withing partition
publish/subscribe semantics
persistent