Apache Spark
Stream processor features
Delivery Guarantee feature
At-least-once
If a message is not processed, it is resent.
Message cannot be lost but may be redelivered.
Streaming job must be idempotent
Exactly-once
most complex delivery guarantee
Messages will never be lost and will be processed exactly once.
Worker nodes must respond with a success or failure after a message is processed
At-most-once
If a message is not processed, it is lost forever.
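The at-least-once requirement that the job be idempotent can be sketched as follows. This is a minimal illustration, not Spark code: `IdempotentSink`, `deliver_at_least_once`, and the `fail_first_ack` parameter are hypothetical names, and the sink is assumed to remember message ids in memory.

```python
class IdempotentSink:
    """Hypothetical sink that applies each message id at most once."""

    def __init__(self):
        self.seen_ids = set()  # ids that have already been applied
        self.total = 0         # running aggregate over applied messages

    def process(self, msg_id, value):
        # A redelivered message is recognised by its id and skipped.
        if msg_id in self.seen_ids:
            return False
        self.seen_ids.add(msg_id)
        self.total += value
        return True


def deliver_at_least_once(messages, sink, fail_first_ack=frozenset()):
    """Deliver every message; ids in fail_first_ack simulate a lost
    acknowledgement, so the source resends those messages once."""
    deliveries = 0
    for msg_id, value in messages:
        sink.process(msg_id, value)
        deliveries += 1
        if msg_id in fail_first_ack:
            # The ack was lost, so the source redelivers the message.
            sink.process(msg_id, value)
            deliveries += 1
    return deliveries


messages = [(1, 10), (2, 20), (3, 30)]
sink = IdempotentSink()
deliveries = deliver_at_least_once(messages, sink, fail_first_ack={2})
print(deliveries, sink.total)  # 4 deliveries, but the total stays 60
```

Message 2 is delivered twice, yet the aggregate counts it once. Without the id check the total would be 80, which is why a non-idempotent job gives wrong results under at-least-once delivery.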
State management
Stateless stream processing
Stateful stream processing
in-memory
Replicated, queryable persistent storage
Fault Tolerance
State machine
achieved through replication
Rollback Recovery
Streaming Algorithms
Streaming algorithms
Stream mining
Data stream query types
Potentially infinite
high amount of data
high speed of data
a processed element is discarded or archived
Constraints
data stream computation model
single pass: each record is examined once
synopsis - stored in memory
bounded storage - limited memory
real time: a record must be processed as fast as possible
approximate answer: streaming algorithms compute approximate answers
Concept drift: the statistical properties of the stream change over time
Load shedding
Stream time and event time
sliding window technique
Tumbling window
Time based tumbling
count based tumbling
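The two tumbling variants above can be sketched in a few lines. This is an illustrative sketch, assuming events arrive as `(timestamp, value)` pairs with integer timestamps; the function names `time_tumbling` and `count_tumbling` are made up for this example.

```python
from collections import defaultdict


def time_tumbling(events, width):
    """Group (timestamp, value) events into non-overlapping windows of
    `width` time units, keyed by the window's start time."""
    windows = defaultdict(list)
    for ts, value in events:
        start = (ts // width) * width  # start of the window containing ts
        windows[start].append(value)
    return dict(windows)


def count_tumbling(values, size):
    """Emit a window after every `size` records, regardless of time."""
    window = []
    for value in values:
        window.append(value)
        if len(window) == size:
            yield window
            window = []
    if window:  # trailing partial window
        yield window


events = [(0, "a"), (3, "b"), (5, "c"), (9, "d"), (10, "e")]
by_time = time_tumbling(events, width=5)
by_count = list(count_tumbling([v for _, v in events], size=2))
print(by_time)   # {0: ['a', 'b'], 5: ['c', 'd'], 10: ['e']}
print(by_count)  # [['a', 'b'], ['c', 'd'], ['e']]
```

In both variants each record belongs to exactly one window. A sliding window differs in that consecutive windows overlap.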
Well-known data streaming algorithms
Reservoir sampling (random sampling)
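A minimal sketch of reservoir sampling, using the classic single-pass scheme often called Algorithm R. It keeps a uniform random sample of `k` records from a stream of unknown length using only `k` slots of memory. The function name and the `rng` parameter are illustrative choices.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from `stream`,
    reading the stream once and storing at most k items."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(100_000), k=5, rng=random.Random(42))
print(sample)
```

If the stream holds fewer than `k` records, the whole stream is returned.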
HyperLogLog: count me once and fast. (distinct)
Count-Min Sketch: How many times has stream record X occurred?
Point query
Interest in certain stream item.
Range Query
interested in frequencies in a given range
Inner product query
Interested in the join size of two streams, estimated from their two sketches
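The three query types above can be sketched on a small Count-Min Sketch. This is an illustrative sketch, not a tuned implementation: the class, its default `width` and `depth`, and the use of salted BLAKE2b as the per-row hash are assumptions made for this example, and the range query is the naive sum of point queries over integer keys.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.blake2b(
            repr(item).encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def point_query(self, item):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )

    def range_query(self, lo, hi):
        # Naive range query: sum of point queries over integer keys lo..hi.
        return sum(self.point_query(key) for key in range(lo, hi + 1))

    def inner_product(self, other):
        # Join-size estimate; both sketches must share shape and hash functions.
        assert (self.width, self.depth) == (other.width, other.depth)
        return min(
            sum(a * b for a, b in zip(row_a, row_b))
            for row_a, row_b in zip(self.table, other.table)
        )


orders = CountMinSketch()
for key in [1, 1, 2, 3, 3, 3]:
    orders.add(key)

customers = CountMinSketch()
for key in [1, 3, 3, 4]:
    customers.add(key)

print(orders.point_query(3))            # true count is 3
print(orders.range_query(1, 2))         # true count is 3
print(orders.inner_product(customers))  # true join size is 2*1 + 3*2 = 8
```

With non-negative counts, every estimate is at least the true value and can only overshoot when keys collide in every row.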
Bloom filter (can produce false positives): Has item X ever occurred in the stream before?
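A minimal Bloom filter sketch for the membership question above. The class name, the default sizes, and the salted BLAKE2b hashing are illustrative assumptions; a real deployment would size the bit array from the expected number of items and the acceptable false-positive rate.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting the hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                repr(item).encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Present only if every position is set; one unset bit proves absence.
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for record in ["alice", "bob", "carol"]:
    seen.add(record)

print("bob" in seen)   # True: an added item is never reported as absent
print("dave" in seen)  # usually False, but a false positive is possible
```

The filter never stores the items themselves, so memory stays fixed however long the stream runs. The price is that a "yes" answer is only probable, while a "no" answer is certain.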