DST Lecture 5 (Streaming Data Processing (Apache Apex) (structure…

- - - - distributed tasks on multiple machines
      - each machine can run multiple tasks
      - YARN for task execution
      - HDFS for persistent state
  - - - at least once
        
        downstream operators restarted
        
        upstream operators replayed
      - at most once
        
        assume data can be lost
        
        restart operator
        
        subscribe to new data from upstream
      - exactly once
  - - - thread local (intra-thread)
      - container local (intra-process)
      - node local (inter-process, same Hadoop node)
      - rack local (inter-node)
- - - - set of processing elements
      - connected in series
      - output of one element is input of next one
      - carry out data processing job
- - - - multiple executors (threads)
      - executors process tasks
      - stream grouping defines distribution of tuples among tasks
- - - - pull data
      - keep single pointer indicating position in a partition
      - control their own speed of processing
      - organised in groups
        
        message is delivered to group
        
        each partition is assigned to one consumer
        
        only one consumer of group consumes message
        
        different consumers in different groups can retrieve same data
        
        allow queuing on top of publish/subscribe
    - - consist of partitions
        
        ordered, immutable sequences of records
        
        records have unique id within partition
        
        distributed among servers for fault tolerance
        
        leader handles read and write requests
        
        followers replicate
        
        a follower automatically takes over if the leader fails
        
        define consumer parallelism within a group (at most one consumer per partition)
        
        ordering guarantee holds only within a partition
      - publish/subscribe semantics
      - persistent