First Kinesis stream (EMR cluster (Spark consumer (DynamoDB (Hive (EMR…

- - - - Hive
        
        EMR cluster
        
        The stream variable is a DStream object, which represents a sequence of RDDs (one per batch interval), and we can apply transformation functions to it
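To make the "sequence of RDDs" idea concrete, here is a minimal plain-Python sketch that does not use Spark at all. It assumes a DStream can be modeled as a list of micro-batches, where each inner list stands in for one RDD. The helper names (`transform`, `map_stream`, `flat_map_stream`, `filter_stream`, `count_stream`) are hypothetical and only mirror the semantics of the DStream operations of similar name.

```python
# Plain-Python model, NOT the Spark API: a "stream" is a list of micro-batches,
# and each inner list plays the role of one RDD.

def transform(batches, func):
    """Apply a batch-to-batch function to every batch (models transform(func))."""
    return [func(batch) for batch in batches]

def map_stream(batches, func):
    """Pass each element of every batch through func (models map(func))."""
    return transform(batches, lambda batch: [func(x) for x in batch])

def flat_map_stream(batches, func):
    """Map each element to 0 or more output elements (models flatMap(func))."""
    return transform(batches, lambda batch: [y for x in batch for y in func(x)])

def filter_stream(batches, pred):
    """Keep only the elements on which pred returns true (models filter(func))."""
    return transform(batches, lambda batch: [x for x in batch if pred(x)])

def count_stream(batches):
    """Produce one single-element batch per input batch (models count())."""
    return transform(batches, lambda batch: [len(batch)])

# Three micro-batches of text lines; the second batch is empty.
stream = [["a b", "c"], [], ["d e f"]]
words = flat_map_stream(stream, str.split)
counts = count_stream(words)
```

The point of the model is that every operation is applied batch by batch, so an empty batch stays an empty batch and still yields a count of 0.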
        
        Transformation functions: https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
        
        transform(func) Return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
        
        cogroup(otherStream, [numTasks]) When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
        
        join(otherStream, [numTasks]) When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
        
        reduceByKey(func, [numTasks]) When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
        
        countByValue() When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
        
        reduce(func) Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel.
        
        count() Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
        
        union(otherStream) Return a new DStream that contains the union of the elements in the source DStream and otherStream.
        
        repartition(numPartitions) Changes the level of parallelism in this DStream by creating more or fewer partitions.
        
        filter(func) Return a new DStream by selecting only the records of the source DStream on which func returns true.
        
        flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items.
        
        map(func) Return a new DStream by passing each element of the source DStream through a function func.
        
        updateStateByKey(func) Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.
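The key-based transformations listed above differ mainly in whether they work within a single batch (reduceByKey, join) or carry state across batches (updateStateByKey). The following plain-Python sketch models those semantics under the same assumption that a batch is a list of (key, value) pairs. It is not Spark code, and the function names are hypothetical. The update function takes the new values first and the previous state second, with None meaning "no previous state" on input and "drop this key" on output, which is how I understand the PySpark convention.

```python
from collections import defaultdict
from functools import reduce
from operator import add

def reduce_by_key(batch, func):
    """Aggregate the values of each key within ONE batch (models reduceByKey)."""
    grouped = defaultdict(list)
    for key, value in batch:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

def join(left, right):
    """Pair up all values that share a key within ONE batch (models join)."""
    right_by_key = defaultdict(list)
    for key, w in right:
        right_by_key[key].append(w)
    return [(key, (v, w)) for key, v in left for w in right_by_key.get(key, [])]

def update_state_by_key(batches, update):
    """Carry per-key state ACROSS batches (models updateStateByKey).

    Returns one snapshot of the state per batch.
    """
    state = {}
    snapshots = []
    for batch in batches:
        new_values = defaultdict(list)
        for key, value in batch:
            new_values[key].append(value)
        # Keys with existing state are updated even if the batch has no new values.
        for key in set(state) | set(new_values):
            updated = update(new_values.get(key, []), state.get(key))
            if updated is None:
                state.pop(key, None)   # returning None removes the key
            else:
                state[key] = updated
        snapshots.append(dict(state))
    return snapshots

def running_sum(new_values, previous):
    """Example update function: add this batch's values to the running total."""
    return sum(new_values) + (previous or 0)

batches = [[("a", 1), ("b", 2), ("a", 3)], [("a", 10)]]
per_batch = [reduce_by_key(batch, add) for batch in batches]
totals = update_state_by_key(batches, running_sum)
joined = join([("a", 1), ("a", 2), ("b", 3)], [("a", "x"), ("c", "y")])
```

Here `per_batch` restarts from scratch for every batch, while `totals` keeps key "b" in the second snapshot even though the second batch contains no "b" records.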
        
        DataFrame and SQL Operations
        https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations
        
        Spark producer
        
        Second Kinesis stream
        
        MLlib operations https://spark.apache.org/docs/latest/streaming-programming-guide.html#mllib-operations
- - - - EMR
        
        Output
- - - - S3
        
        Glue
        
        SageMaker
        
        Output