Data Pipeline (Requirements (Low Latency, Scalability, Versioning,…

- - - - RDD
        A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark: an immutable, partitioned collection of elements that can be operated on in parallel.
      - DAG (Directed acyclic graph)
      - SparkContext
      - Transformations
      - Actions
  - - - Considerations
        
        Partitioning
        
        Raw Data vs. Processed
        
        Data Format (CSV -> Parquet)
        
        Logical (Period, Customer, Type)
        
        Format
        
        File Size
    - - Kinesis
        
        Firehose
        
        Stream
        
        Analytics (over Firehose)
      - S3 APIs (files)
    - - AWS Glue
        
        Crawler (Schema/Metadata)
        
        ETL (limited and could be costly)
      - EMR
      - Airflow
    - - Athena
      - Redshift (Spectrum)
      - QuickSight
    - - Redshift
      - DynamoDB
      - RDS
      - Elasticsearch
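
The "Logical (Period, Customer, Type)" partitioning consideration above could be sketched as a small helper that builds Hive-style `key=value` object keys, a layout that query engines such as Athena can use for partition pruning. This is only an illustration under assumptions: the function name `partition_key`, the `processed` prefix, the month-level period, and the sample values are all hypothetical and not taken from the map.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_key(root: str, period: date, customer: str,
                  record_type: str, filename: str) -> str:
    """Build a Hive-style object key partitioned by period, customer and type.

    Hypothetical helper: the partition names and month granularity are
    assumptions chosen for illustration.
    """
    parts = [
        f"period={period:%Y-%m}",   # one partition per calendar month
        f"customer={customer}",
        f"type={record_type}",
        filename,
    ]
    # PurePosixPath joins with "/" regardless of the local operating system.
    return str(PurePosixPath(root, *parts))


key = partition_key("processed", date(2024, 3, 15), "acme", "orders",
                    "part-0000.parquet")
print(key)
```

Keeping raw and processed data under separate prefixes, and writing the processed side as Parquet, would fit the other considerations in the same branch; the right partition order depends on which filters the queries use most often.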