Data Pipeline
Requirements
Low Latency
Scalability
Versioning
Monitoring
Quality
Architecture
Apache Spark
Spark SQL
Spark Streaming
MLlib (Machine Learning)
GraphX (Graph Processing)
Concepts and Abstractions
RDD
A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark: an immutable, partitioned collection of elements that can be operated on in parallel.
DAG (Directed Acyclic Graph)
SparkContext
Transformations
Actions
AWS
Data Lake
Considerations
Partitioning
Raw Data vs. Processed
Data Format (CSV -> Parquet)
Logical (Period, Customer, Type)
Format
File Size
Ingest
Kinesis
Firehose
Stream
Analytics (over Firehose)
S3 APIs (files)
ETL
AWS Glue
Crawler (Schema/Metadata)
ETL (limited and could be costly)
EMR
Airflow
Analysis
Athena
Redshift (Spectrum)
QuickSight
Store
Redshift
DynamoDB
RDS
Elasticsearch
KNIME
Data Types
Raw Data (Bronze)
Processed Data (Silver)
Cooked Data (Gold)
Data Sources
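The Data Lake considerations above (logical partitioning by Period, Customer and Type, plus the Bronze/Silver/Gold data types) could translate into an object-key layout along the following lines. This is only a minimal sketch: the function name `partition_key`, the lowercase tier names, the order of the partition columns and the monthly period granularity are all assumptions made for illustration, not something the map specifies. The `key=value` path segments follow the Hive-style convention that Athena and Glue can recognise as partitions.

```python
from datetime import date

# Tier names taken from the "Data Types" branch; lowercase spelling is an assumption.
TIERS = ("bronze", "silver", "gold")


def partition_key(tier, record_type, customer, period, filename):
    """Build a Hive-style object key (hypothetical layout).

    Result shape: <tier>/type=<type>/customer=<customer>/period=<YYYY-MM>/<filename>
    """
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    return (
        f"{tier}/type={record_type}/customer={customer}/"
        f"period={period:%Y-%m}/{filename}"
    )


# Raw CSV lands in bronze; the same logical partition holds Parquet in silver.
raw_key = partition_key("bronze", "orders", "acme", date(2024, 3, 15), "part-0000.csv")
processed_key = partition_key("silver", "orders", "acme", date(2024, 3, 15), "part-0000.parquet")
print(raw_key)
print(processed_key)
```

Keeping the tier as the top-level prefix separates raw from processed data, while the partition columns let a query engine skip objects outside the requested period, customer or type. Which column should come first depends on the dominant query pattern, which the map leaves open.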