Data Pipeline
Architecture
Data Sources
Apache Spark
AWS
Spark SQL
Spark Streaming
MLlib (Machine Learning)
GraphX (Graph Processing)
KNIME
Concepts and Abstractions
RDD
A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.
DAG (Directed Acyclic Graph)
SparkContext
Transformations
Actions
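The concepts above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Spark's API: the class `ToyRDD` and its methods are hypothetical stand-ins. It assumes the usual reading of the terms: transformations are lazy and only extend a lineage of pending steps (here a simple chain, the simplest kind of DAG), while actions run that lineage over every partition and return a result.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable partitions plus a lazy lineage."""

    def __init__(self, partitions, lineage=()):
        # Tuples keep the source data immutable.
        self._partitions = tuple(tuple(p) for p in partitions)
        # Pending transformations, in the order they were requested.
        self._lineage = tuple(lineage)

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def _compute(self, partition):
        # Replay the lineage over a single partition.
        items = iter(partition)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    # Actions: trigger the computation on every partition.
    def collect(self):
        return [x for p in self._partitions for x in self._compute(p)]

    def count(self):
        return sum(len(self._compute(p)) for p in self._partitions)


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])  # two partitions
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(evens_squared.collect())  # action: the lineage runs only now
print(numbers.collect())        # the original collection is unchanged
```

In real Spark the partitions would be processed in parallel across executors; the toy version walks them one after another.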
Requirements
Low Latency
Scalability
Versioning
Monitoring
Quality
Data Types
Raw Data (Bronze)
Processed Data (Silver)
Cooked Data (Gold)
Data Lake
Considerations
Partitioning
Raw Data vs. Processed
Data Format (CSV -> Parquet)
Logical (Period, Customer, Type)
Format
File Size
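The partitioning considerations above can be sketched as a small helper that builds object keys for the lake. This is one possible layout, not a prescribed one: the function name `lake_key`, the layer names, and the sample values (`orders`, `acme`, the file names) are assumptions made for illustration. It uses the Hive-style `key=value` folder convention and puts the logical dimensions (type, customer, period) into the path, with raw CSV and processed Parquet kept in separate layers.

```python
from datetime import date


def lake_key(layer, data_type, customer, period, filename):
    """Build a key of the form layer/type=.../customer=.../period=YYYY-MM/filename."""
    if layer not in ("bronze", "silver", "gold"):
        raise ValueError(f"unknown layer: {layer}")
    return (
        f"{layer}/type={data_type}/customer={customer}/"
        f"period={period:%Y-%m}/{filename}"
    )


# Raw data lands as CSV in the bronze layer.
raw_key = lake_key("bronze", "orders", "acme", date(2024, 3, 15), "part-0000.csv")

# The processed copy of the same slice is stored as Parquet in the silver layer.
clean_key = lake_key("silver", "orders", "acme", date(2024, 3, 15), "part-0000.parquet")

print(raw_key)
print(clean_key)
```

Monthly periods are an arbitrary choice here; the right granularity depends on query patterns and on keeping individual files from becoming too small.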
Ingest
Kinesis
Firehose
Stream
Analytics (over Firehose)
ETL
AWS Glue
Crawler (Schema/Metadata)
ETL (limited and could be costly)
EMR
Analysis
Athena
Redshift (Spectrum)
QuickSight
S3 APIs (files)
Store
Redshift
DynamoDB
RDS
Elasticsearch
Airflow