Data Pipeline

Architecture

Data Sources

Apache Spark

AWS

Spark SQL

Spark Streaming

MLlib (Machine Learning)

GraphX (Graph Processing)

KNIME

Concepts and Abstractions

RDD
A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.
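
The three properties named above (immutable, partitioned, parallel) can be illustrated with a toy model in plain Python. This is only a sketch and not the Spark API: `ToyRDD`, `parallelize`, `map`, and `collect` are hypothetical stand-ins that borrow Spark's vocabulary, and threads on one machine take the place of executors on a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from math import ceil
from typing import Callable, Iterable, Tuple


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, partitions: Tuple[tuple, ...]):
        # Tuples of tuples: the data cannot be modified after construction.
        self._partitions = partitions

    @classmethod
    def parallelize(cls, data: Iterable, num_partitions: int) -> "ToyRDD":
        items = list(data)
        # Contiguous chunks keep the original element order;
        # this yields at most num_partitions partitions.
        size = max(1, ceil(len(items) / num_partitions))
        parts = tuple(tuple(items[i:i + size]) for i in range(0, len(items), size))
        return cls(parts)

    def num_partitions(self) -> int:
        return len(self._partitions)

    def map(self, fn: Callable) -> "ToyRDD":
        # Each partition is processed independently, here on a thread pool.
        def apply(part: tuple) -> tuple:
            return tuple(fn(x) for x in part)

        with ThreadPoolExecutor() as pool:
            new_parts = tuple(pool.map(apply, self._partitions))
        # A new ToyRDD is returned; the original is left untouched.
        return ToyRDD(new_parts)

    def collect(self) -> list:
        # Gather all partitions back into one local list.
        return [x for part in self._partitions for x in part]


numbers = ToyRDD.parallelize(range(1, 7), 3)
squares = numbers.map(lambda x: x * x)

print(numbers.num_partitions())  # 3
print(numbers.collect())         # [1, 2, 3, 4, 5, 6]
print(squares.collect())         # [1, 4, 9, 16, 25, 36]
```

Note that `map` never changes `numbers`; it produces a second collection. That is the sense in which an RDD is immutable.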

DAG (Directed Acyclic Graph)

SparkContext

Transformations

Actions

Requirements

Low Latency

Scalability

Versioning

Monitoring

Quality

Data Types

Raw Data (Bronze)

Processed Data (Silver)

Cooked Data (Gold)

Data Lake

Considerations

Partitioning

Raw Data vs. Processed

Data Format (CSV -> Parquet)

Logical (Period, Customer, Type)
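
One way to picture logical partitioning is as an object key prefix built from the partition keys. The sketch below assumes a Hive-style `key=value` folder layout, which the outline does not specify. The function name `partition_prefix`, the layer name `silver`, and the sample values are hypothetical.

```python
from datetime import date


def partition_prefix(layer: str, period: date, customer: str, record_type: str) -> str:
    """Build a Hive-style key prefix from the logical partition keys (assumed layout)."""
    # The period is truncated to year-month, so one folder holds one month of data.
    return f"{layer}/period={period:%Y-%m}/customer={customer}/type={record_type}/"


prefix = partition_prefix("silver", date(2024, 3, 15), "acme", "orders")
print(prefix)  # silver/period=2024-03/customer=acme/type=orders/
```

With a layout like this, a query filtered on period, customer, or type only needs to read the matching folders. The order of the keys in the path should follow the most common filter.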

Format

File Size

Ingest

Kinesis

Firehose

Stream

Analysis (over Firehose)

ETL

AWS Glue

Crawler (Schema/Metadata)

ETL (limited and could be costly)

EMR

Analysis

Athena

Redshift (Spectrum)

QuickSight

S3 APIs (files)

Store

Redshift

DynamoDB

RDS

Elasticsearch

Airflow