Data Engineering - Coggle Diagram
Kinesis
- Managed data streaming service
- Can be integrated with other services for data ingestion
- S3 and Redshift are common destinations for ingested data
Kinesis Data Stream
- Low latency streaming ingestion at scale
- Streams are divided into shards (partitions), so multiple consumers can consume them in parallel
- A shard supports writes of 1 MB/sec and 1,000 records/sec
- A shard supports reads of 2 MB/sec (shared across consumers)
Capacity modes
On-demand
- Shards scale automatically
Provisioned
- The number of shards is pre-configured and fixed
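For provisioned mode, the per-shard limits above translate into a simple sizing rule; a minimal sketch (hypothetical function name, assuming the classic 1 MB/s-in, 1,000 records/s, 2 MB/s-out shard limits):

```python
import math

def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Estimate the provisioned shard count from Kinesis per-shard limits:
    1 MB/s and 1,000 records/s for writes, 2 MB/s for reads."""
    by_write = math.ceil(write_mb_per_sec / 1.0)
    by_records = math.ceil(records_per_sec / 1000.0)
    by_read = math.ceil(read_mb_per_sec / 2.0)
    # The stream must satisfy all three limits at once.
    return max(by_write, by_records, by_read, 1)
```

For example, 3 MB/s of writes at 2,500 records/s with 10 MB/s of aggregate reads is read-bound and needs 5 shards.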
Kinesis Client Library
- Tracks consumer progress (checkpoints) in a KDS stream
- Lets consumption resume where it left off, so streaming can continue over time
Input Sources
- Suitable for ingesting data at large scale
- DynamoDB, etc.
Kinesis Data Firehose
- Stores streaming data from KDS into various destinations
- Has an internal buffer with a configured size and timeout
- Buffered data is flushed all at once when either limit is reached
- Supported destinations: S3, Redshift, ElasticSearch, Splunk, etc.
Unsupported destinations
- RDS, DynamoDB, Aurora, destinations outside AWS
- However, AWS Lambda or Kinesis Data Streams can be used as an intermediary to reach almost any service
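The buffer behaviour above can be sketched as a flush check; illustrative only, with the common S3 defaults of 5 MB / 300 s assumed as parameters:

```python
def should_flush(buffered_bytes, seconds_since_flush,
                 size_limit_mb=5, interval_s=300):
    """Firehose flushes the buffer when either the size limit or the
    buffering interval is reached, whichever comes first."""
    size_hit = buffered_bytes >= size_limit_mb * 1024 * 1024
    time_hit = seconds_since_flush >= interval_s
    return size_hit or time_hit
```

This is why Firehose is near-real-time rather than real-time: a trickle of records can sit in the buffer until the interval expires.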
Kinesis Data Analytics
- SQL based real-time analytics on streams
- Only supports Kinesis Data Streams or Kinesis Data Firehose as the source, with S3 usable for reference data
- Almost fully managed with low operational overhead
Kinesis Video Streams
- Cannot output directly to S3 (although S3 is used in the background)
- Can be consumed by EC2 instances
Managed Streaming for Apache Kafka (MSK)
- Fully managed auto-scaling Kafka on AWS
- Producer and Consumer code must be provided
- Almost any source and destination can be used (but the code has to be provided)
Elastic MapReduce (EMR)
- AWS managed Hadoop clusters
- Automatically provisions EC2 instances
- Apache Hive can be used to read data from DynamoDB
- EMRFS natively supports reading/writing data from/to S3
Nodes
Master
- Cluster management, coordination, etc.
Core
- Runs tasks
- Stores data for the lifetime of the cluster
Task
- Optional nodes added to increase processing speed
Instance Configuration
Uniform Instance Group
- Choose a single instance type
- Scales automatically, but always with the same instance type
Instance Fleet
- Choose the target capacity
- Instance types are chosen automatically by EMR to meet that capacity
AWS Glue
- Serverless data integration service
- Extracts, transforms, and loads data from various sources
Crawler
- A task that probes the data source and populates the AWS Glue Data Catalog with metadata table definitions
AWS Glue Job
- A user-written script that extracts, transforms, and loads the data
- Can be scheduled or react to events
Glue Data Catalog
- Persistent metadata written by the Crawler
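A crawler is usually defined through the boto3 `glue.create_crawler` call; a sketch of the kwargs it accepts (the crawler name, role ARN, S3 path, and schedule below are placeholders):

```python
def crawler_definition(name, role_arn, s3_path, database):
    """Build the kwargs for boto3's glue.create_crawler.
    The crawler scans the S3 path and writes table definitions
    into the given Data Catalog database."""
    return {
        "Name": name,
        "Role": role_arn,              # IAM role Glue assumes to read the source
        "DatabaseName": database,      # target Data Catalog database
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": "cron(0 2 * * ? *)",  # optional: run nightly at 02:00 UTC
    }
```

Passing this dict to `boto3.client("glue").create_crawler(**kwargs)` would register the crawler; running it then fills the Data Catalog described above.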
Redshift
- Online analytical processing
- Fast data warehousing
- Large-scale data processing
Redshift Workload Management (WLM)
- Treats queries submitted by different users according to priority
- Queries are scheduled into dedicated queues depending on urgency, expected runtime, and the user's privileges
Redshift Concurrency Scaling
- In addition to the configured number of compute nodes, a concurrency-scaling cluster is provisioned under high demand
- Additional nodes are provisioned automatically in the concurrency-scaling cluster as needed
Notable Commands
UNLOAD
- Exports a query result to an S3 bucket
COPY
- Loads data from S3 (or other sources) into Redshift
VACUUM
- Reclaims disk space consumed by deleted rows
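UNLOAD and COPY are plain SQL statements run against the cluster; small helpers that build example statements (the bucket paths and IAM role ARNs are placeholders, and Parquet is just one of the supported formats):

```python
def unload_sql(query, s3_prefix, iam_role):
    """Build a Redshift UNLOAD statement that exports a query result to S3."""
    return (f"UNLOAD ('{query}') TO '{s3_prefix}' "
            f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET;")

def copy_sql(table, s3_prefix, iam_role):
    """Build a Redshift COPY statement that loads S3 data into a table."""
    return (f"COPY {table} FROM '{s3_prefix}' "
            f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET;")
```

A typical round trip: UNLOAD a query result to S3, transform it elsewhere (e.g. Glue), then COPY it back into a staging table.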