Kinesis Family - Coggle Diagram
Kinesis Family
Kinesis Data Streams
Features:
- Low latency streaming ingest at scale
- Streams are divided into ordered Shards / Partitions
- Data retention is 24 hours by default, extendable up to 365 days
- Ability to reprocess / replay data
- Multiple applications can consume the same stream
- Real time processing with scale of throughput
- Immutability: once data is inserted in Kinesis, it can’t be deleted
Kinesis Data Streams Shards:
- Billing is per shard provisioned, can have as many shards as I want
- Batching available or per message calls
- Number of shards can evolve over time (reshard /merge)
- Kinesis Data Streams Records are ordered per shard
Kinesis Data Streams Record:
- Consists of 3 components (image)
- Data Blob: the data being sent, serialized as bytes. Up to 1 MB. Can represent anything
- Record Key:
- sent alongside a record, helps to group records into Shards. Same key = Same shard
- use a highly distributed key to avoid the hot partition problem
- e.g. User ID
- Sequence number: unique identifier for each record put in a shard. Added by Kinesis after ingestion
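The key-to-shard mapping above can be sketched in a few lines. This is a simplified model, assuming each shard owns an equal slice of the 128-bit MD5 hash-key space (real streams can have uneven ranges after resharding):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Model of Kinesis routing: MD5-hash the partition key to a
    128-bit integer, then find the shard whose hash-key range
    contains it (ranges assumed evenly split here)."""
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(hashed // range_size, num_shards - 1)

# Same key always lands on the same shard -> ordering is preserved per key
assert shard_for_key("user-42", 4) == shard_for_key("user-42", 4)
```

This is why a highly distributed key (e.g. User ID) spreads load, while a low-cardinality key (e.g. "iOS" / "Android") funnels everything into one or two hot shards.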
Kinesis Data Streams Limitations:
- Producer:
- 1MB/s or 1000 messages/s at write PER SHARD
- ProvisionedThroughputExceededException if you go over the limit
- 2 types of Consumer
- Consumer Classic:
- 2MB/s at read PER SHARD across all consumers
- 5 API calls per second PER SHARD across all consumers
- Consumer Enhanced Fan Out:
- 2MB/s at read PER SHARD, PER ENHANCED CONSUMER
- No API calls needed (push model)
- Scale better
Kinesis Producers: (image)
- 4 types of data producers to Kinesis Data Streams
- Kinesis SDK
- Kinesis Producer Library (KPL)
- Kinesis Agent
- 3rd party libraries: Spark, Log4J Appenders, Flume, Kafka Connect, NiFi
Kinesis Producer SDK:
- PutRecord (1) and PutRecords (>1)
- PutRecords uses batching and increases throughput -> fewer HTTP requests
- ProvisionedThroughputExceededException if you go over the limit
- Can be used in applications or on mobile devices -> AWS Mobile SDK: Android, iOS
- Use cases: low throughput, higher latency, simple API, AWS Lambda
- Managed AWS resources that use the Kinesis Producer SDK behind the scenes:
- CloudWatch Logs
- AWS IoT
- Kinesis Data Analytics
How to handle a ProvisionedThroughputExceeded exception?
- It means exceeding the number of MB/s or the number of records/s I can send to any Shard
- Make sure I don't have a hot shard (e.g. my partition key is bad and too much data goes to one partition)
- Solution:
- Retries with back off (retries after X seconds)
- Increase shards (scaling)
- Ensure your partition key is a good one (e.g. use Device ID instead of iOS or Android)
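The "retries with backoff" solution can be sketched as a generic wrapper. This is a minimal illustration, not AWS SDK code: `put_fn` stands in for a real put call, and the throttling error is simulated with a plain `RuntimeError`:

```python
import random
import time

def put_with_backoff(put_fn, record, max_retries=5, base_delay=0.1):
    """Retry a put call with exponential backoff plus jitter when the
    stream throttles (throttling simulated here as RuntimeError)."""
    for attempt in range(max_retries + 1):
        try:
            return put_fn(record)
        except RuntimeError:
            if attempt == max_retries:
                raise  # out of retries, surface the error
            # wait base_delay * 2^attempt seconds, plus random jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

The exponential growth spaces out retries so a briefly overloaded shard can recover, and the jitter keeps many producers from retrying in lockstep.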
Kinesis Producer Library (KPL):
- Easy to use and highly configurable C++ / Java library
- Used for building high performance, long running producers
- Automated and configurable retry mechanism
- Synchronous or Asynchronous API (better performance for Asynchronous)
- Submits metrics to CloudWatch for monitoring
- Batching (both mechanisms below are turned on by default) increases throughput and decreases cost:
- Collection: collect records and write to multiple shards in the same PutRecords API call
- Aggregation: increased latency, but increased efficiency
- Capability to store multiple records in one record (go over the 1000 records per second limit)
- Increased payload size and improved throughput (maximize the 1 MB/s limit)
- Compression must be implemented by the user
- KPL records must be decoded with KCL or special helper library
How does the Kinesis Producer Library (KPL) do Batching through Collection and Aggregation? (image)
- We can influence the batching efficiency by introducing some delay with RecordMaxBufferedTime (default 100ms)
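A toy model of the aggregation side of this batching (the real KPL is a C++/Java library and wraps records in a protobuf envelope; the class below is purely illustrative): records are buffered until either the 1 MB payload limit or the RecordMaxBufferedTime delay is hit, then flushed as one aggregated record.

```python
import time

class Aggregator:
    """Toy KPL-style aggregation: buffer user records and flush them
    as one aggregated record when either the size limit (~1 MB) or
    RecordMaxBufferedTime (default 100 ms) is reached."""

    def __init__(self, max_bytes=1_000_000, max_buffered_ms=100):
        self.max_bytes = max_bytes
        self.max_buffered_ms = max_buffered_ms
        self.buffer, self.buffered_bytes = [], 0
        self.first_put = None  # when the oldest buffered record arrived

    def put(self, record: bytes):
        """Buffer a record; returns the aggregated blob on flush, else None."""
        if self.first_put is None:
            self.first_put = time.monotonic()
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        age_ms = (time.monotonic() - self.first_put) * 1000
        if self.buffered_bytes >= self.max_bytes or age_ms >= self.max_buffered_ms:
            return self.flush()
        return None

    def flush(self):
        # Real KPL serializes with a protobuf envelope; we just concatenate.
        aggregated = b"".join(self.buffer)
        self.buffer, self.buffered_bytes, self.first_put = [], 0, None
        return aggregated
```

Raising `max_buffered_ms` trades latency for fewer, fuller aggregated records, which is exactly the RecordMaxBufferedTime knob described above.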
Kinesis Agent:
- Monitor Log files and sends them to Kinesis Data Streams
- Java based agent, built on top of KPL
- Installed in Linux-based server environments
- Can watch multiple directories and write to multiple streams
- Routing feature based on directory / log file
- Pre-process data before sending to streams (single line, csv to json, log to json)
- Agent handles file rotation, checkpointing, and retry upon failures
- Emits metrics to CloudWatch for monitoring
Kinesis Consumers - Classic: (image)
- 6 types of data consumers from Kinesis Data Streams
- Kinesis SDK
- Kinesis Client Library (KCL)
- Kinesis Connector Library
- 3rd party libraries: Spark, Log4J Appenders, Flume, Kafka Connect
- Kinesis Data Firehose
- Lambda
Kinesis Consumer SDK - GetRecords:
- Classic Kinesis: Records are polled by consumers from a shard
- Each shard has 2 MB total aggregate throughput
- e.g. 3 shards will have 6 MB/s of aggregated throughput, but each shard itself gets 2 MB/s of its own
- GetRecords returns up to 10 MB of data (then I must wait 5 seconds before getting another set of records) or up to 10,000 records
- Maximum of 5 GetRecords API calls per shard per second (can only get records 5 times per second) = 200ms latency
- If 5 consumer applications consume from the same shard, every consumer can poll once per second and receives less than 400 KB/s (image)
Kinesis Client Library (KCL):
- Java first library but exists for other languages too (Golang, Python, Ruby, Node, .NET)
- Read records from Kinesis produced with the KPL (de-aggregation)
- Shares shards across multiple consumers in one “group”, with shard discovery (lets multiple application instances consume a Kinesis data stream as a group)
- Checkpointing feature to resume progress
- Leverages DynamoDB for coordination and checkpointing (one row per shard) (image)
- Make sure you provision enough WCU / RCU
- Or use On Demand for DynamoDB
- Otherwise DynamoDB may slow down KCL
- Record processors will process the data
- ExpiredIteratorException => increase WCU
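A rough sketch of what the KCL's DynamoDB-backed checkpointing achieves, with a plain dict standing in for the lease table (one row per shard); all names here are illustrative, not the KCL's actual schema:

```python
# In-memory stand-in for the DynamoDB lease table the KCL maintains:
# one row per shard, tracking the owning worker and the last
# checkpointed sequence number.
lease_table = {}

def checkpoint(shard_id: str, sequence_number: str, worker: str):
    """Record progress so a restarted worker resumes after this record."""
    lease_table[shard_id] = {"owner": worker, "checkpoint": sequence_number}

def resume_position(shard_id: str):
    """Where a worker should resume reading: after the last checkpoint,
    or from the start of the shard (TRIM_HORIZON) if none exists."""
    row = lease_table.get(shard_id)
    return row["checkpoint"] if row else "TRIM_HORIZON"

checkpoint("shardId-000", "49590338271", "worker-1")
```

Because every checkpoint is a DynamoDB write, an under-provisioned table (too few WCU) throttles this step, which is why the notes above recommend On-Demand mode or generous WCU/RCU.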
Kinesis Connector Library:
- Older Java library (2016), leverages the KCL library
- Write data to (image)
- Amazon S3 (can be replaced by Kinesis Data Firehose)
- DynamoDB (can be replaced by Lambda)
- Redshift (can be replaced by Kinesis Data Firehose)
- ElasticSearch (can be replaced by Kinesis Data Firehose)
- Must run on EC2
Lambda sourcing from Kinesis:
- Lambda can read records from Kinesis Data Streams
- Lambda consumer has a library to de-aggregate records from the KPL
- Lambda can be used to run lightweight ETL to S3, DynamoDB, Redshift, ElasticSearch
- Lambda can be used to trigger notifications / send emails in real time
- Lambda has a configurable batch size (more in Lambda section)
Kinesis Consumers - Enhanced Fan Out:
- Each Consumer get 2 MB/s of provisioned throughput per shard (image)
- That means 20 consumers will get 40MB/s per shard aggregated
- Kinesis pushes data to consumers over HTTP/2
- Reduce latency (~70ms)
Enhanced Fan-Out vs Standard Consumers:
- Standard consumers:
- Low number of consuming applications
- Can tolerate ~200ms latency
- Minimize cost
- Enhanced Fan Out Consumers:
- Multiple Consumer applications for the same Data Stream
- Low Latency requirements ~70ms
- Higher costs
- Default limit of 5 consumers using enhanced fan out per Data Stream
Kinesis Operations - Adding Shards:
- Also called Shard Splitting
- Can be used to increase the Data Stream capacity (1 MB/s data in per shard)
- Can be used to divide a Hot shard (image)
- Old shard is closed and will be deleted once the data is expired
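Conceptually, splitting a shard halves its hash-key range. The sketch below assumes an even split at the midpoint (the real SplitShard API lets you pick the starting hash key of the new child):

```python
def split_shard(start: int, end: int):
    """Model of shard splitting: divide a shard's 128-bit hash-key
    range [start, end] into two child ranges. The parent shard is
    closed and deleted once its data expires."""
    mid = (start + end) // 2
    return (start, mid), (mid + 1, end)

# Splitting a full-range shard yields two children covering each half
left, right = split_shard(0, 2**128 - 1)
```

Each child now receives half of the keys the hot parent did, doubling that slice of the stream's write capacity (1 MB/s in per shard).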
Kinesis Operations - Merging Shards:
- Decrease the Stream capacity and save costs
- Can be used to group two shards with low traffic (image)
- Old shards are closed and deleted based on data expiration
Kinesis Operations Auto Scaling:
- Auto Scaling is not a native Kinesis feature
- The API call to change the number of shards is UpdateShardCount
- Can implement Auto Scaling with Lambda
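A Lambda-based autoscaler ultimately has to decide what shard count to pass to UpdateShardCount. One plausible sizing rule (the function name and 25% headroom factor are my assumptions, not an AWS formula):

```python
import math

def target_shard_count(incoming_mb_per_s: float,
                       write_limit_mb: float = 1.0,
                       headroom: float = 1.25):
    """Pick a shard count for UpdateShardCount: enough shards to
    absorb the observed ingest rate (1 MB/s write limit per shard)
    plus ~25% headroom so traffic spikes don't throttle."""
    return max(1, math.ceil(incoming_mb_per_s * headroom / write_limit_mb))

assert target_shard_count(8.0) == 10  # 8 MB/s * 1.25 -> 10 shards
```

In practice the Lambda would read the observed MB/s from CloudWatch metrics and call UpdateShardCount with this target, subject to the resharding limitations below.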
Kinesis Scaling Limitations:
- Resharding cannot be done in parallel. Plan capacity in advance
- Can only perform 1 resharding operation at a time and it takes a few seconds
- For 1000 shards, it takes 30K seconds (8.3 hours) to double to 2000 shards
Kinesis Data Firehose
Features:
- Fully Managed Service, no administration
- Near Real Time (60 seconds latency minimum for non-full batches)
- Gets data from the SDK, KPL, Kinesis Agent, Kinesis Data Streams, CloudWatch Logs & Events, IoT Rules Actions
- Load data into Redshift / Amazon S3 / ElasticSearch / Splunk (image)
- Automatic scaling
- Supports multiple data formats
- Data Conversions from JSON to Parquet / ORC (only for S3)
- Data Transformation from CSV to JSON through Lambda
- Supports compression when target is S3 (GZIP, ZIP, and SNAPPY)
- Only GZIP if the data is further loaded into Redshift
- Pay for the amount of data going through Firehose
- Kinesis Data Firehose Delivery example with Redshift target (image)
Kinesis Data Firehose Buffer Sizing:
- Firehose accumulates records in a buffer
- Buffer is flushed based on time and size rules
- Buffer Size (e.g. 32MB): if that buffer size is reached, it’s flushed
- Buffer Time (ex: 2 minutes): if that time is reached, it’s flushed
- Firehose can automatically increase the buffer size to increase throughput
- High throughput => Buffer Size will be hit
- Low throughput => Buffer Time will be hit
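The "which rule fires first" behavior is just a rate calculation. A small illustrative helper (function name is mine), using the example hints from above:

```python
def first_flush_trigger(throughput_mb_s: float,
                        buffer_mb: float = 32,
                        buffer_seconds: float = 120):
    """Which Firehose buffering rule fires first at a given throughput?
    The buffer is flushed when either the size or the time hint is hit."""
    seconds_to_fill = buffer_mb / throughput_mb_s
    return "size" if seconds_to_fill <= buffer_seconds else "time"

assert first_flush_trigger(1.0) == "size"  # fills 32 MB in 32 s
assert first_flush_trigger(0.1) == "time"  # would need 320 s to fill
```

So high-throughput streams deliver on the size rule (larger, efficient objects in S3), while trickle streams fall back to the time rule and still deliver within the buffer interval.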
Kinesis Data Analytics
Features:
- Perform real-time analytics on stream using SQL
Kinesis General Features:
- Managed alternative to Apache Kafka
- Great for:
- Application logs, metrics, IoT, clickstreams
- real-time big data
- streaming processing frameworks (Spark, NiFi)
- Highly available: data is automatically replicated synchronously to 3 AZs
- Kinesis overview (image)
Kinesis Security:
- Control access / authorization using IAM policies
- Encryption in flight using HTTPS endpoints
- Encryption at rest using KMS
- Client side encryption must be manually implemented (harder)
- VPC Endpoints available for Kinesis to access within VPC