Data Collection
Real time:
- Immediate actions
- Kinesis Data Streams (KDS)
- Simple Queue Service (SQS)
- Internet of Things (IoT)
Kinesis vs IoT
Use Case Comparison:
- SQS:
- Order processing
- Image Processing
- Queues scale automatically with the number of messages
- Buffer and Batch messages for future processing
- Request offloading
- Kinesis Data Streams:
- Fast log and event data collection and processing
- Real Time metrics and reports
- Mobile data capture
- Real Time data analytics
- Gaming data feed
- Complex Stream Processing
- Data Feed from IoT
Features Comparison:
- Kinesis Data Stream vs SQS (excel)
- Kinesis Data Stream vs Kinesis Data Firehose vs SQS (excel)
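To make the difference concrete, here is a minimal boto3 sketch (not from the original notes) contrasting the two producer APIs; the queue URL, stream name and payload are placeholders:

```python
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
kinesis = boto3.client("kinesis", region_name="us-east-1")

order = {"order_id": "1234", "amount": 42.0}

# SQS: decoupled, buffered processing - each message is consumed once and deleted
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue",
    MessageBody=json.dumps(order),
)

# Kinesis Data Streams: ordered, replayable records sharded by partition key,
# suited to real-time analytics with multiple consumers
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(order).encode("utf-8"),
    PartitionKey=order["order_id"],
)
```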
IoT (image)
IoT Device Gateway: (image)
- Serves as the entry point for IoT devices connecting to AWS
- Allows devices to securely and efficiently communicate with AWS IoT
- Supports the MQTT, WebSockets, and HTTP 1.1 protocols
- Fully managed and scales automatically to support over a billion devices
- No need to manage any infrastructure
IoT Message Broker: (image)
- Pub/Sub messaging pattern - low latency
- Devices can communicate with one another this way
- Messages sent using the MQTT, WebSockets, or HTTP 1.1 protocols
- Messages are published into topics (just like SNS)
- Message Broker forwards messages to all clients connected to the topic
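A minimal sketch, assuming boto3's `iot-data` client, of publishing a message to an MQTT topic through the message broker; the topic name and payload are illustrative:

```python
import json
import boto3

# The "iot-data" client talks to the AWS IoT message broker (data plane)
iot_data = boto3.client("iot-data", region_name="us-east-1")

iot_data.publish(
    topic="home/livingroom/temperature",   # topics work much like SNS topics
    qos=1,                                 # MQTT quality of service (0 or 1)
    payload=json.dumps({"temperature": 21.5, "unit": "Celsius"}),
)
```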
IoT Thing Registry = IAM of IoT:
- All connected IoT devices are represented in the AWS IoT registry
- Organizes the resources associated with each device in the AWS Cloud
- Each device gets a unique ID
- Supports metadata for each device (e.g. Celsius vs Fahrenheit)
- Can create X.509 certificate to help IoT devices connect to AWS
- IoT Groups: group devices together and apply permissions to the group
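A hedged boto3 sketch of registering a Thing, attaching an X.509 certificate, and grouping it; the thing/group names and attributes are placeholders:

```python
import boto3

iot = boto3.client("iot", region_name="us-east-1")

# Register a Thing with some metadata (attribute names are illustrative)
iot.create_thing(
    thingName="thermostat-001",
    attributePayload={"attributes": {"unit": "Celsius", "room": "livingroom"}},
)

# Create an X.509 certificate and attach it to the Thing so the device can connect
cert = iot.create_keys_and_certificate(setAsActive=True)
iot.attach_thing_principal(
    thingName="thermostat-001",
    principal=cert["certificateArn"],
)

# Group Things together so permissions can be applied at the group level
iot.create_thing_group(thingGroupName="thermostats")
iot.add_thing_to_thing_group(thingGroupName="thermostats", thingName="thermostat-001")
```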
Authentication:
- 3 possible authentication methods for Things:
- Create X.509 certificates and load them securely onto the Things
- AWS SigV4
- Custom tokens with Custom authorizers
- For mobile apps:
- Cognito identities (extension to Google, Facebook login)
- Web / Desktop / CLI: IAM or federated identities
Authorization:
- AWS IoT policies:
- Attached to X.509 certificates or Cognito Identities
- Able to revoke any device at any time
- IoT Policies are JSON documents
- Can be attached to groups instead of individual Things
- IAM Policies:
- Attached to users, groups or roles
- Used for controlling IoT AWS APIs
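As an illustration (policy name and ARNs are placeholders), an IoT policy is just a JSON document that gets attached to a certificate or Cognito identity:

```python
import json
import boto3

iot = boto3.client("iot", region_name="us-east-1")

# Minimal policy: only allows connecting and publishing to one topic prefix
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["iot:Connect", "iot:Publish"],
            "Resource": ["arn:aws:iot:us-east-1:123456789012:topic/home/livingroom/*"],
        }
    ],
}

iot.create_policy(
    policyName="thermostat-policy",
    policyDocument=json.dumps(policy_document),
)

# Attach the policy to the device's X.509 certificate (or a Cognito identity)
iot.attach_policy(
    policyName="thermostat-policy",
    target="arn:aws:iot:us-east-1:123456789012:cert/EXAMPLE_CERT_ID",
)
```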
Device Shadow: (image)
- JSON document representing the state of a connected Thing
- We can set the state to a different desired state (e.g. light on)
- IoT Thing will retrieve the state when online and adapt
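A minimal sketch of working with a Device Shadow via boto3's `iot-data` client; the thing name is a placeholder and the desired state is the light-on example above:

```python
import json
import boto3

iot_data = boto3.client("iot-data", region_name="us-east-1")

# Set a desired state in the shadow document; the device reads it when it
# comes online and reports its actual state back
iot_data.update_thing_shadow(
    thingName="thermostat-001",
    payload=json.dumps({"state": {"desired": {"light": "on"}}}),
)

# Read the current shadow (desired vs reported state)
shadow = iot_data.get_thing_shadow(thingName="thermostat-001")
print(json.loads(shadow["payload"].read()))
```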
Rules Engine:
- Rules are defined on the MQTT topics
- Rules = when it’s triggered
- Rule action = what it does
- Rules use cases:
- Augment or filter data received from a device
- Write data received from a device to a DynamoDB database
- Save a file to S3
- Send a push notification to all users using SNS
- Publish data to a SQS queue
- Invoke a Lambda function to extract data
- Process messages from a large number of devices using Kinesis
- Send data to the Elasticsearch Service
- Capture a CloudWatch metric and change a CloudWatch alarm
- Send the data from an MQTT message to Amazon Machine Learning to make predictions based on an Amazon ML model
- Rules need IAM Roles to perform their actions
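A sketch of one such rule created with boto3; the SQL filter, role ARN and stream name are assumptions, and the rule's IAM role must allow the Kinesis action:

```python
import boto3

iot = boto3.client("iot", region_name="us-east-1")

# Rule: filter messages on an MQTT topic with SQL, then send the data to Kinesis
iot.create_topic_rule(
    ruleName="high_temperature_to_kinesis",
    topicRulePayload={
        "sql": "SELECT * FROM 'home/+/temperature' WHERE temperature > 30",
        "actions": [
            {
                "kinesis": {
                    "roleArn": "arn:aws:iam::123456789012:role/iot-rules-role",
                    "streamName": "iot-temperature-stream",
                    "partitionKey": "${topic()}",
                }
            }
        ],
        "ruleDisabled": False,
    },
)
```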
IoT Greengrass:
- Brings the compute layer to the device directly
- Can execute AWS Lambda functions on the devices:
- Pre-process the data
- Execute predictions based on ML models
- Keep device data in sync
- Communicate between local devices
- Operate offline
- Deploy functions from the cloud directly to the devices
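For illustration only, a minimal Lambda handler of the kind that could be deployed to a Greengrass core to pre-process readings locally; the payload shape and thresholds are assumptions:

```python
import json

def lambda_handler(event, context):
    # Incoming payload shape ({"device_id", "temperature"}) is an assumption
    reading = event if isinstance(event, dict) else json.loads(event)

    # Filter out obviously bad readings before they ever leave the device
    if not -50 <= reading.get("temperature", 0) <= 150:
        return {"dropped": True}

    # Augment the reading with a locally computed field
    reading["temperature_f"] = reading["temperature"] * 9 / 5 + 32
    return {"dropped": False, "reading": reading}
```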
Amazon Managed Streaming for Apache Kafka (Amazon MSK):
- Alternative to Kinesis
- Fully managed Apache Kafka on AWS
- Allows me to create, update, and delete Kafka clusters. A cluster consists of one or more Kafka brokers (control plane)
- MSK creates & manages Kafka broker nodes & Zookeeper nodes for me. Zookeeper is used to track the status of broker nodes in the Kafka cluster
- Deploy the MSK cluster in my VPC, multi-AZs (up to 3 for HA)
- Automatic recovery from common Apache Kafka failures
- Data is stored on EBS volumes
- I build producers and consumers of data (data plane)
- Can create custom configurations for my clusters
- Default message size of 1MB
- Possible to send large messages (e.g. 10MB) into Kafka after custom configuration
- Apache Kafka at a high level (image)
- Low latency in Kafka 10 - 40ms
- Kinesis Data Streams vs Amazon MSK (image)
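A minimal producer sketch using the kafka-python library against MSK's TLS listener (port 9094); the bootstrap broker addresses and topic are placeholders:

```python
import json
from kafka import KafkaProducer   # pip install kafka-python

# Bootstrap broker addresses come from the MSK console / GetBootstrapBrokers
producer = KafkaProducer(
    bootstrap_servers=[
        "b-1.mycluster.abc123.kafka.us-east-1.amazonaws.com:9094",
        "b-2.mycluster.abc123.kafka.us-east-1.amazonaws.com:9094",
    ],
    security_protocol="SSL",  # in-flight TLS between client and brokers
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Default max message size is 1MB unless the cluster/topic/producer are
# configured for larger messages
producer.send("clickstream", {"user": "42", "action": "page_view"})
producer.flush()
```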
MSK Configurations: (image)
- Choose the number of AZs (3 recommended, or 2)
- Choose the VPC & Subnets
- Broker instance type (e.g. kafka.m5.large)
- Number of brokers per AZ (can add brokers later)
- Size of your EBS volumes (1GB - 16TB)
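A hedged boto3 sketch of those configuration choices; subnet and security group IDs are placeholders:

```python
import boto3

kafka = boto3.client("kafka", region_name="us-east-1")

# One client subnet per AZ (3 recommended)
kafka.create_cluster(
    ClusterName="my-msk-cluster",
    KafkaVersion="2.8.1",
    NumberOfBrokerNodes=3,   # total brokers, spread across the chosen AZs
    BrokerNodeGroupInfo={
        "InstanceType": "kafka.m5.large",
        "ClientSubnets": ["subnet-aaa", "subnet-bbb", "subnet-ccc"],
        "SecurityGroups": ["sg-0123456789abcdef0"],
        "StorageInfo": {"EbsStorageInfo": {"VolumeSize": 1000}},  # GiB per broker
    },
)
```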
MSK Security: (image)
- Encryption:
- Optional in-flight encryption using TLS between the brokers
- Optional in-flight encryption using TLS between the clients and brokers
- At rest for my EBS volumes using KMS
- Network Security:
- Authorize specific security groups for my Apache Kafka clients
- Authentication & Authorization (important):
- Define who can read/write to which topics
- Mutual TLS (AuthN) + Kafka ACLs (AuthZ)
- SASL/SCRAM (AuthN) + Kafka ACLs (AuthZ)
- IAM Access Control (AuthN + AuthZ)
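For example (a sketch, assuming kafka-python and credentials stored elsewhere, e.g. in Secrets Manager), a client authenticating with SASL/SCRAM over TLS:

```python
from kafka import KafkaProducer   # pip install kafka-python

# SASL/SCRAM authentication over TLS (MSK SASL listener is on port 9096);
# username/password below are placeholders
producer = KafkaProducer(
    bootstrap_servers=["b-1.mycluster.abc123.kafka.us-east-1.amazonaws.com:9096"],
    security_protocol="SASL_SSL",
    sasl_mechanism="SCRAM-SHA-512",
    sasl_plain_username="msk-user",
    sasl_plain_password="example-password",
)
```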
MSK Monitoring:
- CloudWatch Metrics:
- Basic monitoring (cluster and broker metrics)
- Enhanced monitoring (++enhanced broker metrics)
- Topic level monitoring (++enhanced topic level metrics)
- Prometheus (Open Source Monitoring):
- Opens a port on the broker to export cluster, broker and topic level metrics
- Setup the JMX Exporter (metrics) or Node Exporter (CPU and disk metrics)
- Broker Log Delivery:
- Delivery to CloudWatch Logs
- Delivery to S3
- Delivery to Kinesis Data Firehose
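A sketch of enabling enhanced metrics and open monitoring with Prometheus exporters via boto3; the cluster ARN is a placeholder:

```python
import boto3

kafka = boto3.client("kafka", region_name="us-east-1")

cluster_arn = "arn:aws:kafka:us-east-1:123456789012:cluster/my-msk-cluster/uuid"
current = kafka.describe_cluster(ClusterArn=cluster_arn)["ClusterInfo"]["CurrentVersion"]

# Turn on per-topic CloudWatch metrics and the JMX / Node exporters for Prometheus
kafka.update_monitoring(
    ClusterArn=cluster_arn,
    CurrentVersion=current,
    EnhancedMonitoring="PER_TOPIC_PER_BROKER",
    OpenMonitoring={
        "Prometheus": {
            "JmxExporter": {"EnabledInBroker": True},
            "NodeExporter": {"EnabledInBroker": True},
        }
    },
)
```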
Near real time:
- Reactive actions
- Kinesis Data Firehose (KDF)
- Database Migration Service (DMS)
Database Migration Service (DMS): (image)
- Quickly and securely migrate databases to AWS; resilient and self-healing
- Source database remains available during the migration
- Supports:
- Homogeneous migrations: e.g. Oracle to Oracle
- Heterogeneous migrations: e.g. Microsoft SQL Server to Aurora
- Continuous Data Replication using CDC (image)
- Must create an EC2 instance to perform the replication tasks
- DMS Sources and Targets (image)
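A minimal boto3 sketch of a full-load + CDC replication task; the endpoint and replication instance ARNs are placeholders created beforehand (create_endpoint / create_replication_instance):

```python
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Select every table in every schema (table-mapping rules are JSON)
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="sqlserver-to-aurora",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",   # full load, then continuous replication (CDC)
    TableMappings=json.dumps(table_mappings),
)
```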
AWS Schema Conversion Tool (SCT): (image)
- Convert Database’s Schema from one engine to another
- e.g. OLTP: (SQL Server or Oracle) to MySQL, PostgreSQL, Aurora
- e.g. OLAP: (Teradata or Oracle) to Redshift
- Prefer compute-intensive instances to optimize data conversions
- Do not need to use SCT if I am migrating the same DB engine
- e.g. On-Premise PostgreSQL => RDS PostgreSQL
- DB engine is still PostgreSQL (RDS is the platform)
Batch:
- Historical analysis
Snowball Family:
- Physical data transport solution that helps move TBs or PBs of data in or out of AWS
- Alternative to move data over the network (and paying network fees)
- Secure, tamper resistant, uses KMS 256-bit encryption
- Tracking using SNS and text messages. E-ink shipping label
- Pay per data transfer job
- Use cases: large data cloud migrations, DC decommission, disaster recovery
- Use Snowball devices if it takes more than a week to transfer over the network
- Upload to S3 Direct Connect vs Snowball (image)
Snow Family - Usage Process:
- Request Snowball devices from the AWS console for delivery
- Install the Snowball client (CLI) or AWS OpsHub (software installed on my computer/laptop to manage the Snow Family device) on my servers
- Connect the Snowball to my servers and copy files using the client
- Ship back the device when I am done (goes to the right AWS facility)
- Data will be loaded into an S3 bucket
- Snowball is completely wiped
- Tracking is done using SNS, text messages and the AWS console
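Besides the Snowball client and OpsHub, a Snowball Edge exposes an S3-compatible endpoint, so the copy step can also be scripted; in this sketch the local endpoint address, credentials and bucket name are placeholders:

```python
import boto3

# Point the S3 client at the Snowball Edge's local S3-compatible endpoint
# (address, port and credentials come from the snowballEdge client)
s3 = boto3.client(
    "s3",
    endpoint_url="http://192.0.2.10:8080",
    aws_access_key_id="LOCAL_ACCESS_KEY",
    aws_secret_access_key="LOCAL_SECRET_KEY",
)

# Copy a file onto the device; it is imported into the matching S3 bucket
# once the device is shipped back to AWS
s3.upload_file("backup.tar.gz", "my-import-bucket", "backups/backup.tar.gz")
```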
Snowball Edge (for data transfers):
- Snowball Edges add computational capability to the device
- Supports a custom EC2 AMI so I can perform processing on the go
- Supports custom Lambda functions
- Very useful to pre-process the data while moving
- Use cases: data migration, image collation, IoT capture, machine learning
- Snowball Edge Storage Optimized:
- 24 vCPU
- 80TB of HDD capacity for block volume and S3 compatible object storage
- Snowball Edge Compute Optimized:
- 52 vCPU and optional GPU
- 42TB of HDD capacity for block volume and S3 compatible object storage
AWS Snowmobile:
- Transfer exabytes of data (1 EB = 1,000 PB = 1,000,000 TBs)
- Each Snowmobile has 100 PB of capacity (use multiple in parallel)
- High security: temperature controlled, GPS, 24/7 video surveillance
- Better than Snowball if you transfer more than 10 PB
Direct Connect (DX): (image)
- Provides a dedicated private connection from a remote network to my VPC
- Dedicated connection must be set up between my DC and AWS Direct Connect locations
- Need to set up a Virtual Private Gateway on my VPC
- Access public resources (S3) and private (EC2) on same connection
- Use Cases:
- Increase bandwidth throughput when working with large data sets (lower cost)
- More consistent network experience for applications using real-time data feeds
- Hybrid Environments (on premise + cloud)
- Supports both IPv4 and IPv6
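A small boto3 sketch of the Virtual Private Gateway step mentioned above; the VPC ID is a placeholder:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a Virtual Private Gateway and attach it to the VPC that the Direct
# Connect private virtual interface will target
vgw = ec2.create_vpn_gateway(Type="ipsec.1")["VpnGateway"]
ec2.attach_vpn_gateway(
    VpcId="vpc-0123456789abcdef0",
    VpnGatewayId=vgw["VpnGatewayId"],
)
```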
Direct Connect Gateway: (image)
- If I want to set up a Direct Connect to one or more VPCs in many different regions (same account), I must use a Direct Connect Gateway