Data Analysis
What is Redshift?
- Fully managed, petabyte-scale data warehouse service
- 10X better performance than other Data Warehouses
- Via machine learning, massively parallel query execution, columnar storage
- Designed for OLAP, not OLTP
- Cost effective, pay as you go model
- SQL, ODBC, JDBC interfaces
- Scale up or down on demand
- Built in replication for availability & backups for durability
- Automatically recovers from component and node failures
- Monitoring via CloudWatch / CloudTrail
- Metrics, compute utilization, storage utilization, R/W traffic to the cluster
- User-defined custom metrics
- Information on query and cluster performance using AWS management console (diagnose performance issues like which user or query is consuming high resources)
- Use cases:
- Accelerate analytics workloads
- Unified data warehouse & data lake
- Data warehouse modernization
- Analyze global sales data
- Store historical stock trade data
- Analyze advertisement impressions & clicks
- Aggregate gaming data
- Analyze social trends
- New Redshift features for 2020+:
- RA3 nodes with managed storage
- Enable independent scaling of compute and storage
- Redshift data lake export
- Unload Redshift query to S3 in Apache Parquet format
- Parquet is 2x faster to unload and consumes up to 6X less storage
- Compatible with Redshift Spectrum, Athena, EMR, SageMaker
- Automatically partitioned
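- A minimal sketch of the data lake export unload described above, assuming a hypothetical sales table with a sale_date column and reusing the IAM role shown in the COPY examples later in this section:
- unload ('select * from sales where sale_date >= ''2020-01-01''')
to 's3://mybucket/sales/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
format as parquet
partition by (sale_date);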
Redshift Spectrum:
- Query exabytes of unstructured data in S3 without loading
- Limitless concurrency
- Horizontal scaling
- Separate storage (in S3) & compute resources (in Spectrum)
- Wide variety of data formats (Avro, CSV, Grok, Ion, JSON, ORC, Parquet, RCFile, RegexSerDe, SequenceFile, text files, TSVs)
- Support of Gzip and Snappy compression
Redshift Performance:
- Massively Parallel Processing (MPP)
- data and query loads are automatically distributed across all nodes
- adding nodes to the data warehouse is easy
- enables fast query performance as the data warehouse grows, because queries run in parallel across all nodes
- if I need more speed or capacity, just add more nodes to my cluster
- Columnar Data Storage
- Look up data by column
- Columnar data is stored sequentially on the storage media and requires far less IO => improving query performance
- Each data block stores values of a single column from multiple rows
- Redshift transparently converts the data to columnar storage for each of the columns
- Block size of 1 MB
- Column Compression
- Automatically sample the data and select the most appropriate compression scheme when the data is loaded into an empty table
- Compression is a column-level operation that reduces the size of the data when it is stored
- Conserve storage space and reduce data size that is read from the storage
- Reduce amount of disk IO => improves query performance
- Issuing COPY command
- Redshift will automatically analyze and apply compression
- Not possible to change the compression encoding for a column after the table is created
- The ANALYZE COMPRESSION command
- Performs compression analysis and produces a report with the suggested compression encoding for the tables analyzed
- For each column, the report includes an estimate of the potential reduction in disk space compared to the current encoding
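- e.g. a quick sketch (listing is a hypothetical table name; with no table name, the command analyzes every table in the current database):
- analyze compression listing;
- analyze compression;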
Redshift Durability & Scalability:
- Redshift Durability:
- Replication within cluster
- Backup to S3
- Asynchronously replicated to another region
- 3 copies of the data are maintained (the original on the Compute nodes, a replica on other Compute nodes, and a backup in S3)
- Capability to automate snapshots of the data warehouse cluster with a 1 Day retention period by default
- Can extend to 35 days if I want
- Setting it to zero turns off automated backups
- Failed drives automatically replaced
- Redshift rebuilds that failed drive from a replica of the data on that drive which is stored on another drive within that node
- Failed nodes automatically replaced
- Redshift automatically detects the node failure and replaces the failed node in my data warehouse cluster
- Until that replacement node is provisioned and added to the database, the cluster will be unavailable for queries and updates
- The most frequently accessed data is loaded from S3 into the new node first => resume querying that data as quickly as possible
- Single-node clusters do not support data replication because there is nothing to replicate to (in that case, restore the cluster from a snapshot in S3)
- AWS recommends at least 2 nodes in my cluster for production purposes
- Redshift limited to a single AZ
- Restore the cluster from any existing snapshot to a new AZ within the same Region
- How does Redshift scale?
- Vertical (increasing node instance type) and horizontal (increasing the number of nodes) scaling on demand
- requested changes are applied immediately when I modify my data warehouse cluster
- Here is how the process works during scaling:
- A new cluster is created while your old one remains available for reads
- CNAME is flipped to new cluster (a few minutes of downtime)
- Redshift will move data from the Compute nodes in the existing data warehouse cluster in parallel to the Compute nodes in the new cluster
Redshift Distribution Styles:
- When data is loaded into a table, Redshift will distribute the table's rows to the Compute nodes and Slices according to the distribution style that I choose when I created the table
- 2 primary goals of data distribution are:
- to distribute the workload uniformly among the nodes in the cluster
- to minimize data movement during query execution
- 4 different distribution styles:
- AUTO
- When I do not specify a distribution style, Redshift uses AUTO distribution by default
- Redshift figures it out based on size of data
- e.g. Redshift initially assigns ALL distribution to a small table, then changes to EVEN distribution when the table grows larger
- EVEN (image)
- Leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column
- Appropriate when a table doesn't participate in joins
- Appropriate when there isn't a clear choice between KEY and ALL distribution
- KEY (image)
- Rows are distributed according to the values in one column
- Leader node places matching values on the same node slice
- If I distribute a pair of tables on the joining keys, the Leader node collocates the rows on the slices according to the values in the joining columns. This way, matching values from the common columns are physically stored together
- ALL (image)
- Copy of the entire table is distributed to every node
- Multiplies the storage required by the number of nodes in the cluster, and so it takes much longer to load, update, or insert data into multiple tables
- Appropriate only for relatively slow moving tables (tables that are not updated frequently or extensively)
- To view the distribution style applied to a table, query the PG_CLASS_INFO system catalog view or the SVV_TABLE_INFO view
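- A small sketch of choosing a distribution style at table creation time and then checking what was applied (table and column names are hypothetical):
- create table orders (
  order_id bigint,
  customer_id bigint,
  order_total decimal(12,2)
)
diststyle key
distkey (customer_id);
- select "table", diststyle from svv_table_info where "table" = 'orders';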
Redshift Sort Keys:
- Can define one or more of table columns as sort keys
- Rows are stored on disk in sorted order based on the column you designate as a sort key
- Like an index
- Makes for fast range queries
- How to choose the best Sort Key?
- If recent data is queried most frequently => specify timestamp column as the Sort Key
- If I do frequent range filtering or equality filtering on one column => specify that column as the Sort Key
- If I frequently join a table => can specify the join column as both the Sort Key and the Distribution Key
- Single Sort Key:
- Uses a single column to sort the data
- Appropriate if frequently querying for data within a single filter column
- Compound Sort Key:
- COMPOUND is the default sort type
- Made up of all of the columns listed in the sort key definition, in the order they are listed
- Appropriate when a query's filter applies conditions, such as filters and joins, that use a prefix of the sort keys
- Performance benefits of compound sorting decrease when queries depend only on secondary sort columns, without referencing the primary columns
- Interleaved Sort Key
- Gives equal weight to each column, or subset of columns, in the sort key
- Appropriate if multiple queries use different columns for filters
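- A sketch of defining sort keys at table creation time (hypothetical table and columns; the commented line shows the interleaved alternative):
- create table events (
  event_time timestamp,
  user_id bigint,
  event_type varchar(32)
)
diststyle even
compound sortkey (event_time, event_type);
-- interleaved alternative: ... interleaved sortkey (event_time, user_id);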
Importing / Exporting data:
- COPY command (import data)
- Use COPY to load large amounts of data from outside of Redshift
- Parallelized; efficient (read from multiple data files or data streams simultaneously)
- Import data in from S3, EMR, DynamoDB, remote hosts using SSH
- For access control, can use role-based and key-based access control (provides authentication for my cluster to perform load and unload operations)
- Use the COPY command to load a table in parallel from data files on S3. Can specify the files to be loaded by using an S3 object prefix or by using a manifest file
- e.g. using S3 object prefix
- copy <table_name>
from 's3://<bucket_name>/<object_prefix>'
authorization;
- e.g. using manifest file
- copy <table_name>
from 's3://<bucket_name>/<manifest_file>'
authorization
manifest;
- To authenticate using the IAM_ROLE parameter, replace <aws-account-id> and <role-name>
- copy customer
from 's3://mybucket/mydata'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
- To authenticate using IAM user credentials, replace <access-key-id> and <secret-access-key> with an authorized user's access key ID and full secret access key for the ACCESS_KEY_ID and SECRET_ACCESS_KEY parameters
- copy customer
from 's3://mybucket/mydata'
access_key_id '<access-key-id>'
secret_access_key '<secret-access-key>';
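- For reference, the manifest file named above is just a JSON object in S3 listing the files to load (bucket and file names here are hypothetical):
- {
  "entries": [
    {"url": "s3://mybucket/custdata.1", "mandatory": true},
    {"url": "s3://mybucket/custdata.2", "mandatory": true}
  ]
}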
- UNLOAD command (export data)
- Unload from a table into files in S3
- Enhanced VPC routing
- Forces all COPY and UNLOAD traffic between my cluster and my data repositories through my VPC based on the VPC service
- If enhanced VPC routing is not enabled, Redshift routes traffic through the internet, including traffic to other services within the AWS network
- By using enhanced VPC routing, I can use standard VPC features, such as VPC security groups, NACLs, VPC endpoints, VPC endpoint policies, internet gateways, and DNS servers
COPY command More depth:
- If your data is already in Redshift in another table,
- Use INSERT INTO …SELECT
- Or CREATE TABLE AS
- COPY can decrypt data as it is loaded from S3
- Hardware accelerated SSL used to keep it fast
- Gzip, lzop, and bzip2 compression supported to speed it up further
- Automatic compression option
- Analyzes data being loaded and figures out optimal compression scheme for storing it
- Special case: narrow tables (lots of rows, few columns)
- Load with a single COPY transaction if possible
- Otherwise hidden metadata columns consume too much space
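- Going back to the first point in this block, a sketch of both options for data that is already in Redshift (table names are hypothetical):
- insert into sales_2020
select * from sales_staging
where sale_date >= '2020-01-01';
- create table sales_2020_copy as
select * from sales_2020;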
Redshift copy grants for cross-region snapshot copies:
- Let’s say you have a KMS-encrypted Redshift cluster and a snapshot of it
- You want to copy that snapshot to another region for backup
- In the destination AWS region:
- Create a KMS key if you don’t have one already
- Specify a unique name for your snapshot copy grant
- Specify the KMS key ID for which you’re creating the copy grant
- In the source AWS region:
- Enable copying of snapshots to the copy grant you just created
Redshift architecture: (image)
- Leader node:
- Managing communication with client programs and communication with Compute nodes
- Receives all the queries from client applications and parses the queries
- Develops execution plans
- Coordinates the parallel execution of plans with Compute nodes
- Aggregates intermediate results from Compute nodes
- Return those results back to the client application
- A cluster can contain 1 to 128 Compute nodes, depending on the node type
- Compute nodes:
- User data is stored on the Compute nodes
- Execute steps specified in the execution plans
- Send intermediate results to the Leader node for aggregation before they are sent back to the client applications
- Has its own dedicated CPU, memory, and attached disk storage, which are determined by the node type I choose
- 2 types of Node types for Compute node
- Dense Storage node type => create very large data warehouses using HDDs for a very low price point (image)
- Dense Compute node type => create very high performance data warehouses using fast CPUs, large amounts of RAM, and SSDs (image)
- Every Compute node is divided into slices, and a portion of the node's memory and disk space is allocated to each slice
- Number of slices per node is determined by the node size of the cluster
Using Amazon Redshift with other services:
- S3
- Redshift leverages parallel processing to read and load data from multiple data files stored in S3 buckets
- Can also use parallel processing to export data from Redshift data warehouse to multiple data files on S3
- DynamoDB
- Can use the COPY command to load a Redshift table with data from a single DynamoDB table
- EMR/ EC2
- Can use the COPY command in Redshift to load data from one or more remote hosts, such as EMR clusters, EC2 instances, or other computers
- COPY connects to the remote hosts using SSH and executes commands on the remote hosts to generate data
- Redshift supports multiple simultaneous connections. COPY command reads and loads the output from multiple host sources in parallel
- Data Pipeline
- Can use Data Pipeline to automate data movement and transformation into and out of Redshift
- Can schedule and execute recurring jobs without having to write your own complex data transfer or transformation logic
- e.g. can set up a recurring job to automatically copy data from Amazon DynamoDB into Amazon Redshift
- Database Migration Service
- Can migrate data to Amazon Redshift using AWS Database Migration Service
- DMS can migrate your data to and from most widely used commercial and open-source databases such as Oracle, PostgreSQL, Microsoft SQL Server, Amazon Redshift, Aurora, DynamoDB, Amazon S3, MariaDB, and MySQL
Redshift Workload Management (WLM):
- Enables users to flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries
- Creates query queues at runtime according to service classes, which define the configuration parameters for various types of queues, including internal system queues and user-accessible queues
- Easiest way to modify the WLM configuration is by using the Redshift Management Console, Redshift CLI, or Redshift API
Automatic Workload Management:
- With automatic WLM, Redshift manages query concurrency and memory allocation
- Can create up to 8 queues
- Default for manual WLM is concurrency of 5 queries, and memory is divided equally between all 5. Automatic WLM determines the amount of resources that queries need, and adjusts the concurrency based on the workload
- When queries requiring large amounts of resources are in the system (e.g. hash joins between large tables), the concurrency is lower
- When lighter queries (e.g. inserts, deletes, scans, or simple aggregations) are submitted, concurrency is higher
- Can configure the following for each query queue:
- Priority
- Concurrency scaling mode
- When I turn on concurrency scaling, Redshift automatically adds additional cluster capacity to process an increase in read queries
- User groups
- Query groups
- Query monitoring rules
- Define metrics-based performance boundaries for WLM queues and specify what action to take when a query goes beyond those boundaries
- e.g. for a queue dedicated to short running queries, you might create a rule that cancels queries that run for more than 60 seconds
Manual Workload Management:
- With manual WLM, Redshift configures 1 queue with a concurrency level of 5, which enables up to 5 queries to run concurrently, plus one predefined Superuser queue, with a concurrency level of one
- Can define up to 8 queues
- Each queue can be configured with a maximum concurrency level of 50
- Each queue can have defined concurrency scaling mode, concurrency level, user groups, query groups, memory, timeout, query monitoring rules
- Can enable query queue hopping
- Timed out queries “hop” to next queue to try again
Working with Redshift concurrency scaling:
- Can support virtually unlimited concurrent users and concurrent queries, with consistently fast query performance
- When you turn on concurrency scaling, Redshift automatically adds additional cluster capacity to process an increase in read queries
- Users see the most current data, whether the queries run on the main cluster or a concurrency-scaling cluster
- I am charged for concurrency-scaling clusters only for the time they're actively running queries
- Can manage which queries are sent to the concurrency-scaling cluster by configuring WLM queues
Short Query Acceleration (SQA):
- Prioritize short running queries over longer running ones
- Short queries run in a dedicated space, won’t wait in queue behind long queries
- Can be used in place of WLM queues for short queries
- Works with:
- CREATE TABLE AS (CTAS)
- Read only queries (SELECT statements)
- Uses machine learning to predict a query’s execution time
- Can configure how many seconds is “short”
- SQA is an alternative to WLM to accelerate short queries
Resizing Redshift Clusters:
- Elastic resize:
- Quickly add or remove nodes of same type
- Can change node types, but not without dropping connections; it creates a whole new cluster
- Cluster is down for a few minutes
- Tries to keep connections open across the downtime
- Limited to doubling or halving for some DC2 and RA3 node types
- Classic resize:
- Change node type and/or number of nodes
- Cluster is read only for hours to days
- Snapshot, restore, resize:
- Used to keep cluster available during a classic resize
- Copy cluster, resize new cluster
VACUUM command:
- Re-sorts rows and reclaims space in either a specified table or all tables in the current database
- 4 options
- VACUUM FULL
- VACUUM FULL is the default
- Sorts the specified table (or all tables in the current database) and reclaims disk space occupied by rows that were marked for deletion by previous UPDATE and DELETE operations
- VACUUM DELETE ONLY
- Reclaims disk space occupied by rows that were marked for deletion by previous UPDATE and DELETE operations, and compacts the table to free up the consumed space
- Doesn't sort table data
- VACUUM SORT ONLY
- Sorts the specified table (or all tables in the current database) without reclaiming space freed by deleted rows
- VACUUM REINDEX
- Analyzes the distribution of the values in interleaved sort key columns, then performs a full VACUUM operation
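- e.g. (sales and listing are hypothetical table names; VACUUM with no table name operates on every table in the current database):
- vacuum full sales;
- vacuum sort only sales;
- vacuum delete only sales;
- vacuum reindex listing;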
Redshift anti-patterns:
- Small data sets
- Use RDS instead (Redshift is meant for massive data and highly scalable)
- OLTP
- Use RDS or DynamoDB instead
- Unstructured data
- BLOB data
- Store references to large binary files in S3, not the files themselves.
Redshift security concerns:
- Using a Hardware Security Module (HSM):
- Must use a client and server certificate to configure a trusted connection between Redshift and the HSM
- If migrating an unencrypted cluster to an HSM-encrypted cluster, you must create the new encrypted cluster and then move data to it
- Defining access privileges for user or group
- Use the GRANT or REVOKE commands in SQL
- e.g. grant select on table foo to bob;
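- e.g. granting a group read access and revoking it again (hypothetical table and group names):
- grant select on table sales to group analysts;
- revoke select on table sales from group analysts;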
Amazon Athena:
- Interactive query service for S3 (SQL)
- No need to load data, it stays in S3
- Presto under the hood
- Serverless
- Supports many data formats
- CSV (human readable)
- JSON (human readable)
- ORC (columnar, splittable)
- Parquet (columnar, splittable)
- Avro (splittable)
- Unstructured, semi-structured, or structured
- Use cases:
- Ad-hoc queries of web logs
- Querying staging data before loading to Redshift
- Analyze CloudTrail / CloudFront / VPC / ELB etc logs in S3
- Integration with Jupyter, Zeppelin, RStudio notebooks
- Integration with QuickSight
- Integration via ODBC / JDBC with other visualization tools
Athena + Glue Integration: (image)
- A Glue crawler populates the Glue Data Catalog for my S3 data: it looks at what is stored in S3 and tries to extract columns and table definitions out of it
- Can use Glue console to refine that definition as needed
- Once the Glue Data Catalog is published for my S3 data, Athena sees it automatically and can build a table from it for querying
- This table definition is also exposed to any application compatible with the Apache Hive Metastore, e.g. RDS, Redshift, Redshift Spectrum, EMR (the Glue Data Catalog can be used as a Hive Metastore)
- Cross-region concerns
- Athena cannot query across regions on its own
- BUT a Glue Crawler can
- So, I can query S3 data across regions if I query a Glue Data Catalog in the same region as Athena
- And, the Glue Crawler that created the data catalog spanned multiple regions
Athena Workgroups:
- Can organize users / teams / apps / workloads into Workgroups
- Can control query access and track costs by Workgroup
- Integrates with IAM, CloudWatch, SNS
- Each workgroup can have its own:
- Query history
- Data limits (can limit how much data queries may scan by workgroup)
- IAM policies
- Encryption settings
Athena cost model:
- Pay as you go
- $5 per TB scanned
- Successful or cancelled queries count, failed queries do not
- No charge for DDL (CREATE/ALTER/DROP etc)
- Save LOTS of money by using columnar formats
- ORC, Parquet
- Save 30-90%, and get better performance
- Glue and S3 have their own charges
- Partitioning S3 data (by date, hour, etc.) can also help Athena reduce costs; queries restricted to a given partition scan less data
Athena Security:
- Access control
- IAM, ACLs, S3 bucket policies
- AmazonAthenaFullAccess / AWSQuicksightAthenaAccess
- Encrypt results at rest in S3 staging directory
- Server side encryption with S3 managed key (SSE-S3)
- Server side encryption with KMS key (SSE-KMS)
- Client side encryption with KMS key (CSE-KMS)
- Cross account access in S3 bucket policy possible
- Transport Layer Security (TLS) encrypts in transit (between Athena and S3)
Athena anti-patterns:
- Highly formatted reports / visualization (that is what QuickSight is for)
- ETL (use Glue instead)
Athena Optimizing Performance:
- Use columnar data (ORC, Parquet)
- Small number of large files performs better than large number of small files
- Use partitions
- If adding partitions after the fact, use MSCK REPAIR TABLE command
- The MSCK REPAIR TABLE command scans a file system such as Amazon S3 for Hive compatible partitions that were added to the file system after the table was created. MSCK REPAIR TABLE compares the partitions in the table metadata and the partitions in S3. If new partitions are present in the S3 location that you specified when you created the table, it adds those partitions to the metadata and to the Athena table.
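- A sketch of a partitioned external table plus the repair command, assuming hypothetical names and Hive-compatible prefixes such as s3://mybucket/logs/year=2020/month=01/:
- create external table logs (
  request_time string,
  status int
)
partitioned by (year string, month string)
stored as parquet
location 's3://mybucket/logs/';
- msck repair table logs;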
Amazon Elasticsearch Service:
- Fully managed
- but not Serverless (still have to decide how many servers I want in my Elasticsearch cluster)
- Scale up or down without downtime
- But this is not automatic
- Pay for what you use
- Instance hours, storage, data transfer
- High level of network isolation (achieve with VPC)
- High level of data security
- data at rest and in transit encryption using KMS
- can manage authentication and access control using Cognito and IAM policies
- AWS integration
- S3 buckets (via Lambda to Kinesis)
- Kinesis Data Streams
- DynamoDB Streams
- CloudWatch / CloudTrail (operational monitoring)
- Zone awareness
- Allocate nodes in Elasticsearch service cluster across Multi-AZs in the same Region
- Increase HA but increase latency
- IoT (sending data into Elasticsearch from devices)
Introduction to Elasticsearch:
- The Elastic Stack (Elasticsearch, Kibana, Beats, and Logstash)
- A search engine (scalable version of Apache Lucene as distributed horizontally across nodes in a cluster)
- Can send data in the form of JSON documents to Elasticsearch using the API or ingestion tools such as Logstash and Amazon Kinesis Firehose. Elasticsearch automatically stores the original document and adds a searchable reference to the document in the cluster’s index
- Can then search and retrieve the document using the Elasticsearch API
- An analysis tool
- A visualization tool
- Can use Kibana, a visualization tool, with Elasticsearch to visualize your data and build interactive dashboards
- A data pipeline
- Use Beats/ LogStash to stream data into Elasticsearch cluster
- Beats send data from machines and systems to Logstash or Elasticsearch
- Logstash is a server‑side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a "stash" like Elasticsearch
- Can use Kinesis too
- Horizontally scalable
- Use cases:
- Full text search (e.g. search on Wikipedia)
- Log analytics
- Application monitoring (visualize what is happening in real-time as data comes in from monitoring my servers)
- Security analytics
- Clickstream analytics
- Business analytics
- Operational intelligence
Elasticsearch Document, Index, Shard, Node:
- Elasticsearch document, type, indices concepts (image)
- An Index is split into Shards within a Node of Cluster (image)
- Primary Shards and Replicas for Redundancy (image)
- Shards (blue), Node (Rectangle)
Amazon Elasticsearch Service Launch Options:
- Dedicated master node(s) => choice of count and instance types
- Domains => collection of all the resources or configurations needed to run the Elasticsearch cluster
- Enable automatic snapshots to S3 for data backup purposes
- Zone awareness
Amazon Elasticsearch Service Performance:
- Memory pressure in the JVM can result if:
- I have unbalanced shard allocations across nodes
- I have too many shards in a cluster
- Fewer shards can yield better performance if JVMMemoryPressure errors are encountered
- By deleting old or unused indices
Amazon Elasticsearch Service Security:
- Resource-based policies (can attach to the service domain; determine what actions a principal (a user, account, or role that can be granted access) can take on the Elasticsearch APIs)
- Identity based policies
- attached to an IAM user, group, or role. These policies let me specify what that identity can do (its permissions)
- IP based policies
- restrict access to a domain to one or more IP addresses or CIDR blocks
- Request signing (all requests must be signed; requests sent from the AWS SDKs to Elasticsearch are digitally signed for me)
- VPC (host Elasticsearch cluster within a VPC)
- Cognito (for Kibana)
Securing Kibana:
- Cognito
- Log into Kibana through an enterprise identity provider (e.g. Microsoft Active Directory) using SAML 2.0, or through social identity providers (e.g. Google, Facebook, Amazon)
- e.g. users sign in with their Facebook account and that grants them access to Kibana inside the VPC
- Or using Cognito user pools (a user directory in Cognito. With a user pool, users can sign in to web or mobile app through Cognito. Users can also sign in through social identity providers like Google, Facebook, Amazon, or Apple, and through SAML identity providers)
- Alternative to get inside VPC
- Nginx reverse proxy on EC2 forwarding to Elasticsearch domain (image)
- SSH tunnel for port 5601
- VPC Direct Connect
- VPN
Amazon Elasticsearch Service Anti-patterns:
- OLTP
- No transactions
- RDS or DynamoDB is better
- Ad-hoc data querying
- Remember Amazon Elasticsearch Service is primarily for search & analytics
Kinesis Data Analytics:
- Receives data from either a Kinesis Data Stream or a Kinesis Data Firehose delivery stream, plus optional reference data (JSON, CSV) in S3
- Can set up windows of time that I can look back on and aggregate and analyze data across
- Can write SQL to analyze the data and output the results to another stream, which ultimately might go to analytics tools
- Use cases:
- Streaming ETL (e.g. application reads IoT sensor data stored in Kinesis Data Stream, organize data by sensor type, remove duplicate data, normalize data, deliver processed data to S3)
- Continuous metric generation
- e.g. Live leader board for mobile game by computing the top players every minute and send that data to DynamoDB
- e.g. track traffic to a website by calculating the number of unique visitors every 5 minutes and send processed data to Redshift for analysis
- Responsive analytics
- e.g. application computing the availability or success rate of a customer-facing API over time and sending that data to CloudWatch. Have another application look for events that meet certain criteria and automatically notify the right customers using Kinesis Data Streams and SNS
- Pay only for resources consumed (but it’s not cheap)
- Serverless ; scales automatically
- Use IAM permissions to access streaming source and destination(s)
- Schema discovery feature (can generate a schema from either the data in a stream or data in a static file that is stored in S3 bucket)
- RANDOM_CUT_FOREST function:
- Detects anomalies in my data stream. A record is an anomaly if it is distant from other records
- SQL function used for anomaly detection on numeric columns in a stream
- Identify outliers in a data set
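- A sketch of the typical usage pattern, assuming the default in-application input stream name SOURCE_SQL_STREAM_001 and a hypothetical numeric column heart_rate:
- create or replace stream "DESTINATION_SQL_STREAM" ("heart_rate" integer, "ANOMALY_SCORE" double);
- create or replace pump "STREAM_PUMP" as
insert into "DESTINATION_SQL_STREAM"
select stream "heart_rate", "ANOMALY_SCORE"
from table(random_cut_forest(
  cursor(select stream * from "SOURCE_SQL_STREAM_001")
));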
3 main parts: (image)
- Input and source data:
- Can be either Kinesis Data Streams or Kinesis Data Firehose streams feeding the in-application input streams
- Can also optionally configure a reference data source to enrich my input data stream within the application (results in a reference table that I can refer to / look up within the SQL of my analytics job)
- Store my reference data as an object in an S3 bucket. Kinesis Data Analytics will read the S3 object and create a reference table for me to refer to
- Inexpensive way to “join” data for quick lookups (e.g. look up the city associated with a zip code)
- Real-time analytics using SQL
- Processed data going out to the Output Streams
- Either be Kinesis Data Stream or Kinesis Data Firehose
- From Kinesis Data Firehose streams, it can go off to S3 or Redshift
- AWS Lambda can be a destination as well of the Output Streams
- Allows lots of flexibility for post processing (Aggregating rows, Translating to different formats, Transforming and enriching data, Encryption)
- Opens up access to other services & destinations (S3, DynamoDB, Aurora, Redshift, SNS, SQS, CloudWatch)
- If any errors are encountered, those are sent out to the Error Stream
- Kinesis Data Analytics will automatically provide an in-application error stream for every application
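- A sketch of the windowed SQL analytics described above, using a one-minute tumbling window (sensor_id is a hypothetical column; SOURCE_SQL_STREAM_001 is the default input stream name):
- create or replace stream "DESTINATION_SQL_STREAM" ("sensor_id" varchar(16), "reading_count" integer);
- create or replace pump "AGG_PUMP" as
insert into "DESTINATION_SQL_STREAM"
select stream "sensor_id", count(*) as "reading_count"
from "SOURCE_SQL_STREAM_001"
group by "sensor_id",
         step("SOURCE_SQL_STREAM_001".ROWTIME by interval '60' second);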
What Is Amazon Kinesis Data Analytics for Apache Flink? (image)
- Enables me to author and run code against streaming sources to perform time-series analytics, feed real-time dashboards, and create real-time metrics
- Can use Java, Scala, or SQL to process and analyze streaming data
- Flink is a framework for processing data streams
- Kinesis Data Analytics integrates Flink with AWS
- Flink applications primarily use either the DataStream API or the Table API
- Instead of using SQL, I can develop my own Flink application from scratch and load it into Kinesis Data Analytics via S3
- Serverless
What is RDS?
- Hosted relational database
- Aurora, MySQL, PostgreSQL, MariaDB, Oracle, SQL Server
- Not for “big data”
- RDS databases offer full ACID compliance
- Atomicity (ensures that either the transaction as a whole is successfully executed, or, if part of a transaction fails, the entire transaction is invalidated)
- Consistency (ensures that data written into the database as part of the transaction must adhere to all defined rules and restrictions, including constraints, cascades, and triggers)
- Isolation (ensures that each transaction is independent in itself)
- Durability (ensures that all of the changes made to the database are permanent)
Amazon Aurora:
- MySQL and PostgreSQL compatible
- Up to 5X faster than MySQL, 3X faster than PostgreSQL
- 1/10 the cost of commercial databases
- Up to 64TB per database instance
- Up to 15 read replicas
- Continuous backup to S3
- Replication across availability zones
- Automatic scaling with Aurora Serverless
- Aurora Security:
- VPC network isolation
- At rest with KMS
- Data, backup, snapshots, and replicas can be encrypted
- In-transit with SSL