Leader node - query planning and results aggregation
Compute node - performs queries and sends results to the leader
Automated: takes incremental snapshots that track changes to the cluster since the previous automated snapshot; by default a snapshot is taken every 8 hours or after every 5 GB of data change per node, but you can modify this schedule; the retention period is 1 day by default and can be raised to a maximum of 35 days; to keep an automated snapshot for longer, create a copy of it as a manual snapshot (sketch below)
Manual: retained until you manually delete it or until the end of its retention period
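A minimal boto3 sketch of both operations above; the cluster name, snapshot identifiers, and retention values are placeholders, not values from these notes.

```python
import boto3

redshift = boto3.client("redshift")

# Raise the automated snapshot retention period (1 day by default, 35 days max)
redshift.modify_cluster(
    ClusterIdentifier="example-cluster",
    AutomatedSnapshotRetentionPeriod=35,
)

# Keep an automated snapshot longer by copying it to a manual snapshot
redshift.copy_cluster_snapshot(
    SourceSnapshotIdentifier="rs:example-cluster-2024-01-01-00-00-00",
    TargetSnapshotIdentifier="example-cluster-manual-copy",
    ManualSnapshotRetentionPeriod=90,  # days; omit to retain until manually deleted
)
```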
Source Region: create an encrypted snapshot with KMS Key A
Snapshot copy grant: enables Redshift to encrypt the copied snapshots with a KMS key in the Destination Region
Source Region: copy to the Destination Region (sketch below)
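A minimal boto3 sketch of the grant plus cross-Region copy setup; the Regions, cluster name, grant name, and KMS key ARN are placeholders.

```python
import boto3

# Create the snapshot copy grant for a KMS key in the destination Region
destination = boto3.client("redshift", region_name="eu-west-1")
destination.create_snapshot_copy_grant(
    SnapshotCopyGrantName="example-copy-grant",
    KmsKeyId="arn:aws:kms:eu-west-1:123456789012:key/EXAMPLE-KEY-ID",
)

# Enable cross-Region snapshot copy on the source cluster, referencing the grant
source = boto3.client("redshift", region_name="us-east-1")
source.enable_snapshot_copy(
    ClusterIdentifier="example-cluster",
    DestinationRegion="eu-west-1",
    SnapshotCopyGrantName="example-copy-grant",
    RetentionPeriod=7,  # days copied automated snapshots are kept in the destination
)
```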
Columnar storage of data (instead of row-based)
Massively Parallel Query Execution (MPP): distributes queries across many nodes of the cluster
Workload Management (WLM): define multiple queues with different priorities and route queries to the appropriate queue
Automatic WLM - queues and resources managed by Redshift
Manual WLM - queues and resources managed by the user
Load data from S3, Kinesis Firehose, DynamoDB, DMS, ... (COPY sketch below)
Enhanced VPC Routing: COPY / UNLOAD traffic goes through the VPC
Redshift Spectrum: query data in S3 without loading it into the cluster (the query is submitted to thousands of Redshift Spectrum nodes)
Concurrency Scaling:
Automatically adds additional cluster capacity
Supports both read and write SQL statements
Ability to decide which queries are sent to the Concurrency Scaling cluster using WLM
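A minimal sketch of loading data from S3 with COPY, issued through the Redshift Data API; the cluster, database, user, table, bucket, and IAM role are placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

copy_sql = """
    COPY sales
    FROM 's3://example-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
    FORMAT AS PARQUET
"""

resp = redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)
print(resp["Id"])  # statement ID; poll describe_statement() to check completion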
Loading DynamoDB data into the Hadoop Distributed File System (HDFS) and using it as input to an Amazon EMR cluster
Querying live DynamoDB data using SQL-like statements (HiveQL) - see the mapping sketch below
Joining data stored in DynamoDB and exporting it, or querying against the joined data
Exporting data stored in DynamoDB to Amazon S3 | DynamoDB -> S3
Importing data stored in Amazon S3 into DynamoDB | S3 -> DynamoDB
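A hedged HiveQL sketch (kept as a Python string to match the other examples) of how a DynamoDB table is typically mapped into Hive on EMR; the table, columns, and attribute names are placeholders.

```python
# HiveQL to run on the EMR cluster; follows the documented
# DynamoDBStorageHandler pattern, with placeholder table/column names.
dynamodb_hive_mapping = """
CREATE EXTERNAL TABLE ddb_orders (order_id string, amount double)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
    "dynamodb.table.name" = "Orders",
    "dynamodb.column.mapping" = "order_id:OrderId,amount:Amount"
);

-- Query live DynamoDB data (or export it to an S3-backed external table)
SELECT order_id, amount FROM ddb_orders WHERE amount > 100;
"""
```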
Add steps: up to 256 pending/running steps per cluster (sketch below)
Interactively submit Hadoop jobs on the primary node: even if you already have 256 pending/running steps, you can still submit jobs interactively on the primary node
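A minimal boto3 sketch of adding a step to a running cluster; the cluster ID, step name, and script location are placeholders.

```python
import boto3

emr = boto3.client("emr")

response = emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTERID",
    Steps=[
        {
            "Name": "example-spark-step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # Run a PySpark script stored in S3
                "Args": ["spark-submit", "s3://example-bucket/scripts/job.py"],
            },
        }
    ],
)
print(response["StepIds"])
```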
HDFS - Hadoop Distributed File System, used for temporary storage while processing; provides high performance; backed by EBS volumes (local filesystem) or instance store
EMRFS - native S3 integration; S3-backed permanent storage with server-side encryption
DynamoDB - integrated with Apache Hive
Primary node: manages the cluster, coordinates, manages health – long running
Core node: runs tasks and stores data – long running
Task node (optional): just runs tasks – usually Spot
Purchasing options:
On-demand: reliable, predictable, won't be terminated
Reserved (min 1 year): cost savings (EMR will automatically use them if available)
Spot Instances: cheaper, can be terminated, less reliable
An EMR cluster is launched in a VPC and in a single AZ
Long-running vs. transient (temporary) clusters
When you choose an instance type using the AWS Management Console, the number of vCPUs shown for each instance type is the number of YARN vCores for that instance type, not the number of EC2 vCPUs
Instance fleets and instance groups cannot coexist in a cluster
Instance groups: a single instance type and purchasing option (On-Demand vs. Spot) per group (sketch below)
Has auto scaling
Up to 48 task instance groups
The configuration you choose applies to all nodes in the group and for the lifetime of the cluster
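A minimal boto3 sketch of launching a cluster with uniform instance groups (one instance type and market per group, Spot for the task group); names, types, counts, and roles are placeholders.

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="example-instance-groups-cluster",
    ReleaseLabel="emr-6.15.0",
    Instances={
        "InstanceGroups": [
            {"Name": "primary", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```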
Instance fleets: set a target capacity and mix instance types and purchasing options (# of On-Demand units vs. # of Spot units) - sketch below
No custom auto scaling
AWS Console: up to 5 instance types per node type
CLI / API: up to 30 instance types per node type
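A sketch of the equivalent Instances configuration using instance fleets; it replaces the InstanceGroups block in the previous example, since the two cannot coexist. Capacities and instance types are placeholders.

```python
# Passed as the Instances parameter of emr.run_job_flow(...)
instance_fleets_config = {
    "InstanceFleets": [
        {"Name": "primary", "InstanceFleetType": "MASTER",
         "TargetOnDemandCapacity": 1,
         "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
        {"Name": "core", "InstanceFleetType": "CORE",
         # Mix On-Demand and Spot units across several instance types
         "TargetOnDemandCapacity": 2, "TargetSpotCapacity": 2,
         "InstanceTypeConfigs": [
             {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
             {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2},
         ]},
    ],
    "KeepJobFlowAliveWhenNoSteps": True,
}
```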
RANDOM_CUT_FOREST
SQL function used for anomaly detection on numeric columns in a stream (sketch below)
Adapts over time: it only uses recent history to compute the model
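A hedged sketch (carried as a Python string, to match the other examples) of how RANDOM_CUT_FOREST is typically used in the legacy Kinesis Data Analytics SQL dialect; stream and column names are placeholders.

```python
# In-application SQL: each output row gets an ANOMALY_SCORE computed by the
# random cut forest model over the incoming numeric columns.
anomaly_detection_sql = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "metric_value"  DOUBLE,
    "ANOMALY_SCORE" DOUBLE);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "metric_value", "ANOMALY_SCORE"
    FROM TABLE(RANDOM_CUT_FOREST(
        CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001")));
"""
```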
HOTSPOTS
Locates and returns information about relatively dense regions in your data
Default 24 hours; can be extended up to 7 days; long-term retention up to 365 days
In an exam scenario referring to a serverless SQL solution, go for Athena
You can use Athena to query logs stored in S3 buckets
Commonly used with QuickSight
Use Apache Parquet or ORC: columnar data for cost savings (less data scanned); use Glue to convert your data to Parquet or ORC
Compress data for smaller retrievals
Partition datasets in S3 for easy querying on virtual columns (query sketch below)
Example: s3://athena-examples/flight/parquet/year=1991/month=1/day=1/
Use larger files (> 128 MB): retrieving fewer, larger files is more efficient than retrieving many small files
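A minimal boto3 sketch of an Athena query that prunes partitions like those in the example path; the database, table, columns, and results bucket are placeholders.

```python
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT origin, dest, dep_delay
        FROM flights
        WHERE year = 1991 AND month = 1 AND day = 1  -- partition pruning limits the scan
    """,
    QueryExecutionContext={"Database": "example_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution() for status
```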
Allows you to run SQL queries across data stored in relational, non-relational, object storage, and custom data sources (in AWS and on-premises)
Uses Data Source Connectors that run on AWS Lambda to run Federated Queries (e.g., CloudWatch Logs, DynamoDB, RDS, Aurora, ElastiCache, DocumentDB, HBase on EMR, ...)
Results are stored in S3
Data cleaning steps in DataBrew are stored as a recipe
A recipe is connected to a project by default
An existing recipe with no associated project can also be applied to a project and its datasets
A visual data preparation tool for cleaning and normalizing data to prepare it for analytics and ML
Explore data in columns with 40+ quality statistics to find anomalies/patterns
250+ ready-made transformations (e.g., filtering anomalies, converting data, correcting invalid values, ...)
Data sources include S3, Redshift, Aurora, the Glue Data Catalog, ...
Visual ETL - generates ETL code in Python or Scala (script sketch below)
Notebook
Script editor
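A minimal sketch of the kind of PySpark script Glue generates and runs as a job; it relies on the awsglue libraries available inside a Glue job (not locally), and the job, database, table, and bucket names are placeholders.

```python
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init("example-etl-job")

# Read from the Glue Data Catalog, write back to S3 as Parquet
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="raw_events"
)
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/events/"},
    format="parquet",
)
job.commit()
```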
A fully managed Business Intelligence (BI) data visualization service
It allows you to create dashboards and share them within your company
The first step of building a data lake is ingesting data from a variety of sources
Catalog the data
The data is then enriched, combined, and cleaned before analysis
This makes it easy to discover and analyse the data with direct queries, visualisation, and machine learning (ML)
AWS Glue, Amazon Athena, Amazon Redshift, Amazon QuickSight
Amazon EMR using Zeppelin notebooks with Apache Spark
Enables users to search, subscribe to, and use third-party data
Provides a central catalog where:
Providers publish their data products
Subscribers can search for and subscribe to data products
The Open Data on AWS program provides publicly available datasets that can be accessed with or without an AWS account
Data grant - the unit of exchange in AWS Data Exchange, created by a data sender (provider) to grant a data receiver (subscriber) access to a data set
The data sender creates a data grant; a grant request is sent to the data receiver, who can then accept it to gain access
Made up of: a data set, data grant details, and recipient access details
Data set types: Files, API, Redshift, S3, AWS Lake Formation
Product - the unit of exchange in AWS Marketplace, published by a provider and made available for use by subscribers
AWS Data Exchange API - use the API operations to create, view, update, and delete data sets and revisions (sketch below)
AWS Marketplace Catalog API - use the API operations to view and update data products published to AWS Marketplace
A product has details, offers, and data sets
Receiver (Subscriber):
All data products on AWS Data Exchange are subscription-based
If a data provider decides to unpublish a data product, you will still have access to the data sets as long as your subscription to that product is active
Data providers may require you to verify your subscription and provide additional information before you can access their products
If your AWS account is part of an organization, you can share your AWS Data Exchange product licenses with the other accounts in that organization
Sender (Provider):
Can give access to products that are not publicly released
An offer is created to make a data product available
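A minimal provider-side boto3 sketch of the AWS Data Exchange API, creating a data set and a revision to hold assets; the names, description, and the S3_SNAPSHOT asset type are placeholders/assumptions.

```python
import boto3

dx = boto3.client("dataexchange")

data_set = dx.create_data_set(
    AssetType="S3_SNAPSHOT",  # file-based data delivered via S3
    Name="example-weather-data",
    Description="Example data set; names are placeholders",
)

revision = dx.create_revision(
    DataSetId=data_set["Id"],
    Comment="Initial revision",
)
print(revision["Id"])
```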
S3 – AWS Data Exchange allows providers to import and store data files from their S3 buckets or to directly provide access to and use of S3 buckets
API Gateway – Data recipients can call the API programmatically, call the API from the AWS Data Exchange console, or download the OpenAPI specification
Redshift – Data recipients can get read-only access to query the data in Amazon Redshift without extracting, transforming, and loading it
AWS Marketplace – AWS Data Exchange allows data sets to be published as products in AWS Marketplace
AWS Lake Formation – Data recipients get access to data stored in an AWS Lake Formation data lake and can query, transform, and share access to this data from their own AWS Lake Formation
Open source: Flink, Beam, and Zeppelin notebooks
Real-time querying and analysis of streaming data
Java, Scala, Python, and SQL
Sources: Kinesis Data Streams, Amazon MSK
Sinks: Kinesis Data Streams, Kinesis Data Firehose, S3, DynamoDB, OpenSearch, CloudWatch, Glue Schema Registry, custom connectors
Provisions and configures your Flink clusters
Orchestrates Flink job management (deployment sketch at the end of this section)
Apache Flink APIs and Apache Flink Studio notebooks
Serverless
Low latency
High throughput
Exactly-once processing
Stateful processing - stores state (previous and in-progress computations) in running application storage
Durable application backups via checkpoints and snapshots
Streaming ETL
Continuous metric generation
Responsive real-time analytics - real-time alarms when metrics reach thresholds or when your application detects anomalies
Interactive analysis of data streams - stream data exploration in real time
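A minimal deployment sketch via boto3, assuming these notes describe Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics for Apache Flink); the application name, runtime version, role ARN, bucket, and jar key are placeholders.

```python
import boto3

kda = boto3.client("kinesisanalyticsv2")

# The service provisions the Flink cluster and runs the packaged job from S3
kda.create_application(
    ApplicationName="example-flink-app",
    RuntimeEnvironment="FLINK-1_15",
    ServiceExecutionRole="arn:aws:iam::123456789012:role/example-flink-role",
    ApplicationConfiguration={
        "ApplicationCodeConfiguration": {
            "CodeContent": {
                "S3ContentLocation": {
                    "BucketARN": "arn:aws:s3:::example-bucket",
                    "FileKey": "jars/example-flink-app.jar",
                }
            },
            "CodeContentType": "ZIPFILE",
        }
    },
)
```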