AWS Certified Data Analytics - Specialty (DAS-C01)
Collection
Real Time
- Immediate actions
Kinesis Data Streams (KDS)
Stream big data in your systems
Architecture notes
Capacity Mode notes
Provisioned
On-demand
Security notes
Kinesis Producers
Kinesis SDK
PutRecords notes
Kinesis Producer Library (KPL)
Synchronous or Asynchronous
No compression by default; the end user needs to implement it
Kinesis Agent
Kinesis Consumers
SDK - GetRecords
each shard has 2 MB/s total aggregate read throughput
GetRecords returns up to 10 MB of data or up to 10,000 records per call
if 5 consumer applications consume from the same shard, each consumer can poll once per second and receive less than 400 KB/s
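A minimal boto3 polling sketch of these limits (stream name and shard ID are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis")

# Get an iterator for one shard (names here are placeholders).
iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

# Each call returns up to 10 MB or 10,000 records; the 2 MB/s per-shard
# read limit is shared by all consumers polling this shard.
resp = kinesis.get_records(ShardIterator=iterator, Limit=10000)
for record in resp["Records"]:
    print(record["SequenceNumber"], record["Data"])
iterator = resp["NextShardIterator"]  # pass this to the next get_records call
```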
Kinesis Client Library (KCL)
Java, but other languages exist too
Leverages DynamoDB for coordination and checkpointing
if you get ExpiredIteratorException => increase the WCU of the DynamoDB checkpoint table
Kinesis Connector Library
connect to
S3
DynamoDB
Redshift
OpenSearch
Lambda
can send the data to S3, DynamoDB, Redshift, etc.
can be used to trigger notifications or send emails
Shard
1 MB/s in, per shard, for producers
2 MB/s out, per shard, for consumers
Operations
adding shards (shard splitting)
the old shard is closed and will be deleted once its data expires
Merging shards - decreasing shards
decreases the stream capacity
old shards are closed and deleted once their data expires
Auto scaling
it takes time and has limitations
shard ordering for the exam :check:
after a reshard, the parent shard may still contain records; the consumer could otherwise read the child shards out of order, so the consumer has logic to not read from child shards while the parent still has unread records — this preserves the record ordering
Handling duplicates for producers
Due to
network timeouts
the producer doesn't realize the record arrived (no error is returned), so it retries on the missing response, and the record ends up duplicated in Kinesis
for consumers
fixes
make your consumer app idempotent (see the sketch below)
if the final destination can handle duplicates, it's recommended to handle them there
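A minimal sketch of an idempotent consumer, assuming producers embed a unique id field in each record (the id field and handle() are hypothetical):

```python
import json

seen_ids = set()  # in production this would be a persistent store (e.g., DynamoDB)

def process(raw_record: bytes) -> None:
    """Skip records already processed, so producer retries cause no double work."""
    record = json.loads(raw_record)
    if record["id"] in seen_ids:  # duplicate caused by a producer retry
        return
    seen_ids.add(record["id"])
    handle(record)

def handle(record: dict) -> None:
    # hypothetical business logic
    print("processing", record["id"])
```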
Simple Queue Service (SQS)
IoT
Kinesis Data Firehose
near real time (60 seconds latency)
load into
redshift/s3/opensearch/splunk
data transformations through Lambda
pay for the amount of data going through firehose
Firehose buffer sizing
Time: e.g., with a 2-minute interval, the buffer is flushed every 2 minutes
Size: e.g., 32 MB; if that buffer size is reached, it's flushed
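A boto3 sketch of setting those buffering hints when creating a delivery stream (stream name, role, and bucket ARNs are placeholders):

```python
import boto3

firehose = boto3.client("firehose")

# The buffer flushes on whichever threshold is hit first: 32 MB or 120 seconds.
firehose.create_delivery_stream(
    DeliveryStreamName="my-delivery-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-destination-bucket",          # placeholder
        "BufferingHints": {"SizeInMBs": 32, "IntervalInSeconds": 120},
    },
)
```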
Kinesis Data Streams vs Firehose
Streams
Going to write custom code (producer/consumer)
real time (~200 ms latency, or ~70 ms with enhanced fan-out)
Must manage scaling (shard splitting/merging)
Data storage for 1 to 365 days, replay capability, multi consumers
Use with Lambda to insert data in real-time to OpenSearch (for example)
Firehose
Fully managed, send to S3, Splunk, Redshift, OpenSearch
Serverless data transformations with Lambda
Near real time (lowest buffer time is 1 minute)
Automated Scaling
No data storage
Near Real Time
- Reactive Actions
Kinesis Data Firehose (KDF)
Database Migration Service (DMS)
Batch
Historical
Snowball
Data Pipeline
DMS - Database Migration Service
sources
on-premises and EC2 instance databases: Oracle, PostgreSQL, etc.
Azure
RDS, Aurora
S3
DocumentDB
targets
on-premises databases
OpenSearch
Redshift
any database on AWS
AWS Schema conversion tool (SCT)
SCT is for heterogeneous schemas (different engines); that is, for PostgreSQL to PostgreSQL it isn't needed
Continuous Replication
Direct Connect (DX)
USE CASES
increased bandwidth
more consistent network experience
hybrid environments (on-prem + cloud)
Time: if the question asks for the fastest option to get started and Direct Connect isn't already configured, it can't be the answer, because it takes around a month to be ready
Resiliency
you can have high resiliency; good for critical workloads
AWS Snow Family
Highly-secure, portable devices to collect and process data at the edge, and migrate data into and out of AWS
Snowcone
Up to 24 TB, online and offline
Snowball Edge
Up to petabytes, offline
Snowmobile
Up to exabytes, offline
Amazon MSK - Kafka
The main difference from Kinesis Data Streams is that MSK can be configured to transmit messages larger than 1 MB; Kinesis Data Streams only allows 1 MB messages
Can create custom configurations for your clusters
Default message size of 1MB
Possibilities of sending large messages (ex: 10MB) into Kafka after custom configuration
Security
TLS between brokers and between clients and brokers; you can disable this setting if you want, e.g., for performance purposes
Has more than one authentication method besides IAM
Has a Serverless version
Storage
S3
Buckets
Objects have a key
S3 looks like a global service but buckets are created in a region
Buckets are defined at the region level
Buckets must have a globally unique name (across all regions all accounts)
Amazon S3 allows people to store objects (files) in “buckets” (directories)
Security
JSON policies
Replication
CRR - Cross region replication
SRR - Same Region Replication
Important
After activating replication, only new objects are replicated. For existing objects, use Batch Replication
Storage Classes
Standard
General Purpose
used for frequently accessed data
big data analytics
mobile and gaming apps
content distribution
Infrequent Access (IA)
disaster recovery
backups
One Zone-Infrequent Access
storing secondary backup copies of on-premises data, or data you can recreate
Glacier
Instant Retrieval
milliseconds retrieval
Flexible Retrieval
Expedited - 1 to 5 minutes
Standard - 3 to 5 hours
Bulk - 5 to 12 hours
Deep Archive
Standard - 12 hours
Bulk - 48 hours
Intelligent Tiering
Moves automatically based on usage
Durability and Availability
Lifecycle configurations
Multipart upload
Transfer acceleration
S3 Select
retrieve a subset of an object's data using SQL (server-side filtering)
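A small boto3 sketch of S3 Select (bucket, key, and columns are hypothetical); only the matching rows are sent back, not the whole object:

```python
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-bucket",  # placeholder
    Key="sales.csv",     # placeholder CSV with a header row
    ExpressionType="SQL",
    Expression="SELECT s.product, s.amount FROM S3Object s "
               "WHERE CAST(s.amount AS FLOAT) > 100",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"JSON": {}},
)

# The response is an event stream; Records events carry the filtered rows.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```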
Object Encryption
server-side encryption (SSE-S3)
key is handled, managed, and owned by AWS
server-side encryption (SSE-KMS)
keys handled and managed by AWS KMS
server-side encryption (SSE-C)
keys are fully managed by the customer outside of AWS
HTTPS must be used
client-side encryption
use client libraries such as the Amazon S3 Client-Side Encryption Library
Encryption in transit
SSL/TLS
HTTPS is recommended
Access Points
create access points for prefixes (folders) like "finance" or "sales"
a simple policy can then be used per access point
DynamoDB
Fully managed, highly available with replication across multiple AZs
NoSQL database - not a relational database
Scales to massive workloads, distributed database
Millions of requests per second, trillions of rows, 100s of TB of storage
Fast and consistent in performance (low latency on retrieval)
Integrated with IAM for security, authorization and administration
Enables event driven programming with DynamoDB Streams
Low cost and auto-scaling capabilities
Standard & Infrequent Access (IA) Table Class
Basics
DynamoDB is made of Tables
Each table has a Primary Key (must be decided at creation time)
Each table can have an infinite number of items (= rows)
Each item has attributes (can be added over time – can be null)
Maximum size of an item is 400KB
Data types supported are:
Scalar Types – String, Number, Binary, Boolean, Null
Document Types – List, Map
Set Types – String Set, Number Set, Binary Set
When to use it?
when you have data that needs to be very hot and ingested at scale into a database
Common use cases
Mobile apps
Gaming
Digital ad serving
Live voting
Audience interaction for live events
Sensor networks
Log ingestion
Access control for web-based content
Metadata storage for Amazon S3 objects
E-commerce shopping carts
Anti Pattern
Prewritten application tied to a traditional relational database: use RDS instead
Joins or complex transactions
Binary Large Object (BLOB) data: store data in S3 & metadata in DynamoDB
Large data with low I/O rate: use S3 instead
DynamoDB –Read/Write Capacity Modes
Provisioned Mode (default)
You specify the number of reads/writes per second
You need to plan capacity beforehand
Pay for provisioned
read & write capacity units
Read Capacity Units (RCU)
Examples: see PPT
Strongly Consistent Read vs. Eventually Consistent Read
Eventually Consistent Read (default)
If we read just after a write, it’s possible we’ll get some stale data because of replication
Strongly Consistent Read
If we read just after a write, we will get the correct data
Set “ConsistentRead” parameter to True in API calls (GetItem, BatchGetItem, Query, Scan)
Consumes twice the RCU
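A minimal boto3 sketch of setting that parameter (table and key are placeholders):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# ConsistentRead=True returns the latest committed value but costs 2x the RCU.
resp = dynamodb.get_item(
    TableName="Users",             # placeholder
    Key={"user_id": {"S": "42"}},  # placeholder key schema
    ConsistentRead=True,
)
print(resp.get("Item"))
```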
Formula:
One Read Capacity Unit (RCU) represents one Strongly Consistent Read per second, or two Eventually Consistent Reads per second, for an item up to 4 KB in size
If the items are larger than 4 KB, more RCUs are consumed
If strongly consistent, keep the result as-is
If eventually consistent, divide it by 2
Item sizes in KB must be multiples of 4; if not, round up to the nearest multiple
Formula: RCU = ItemsPerSecond × ceil(ItemSizeInKB / 4). Remember to round the KB up
Write Capacity Units (WCU)
One Write Capacity Unit (WCU) represents one write per second for an item up to 1 KB in size
If the items are larger than 1 KB, more WCUs are consumed
Examples
See PPT
Formula: WCU = ItemsPerSecond × ceil(ItemSizeInKB). Remember to round the KB up (see the sketch below)
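A small sketch of both formulas in Python, applying the rounding rules above:

```python
import math

def rcu(items_per_second: int, item_size_kb: float,
        eventually_consistent: bool = False) -> float:
    """RCU = items/s x ceil(size / 4 KB); halved for eventually consistent reads."""
    units = items_per_second * math.ceil(item_size_kb / 4)
    return units / 2 if eventually_consistent else units

def wcu(items_per_second: int, item_size_kb: float) -> int:
    """WCU = items/s x ceil(size / 1 KB)."""
    return items_per_second * math.ceil(item_size_kb)

print(rcu(10, 4.5))        # 20: 10 strongly consistent reads/s of 4.5 KB (rounds up to 8 KB)
print(rcu(10, 4.5, True))  # 10.0: same reads, eventually consistent (divide by 2)
print(wcu(6, 4.5))         # 30: 6 writes/s of 4.5 KB (rounds up to 5 KB)
```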
On-Demand Mode
Read/writes automatically scale up/down with your workloads
No capacity planning needed
Pay for what you use, more expensive ($$$)
“ProvisionedThroughputExceededException”
Reasons
• Hot Keys – one partition key is being read too many times (e.g., popular item)
• Hot Partitions
• Very large items, remember RCU and WCU depends on size of items
Solutions
Exponential backoff when exception is encountered (already in SDK)
• Distribute partition keys as much as possible
• If RCU issue, we can use DynamoDB Accelerator (DAX)
Index
Global Secondary Index (GSI)
if the GSI is throttled, the main table gets throttled too
can be created after the table already exists
you can define new queries with whatever attributes you need
Local Secondary Index (LSI)
must be created when the table is created and cannot be modified afterwards
it's like adding an extra attribute used only for querying (an alternative sort key)
PartiQL
Allows you to select, insert, update, and delete data in DynamoDB using SQL
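A minimal boto3 sketch (table and attribute names are placeholders):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# PartiQL SELECT; parameters use DynamoDB's typed values.
resp = dynamodb.execute_statement(
    Statement='SELECT * FROM "Orders" WHERE "customer_id" = ?',  # placeholder table
    Parameters=[{"S": "42"}],
)
for item in resp["Items"]:
    print(item)
```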
DAX
Caches the most popular items or queries in DynamoDB
Provides the cache via clusters
Difference with
Amazon ElastiCache
is that DAX is for queries and objects. But if after that you want to do compute, sorting, and so on, you can combine it with ElastiCache; in that case ElastiCache is the better option, because if the query or scan is frequent there is a lot of compute waste. For aggregations, use ElastiCache
DynamoDB Streams
• react to changes in real-time (welcome email to users)
• Analytics
• Insert into derivative tables
• Insert into OpenSearch Service
• Implement cross-region replication
Uses shards like Kinesis Data Streams
but you don't have to provision the shards
TTL (time to live)
Automatically deletes items after an expiry timestamp
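A minimal boto3 sketch of enabling TTL and writing an expiring item (table and attribute names are placeholders):

```python
import time
import boto3

dynamodb = boto3.client("dynamodb")

# Enable TTL once per table; the attribute must hold an epoch-seconds number.
dynamodb.update_time_to_live(
    TableName="Sessions",  # placeholder
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# This item becomes eligible for automatic deletion one hour from now.
dynamodb.put_item(
    TableName="Sessions",
    Item={
        "session_id": {"S": "abc123"},
        "expires_at": {"N": str(int(time.time()) + 3600)},
    },
)
```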
AWS ElastiCache
It's like RDS but for caching; Redis is a cache engine, and so is Memcached
good performance, for caching use cases
Amazon Neptune
Graph database; e.g., for social-network use cases
Timestream
• Fully managed, fast, scalable, serverless time series database
Processing
Lambda
Use cases
Real-time file processing • Real-time stream processing • ETL • Cron replacement • Process AWS events
Anti - Patterns
Long-running applications
• Dynamic websites
Stateful applications
Is stateless
AWS Glue
Serverless discovery and definition of table definitions and schemas
• S3 “data lakes” • RDS • Redshift • DynamoDB • Most other SQL databases
Custom ETL jobs
Trigger-driven, on a schedule, or on demand
• Fully managed
Glue Crawler
scans data in S3 and creates schemas
Glue and S3 Partitions
• Glue crawler will extract partitions based on how your S3 data is organized
Glue + Hive
Hive lets you run SQL-like queries from EMR
The Glue Data Catalog can serve as a Hive "metastore"
You can also import a Hive metastore into Glue
ETL
Transform data, clean data, enrich data
Automatic code generation
• Scala or Python
• Encryption
• Server-side (at rest)
• SSL (in transit)
• Can be event-driven
• Can provision additional “DPU’s” (data processing units) to increase performance of underlying Spark jobs
• Enabling job metrics can help you understand the maximum capacity in DPUs you need
• Errors reported to CloudWatch
• Could tie into SNS for notification
Glue scheduler
Glue trigger events
Deal with ambiguities: ResolveChoice (see the sketch below)
make_cols
creates a new column for each type: e.g., when a field has the same name but a different data type in each source
cast
forces a cast to a specific type
make_struct
creates a structure holding both data types, something like: "myList": [ { "price": 100.00 }, { "price": "$100.00" } ]
project
projects everything to one type, e.g., force everything to be a string
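A PySpark sketch of these options inside a Glue job (database/table names and the ambiguous price column are placeholders):

```python
# Runs inside a Glue job, where the awsglue libraries are available.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

# Placeholder catalog table whose 'price' column has mixed types.
dyf = glue_ctx.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

resolved = dyf.resolveChoice(specs=[("price", "cast:double")])  # cast: force one type
# Other per-column actions: ("price", "make_cols"),
#                           ("price", "make_struct"),
#                           ("price", "project:string")
```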
Modifying the data catalog
you can run the script again to add new partitions, update table schemas, and create new tables
Restrictions: only for S3; JSON, CSV, Avro, Parquet
Running glue jobs
job bookmarks: run the job only on new rows, without reprocessing old ones
Time-based schedules
CloudWatch Events
Cost Model
Billed by the second for crawler and ETL jobs
First million objects stored and accesses are free for the Glue Data Catalog
Development endpoints for developing ETL code charged by the minute
For visual transformations
there is Glue Studio, the standard option, and Glue DataBrew, which is simpler
AWS Lake Formation
• "Makes it easy to set up a secure data lake in days"
• Loading data & monitoring data flows • Setting up partitions • Encryption & managing keys • Defining transformation jobs & monitoring them
• Access control • Auditing • Built on top of Glue
How it works
It takes data from S3, RDS, NoSQL, etc. and runs the Lake Formation blueprint, which does the transformations and everything else. At the end it sends that data to Athena, Redshift, or EMR
Governed Tables
support ACID transactions across multiple tables
Permissions: you can use granular permissions on tables to grant access rights such as insert, delete, etc.
EMR : Elastic MapReduce
Differences between Hadoop and Spark:
Cluster
Master node: manages the cluster
Core node: hosts HDFS data and runs tasks
Task node: runs tasks, does not host data; no risk of data loss when removing it
good use for spot instances
Ways of deployment
Transient clusters
terminate once all steps are complete
• Loading data, processing, storing – then shut down
• Saves money
Long-Running Clusters
must be manually terminated
• Basically a data warehouse with periodic processing on large datasets
• Can spin up task nodes using Spot instances for temporary capacity
• Can use reserved instances on long-running clusters to save $
• Termination protection on by default, auto-termination off
Storage
HDFS
This is ephemeral; data is lost when the cluster is terminated!
EMRFS: Access S3 as if it were HDFS
Allows persistent storage after cluster termination
Local FIle system
Temporary data (buffers, caches, etc)
EBS for HDFS
Deleted when cluster is terminated
Hadoop
MapReduce
Framework for distributed data processing
Maps data to key/value pairs
Reduces intermediate results to final output
Largely supplanted by Spark these days
Used for big data jobs; runs parallel, distributed tasks to filter and operate on large volumes of data
HDFS
Hadoop Distributed File System
Distributes data blocks across the cluster in a redundant manner
Ephemeral in EMR; data lost on termination
EMR Serveless App Lifecycle
Manual: create app, start app, stop app, and delete app. If you forget to do this, it keeps running and costing money
EMR on EKS
Fully Managed
No need to provision clusters
Apache Pig
Sits on top of MapReduce; lets you script operations at a high level instead of hand-writing lengthy MapReduce code
Hbase
Non-relational database
In-memory
Hive integration
Presto
Connects to many data sources and databases; interactive queries at petabyte scale
query tool
Apache Zeppelin is like iPython notebooks
You can run SQL directly against SparkSQL and visualize results in charts and graphs
it's like a notebook for big data
EMR Notebook
Notebooks saved in S3; similar to Zeppelin but with more AWS integration
More tools
Hue
Hadoop User Experience; a graphical front end
Splunk
an operational tool to view EMR and S3 data using your EMR Hadoop cluster
Flume
Another way to stream data in your cluster
MXnet
Like TensorFlow, a library for building and accelerating neural networks
Included in EMR
S3DistCP
Tool for copying large amounts of data from S3 into HDFS and from HDFS to S3
More tools
• Ganglia (monitoring)
• Mahout (machine learning)
• Accumulo (another NoSQL database)
• Sqoop (relational database connector)
• HCatalog (table and storage management for Hive metastore)
• Kinesis Connector (directly access Kinesis streams in your scripts)
• Tachyon (accelerator for Spark)
• Derby (open-source relational DB in Java)
• Ranger (data security manager for Hadoop)
Apache Spark
Distributed processing framework for big data
• In-memory caching, optimized query execution
• Supports Java, Scala, Python, and R
• Supports code reuse across:
• Batch processing
• Interactive Queries: Spark SQL
• Real-time Analytics
• Machine Learning : MLLib
• Graph Processing
• Spark Streaming : Integrated with Kinesis, Kafka, on EMR
• Spark is NOT meant for OLTP: it's not for transactions; it's for processing large volumes of data in seconds or minutes
How Spark Works
• Spark apps are run as independent processes on a cluster
• The SparkContext (driver program) coordinates them
• SparkContext works through a Cluster Manager
• Executors run computations and store data
• SparkContext sends application code and tasks to executors
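A minimal PySpark sketch of that flow: creating the session starts the driver, which coordinates the executors that run the map/reduce tasks (the input path is a placeholder):

```python
from pyspark.sql import SparkSession

# Creating the session starts the driver (SparkContext), which talks to the
# cluster manager; executors then run the tasks below and store the data.
spark = SparkSession.builder.appName("word-count").getOrCreate()

lines = spark.read.text("s3://my-bucket/input.txt")   # placeholder path
counts = (
    lines.rdd.flatMap(lambda row: row.value.split())  # map: line -> words
    .map(lambda word: (word, 1))                      # map: word -> (word, 1)
    .reduceByKey(lambda a, b: a + b)                  # reduce: sum counts per word
)
print(counts.take(10))
spark.stop()
```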
Components
Spark Core
Memory management, fault recovery, scheduling, distribute & monitor jobs, interact with storage Scala, Python, Java, R
Spark Streaming
Real-time streaming analytics Structured streaming Twitter, Kafka, Flume, HDFS, ZeroMQ
Spark SQL
Up to 100x faster than MapReduce JDBC, ODBC, JSON, HDFS, ORC, Parquet, HiveQL
MLLib
Classification, regression, clustering, collaborative filtering, pattern mining Read from HDFS, HBase
GraphX
Graph Processing ETL, analysis, iterative graph computation No longer widely used
Integrations
Kinesis
Dynamo
Athena
Apache Hive
Uses familiar SQL syntax (HiveQL)
Lets you write SQL on top of MapReduce
Metastore is stored in MySQL on the master node by default
The metastore can also be stored in AWS services: Glue Data Catalog, RDS, etc.
Data Pipeline
• Destinations include S3, RDS,
DynamoDB, Redshift and EMR
• Manages task dependencies • Retries and notifies on failures • Cross-region pipelines • Precondition checks • Data sources may be on-premises • Highly available
Activities
EMR: e.g., how to start the cluster to perform operations
Hive queries
Copy data between data sources
SQL
Scripts
Step Functions
Used to design workflows
Easy visualizations
Max execution time of a State Machine is 1 year
Analysis
Amazon Kinesis Data Analytics for SQL Applications
Apply SQL to the data; you can also attach a Lambda and send the results on to other destinations
Managed service for Apache Flink: it is serverless
Flink is a framework for processing data streams
Amazon Kinesis Data Analytics
common use-cases
Streaming ETL
Continuous metric generation
Responsive Analytics
RANDOM_CUT_FOREST
• SQL function used for anomaly detection on numeric columns in a stream
• They’re especially proud of this because they published a paper on it
• It’s a novel way to identify outliers in a data set so you can handle them however you need to
• Example: detect anomalous subway ridership during the NYC marathon
Amazon Opensearch Service
(formerly Elasticsearch) - Petabyte-scale analysis and reporting -
What Is Opensearch?
• A fork of Elasticsearch and Kibana
• A search engine
• An analysis tool • A visualization tool (Dashboards = Kibana)
• A data pipeline • Kinesis replaces Beats & LogStash
• Horizontally scalable
OpenSearch - use cases
• Full-text search • Log analytics • Application monitoring • Security analytics • Clickstream analytics
Concepts
Documents can also be JSON
types
e.g., a document, a log entry, an article, etc.
Indices
an index is split into shards
documents are hashed to a particular shard
each shard may be on a different node in a cluster
every shard is a self-contained Lucene index of its own
Features
• Fully-managed (but not serverless) - There is a separate serverless option now
• Scale up or down without downtime - But this isn’t automatic
• Pay for what you use - Instance-hours, storage, data transfer
• Network isolation
• AWS integration - S3 buckets (via Lambda to Kinesis) - Kinesis Data Streams - DynamoDB Streams - CloudWatch / CloudTrail - Zone awareness
Anti-patterns
• OLTP • No transactions • RDS or DynamoDB is better: not for transactions
• Ad-hoc data querying • Athena is better for those queries
Remember Opensearch is primarily for search & analytics
Storage
Hot
the standard tier
UltraWarm
• UltraWarm (warm) storage uses S3 + caching
Best for indices with few writes (like log data / immutable data)
• Slower performance but much lower cost
• Must have a dedicated master node
Cold storage
• Also uses S3
• Even cheaper
• For “periodic research or forensic analysis on older data”
• Must have dedicated master and have UltraWarm enabled too.
• Not compatible with T2 or T3 instance types on data nodes
• If using fine-grained access control, must map users to cold_manager role in OpenSearch Dashboards
Supported ways to import data into ES domain
Kinesis, DynamoDB, Logstash / Beats, and Elasticsearch's native API's offer means to import data into Amazon ES.
Amazon Athena
What is Athena?
• Interactive query service for S3 (SQL)
Serverless!
• Supports many data formats • CSV, TSV (human readable) • JSON (human readable) • ORC (columnar, splittable) • Parquet (columnar, splittable) • Avro (splittable) • Snappy, Zlib, LZO, Gzip compression
Unstructured, semi-structured, or structured
Athena + Glue
Integrates with the Glue Data Catalog, schemas, etc.
Athena Workgroups
organize users, teams, apps, workloads
Integrates with IAM, CloudWatch, SNS
Each group can have its own
query history
data limit
IAM policies
Encryption settings
anti-patterns
visualization
quicksight
etl
glue
ACID transactions support
Powered by Apache Iceberg
• Just add ‘table_type’ = ‘ICEBERG’ in your CREATE TABLE command
Remember governed tables in Lake Formation? This is another way of getting ACID features in Athena.
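A boto3 sketch of creating an Iceberg table through Athena (bucket and table names are placeholders):

```python
import boto3

athena = boto3.client("athena")

# 'table_type' = 'ICEBERG' is what enables ACID operations on this table.
athena.start_query_execution(
    QueryString="""
        CREATE TABLE sales_iceberg (id int, amount double)
        LOCATION 's3://my-bucket/iceberg/'
        TBLPROPERTIES ('table_type' = 'ICEBERG')
    """,
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},  # placeholder
)
```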
Redshift
A POWERFUL data warehouse
Designed for OLAP, not OLTP
SUPER FAST, SCALABLE, and SUPER CHEAP
SQL,ODBC,JDBC interfaces
Use cases
• Accelerate analytics workloads
• Unified data warehouse & data lake
• Data warehouse modernization
• Analyze global sales data
• Store historical stock trade data
• Analyze ad impressions & clicks
• Aggregate gaming data
• Analyze social trends
Composed of clusters
leader node
compute nodes
Spectrum
query exabytes of unstructured data in S3 without loading
wide variety of data formats
Perfomance
• Massively Parallel Processing (MPP) • Columnar Data Storage • Column Compression
Durability
• Replication within cluster • Backup to S3
Automated snapshots
Scaling Redshift
Vertical and horizontal scaling
on demand
Distribution Styles
Auto
based on the size of the data, Redshift makes the decision for you
Even
round-robin fashion; good when the table has no relation to other tables or joins
Key
rows are distributed according to the value in one column; useful for queries based on that column, since rows with equal values end up in the same place, making the query faster
All
a copy of the table is placed on every node, to speed up joins
Importing / Exporting data
• COPY command : Parallelized; efficient • From S3, EMR, DynamoDB, remote hosts • S3 requires a manifest file and IAM role
it's the most efficient option
use it for large amounts of data from outside of Redshift
it can decrypt data loaded from S3 (see the sketch after this list)
• Gzip, lzop, and bzip2 compression supported to speed it up further
• Automatic compression option • Analyzes data being loaded and figures out optimal compression
scheme for storing it
• Special case: narrow tables (lots of rows, few columns) • Load with a single COPY transaction if possible • Otherwise hidden metadata columns consume too much space
• UNLOAD command • Unload from a table into files in S3
• Enhanced VPC routing
forces the COPY/UNLOAD traffic to go through your VPC; it must be configured correctly so traffic doesn't go over the public internet
• Auto-copy from Amazon S3
monitors the S3 bucket; when new data arrives, it is loaded automatically
• Amazon Aurora zero-ETL integration • Auto replication from Aurora -> Redshift
• Redshift Streaming Ingestion • From Kinesis Data Streams or MSK
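A sketch of issuing a COPY through the Redshift Data API (cluster, table, manifest, and role ARN are placeholders):

```python
import boto3

redshift_data = boto3.client("redshift-data")

# COPY loads from S3 in parallel; MANIFEST points at an explicit file list,
# and GZIP tells Redshift the files are compressed.
redshift_data.execute_statement(
    ClusterIdentifier="my-cluster",  # placeholder
    Database="dev",
    DbUser="awsuser",
    Sql="""
        COPY sales
        FROM 's3://my-bucket/sales/manifest'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        MANIFEST GZIP
    """,
)
```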
Redshift copy grants for cross-region
snapshot copies
• Let’s say you have a KMS-encrypted Redshift cluster and a snapshot of it
• You want to copy that snapshot to another region for backup
• In the destination AWS region: 1. Create a KMS key if you don’t have one already 2. Specify a unique name for your snapshot copy grant 3. Specify the KMS key ID for which you’re creating the copy grant
• In the source AWS region: 1. Enable copying of snapshots to the copy grant you just created
DBLINK
Good way to copy and sync data between PostgreSQL and Redshift
Integration
• S3
• DynamoDB
• EMR / EC2
• Data Pipeline
• Database Migration Service
Workload Management (WLM)
Used to prioritize queries: short and fast vs. long and slow. Creates query queues, managed via console, CLI, or API
Concurrency scaling
• Automatically adds cluster capacity to handle increase in concurrent read queries
• Support virtually unlimited concurrent users & queries
• WLM queues manage which queries are sent to the concurrency scaling cluster
Automatic Workload Management
Creates up to 8 queues
Default 5 queues with even memory allocation
• Large queries (i.e., big hash joins) -> concurrency lowered
• Small queries (i.e., inserts, scans, aggregations) -> concurrency raised
• Configuring query queues
• Priority
• Concurrency scaling mode
• User groups
• Query groups
• Query monitoring rules
Short Query Acceleration (SQA)
• Prioritize short-running queries over longer-running ones
• Short queries run in a dedicated space, won’t wait in queue behind long queries
• Can be used in place of WLM queues for short queries
• Works with: • CREATE TABLE AS (CTAS) • Read-only queries (SELECT statements)
• Uses machine learning to predict a query’s execution time
• Can configure how many seconds is “short”
VACUUM command
Recovers space from deleted rows
Anti-patters
• Small data sets • Use RDS instead
• OLTP • Use RDS or DynamoDB instead
• Unstructured data • ETL first with EMR etc.
• BLOB data • Store references to large binary files in S3, not the files themselves.
Resizing
Elastic resize
cluster is down for a few minutes
Quickly add or remove nodes of same type
Classic resize
Change node type and/or number of nodes
Cluster is read-only for hours to days
Snapshot, restore, resize
Used to keep cluster available during a classic resize
Copy cluster, resize new cluster
AQUA
Advanced query accelerator
it sits close to the S3 servers, and is much faster
Materialized view
similar to views, but precomputed: you take a large, expensive database query and put it in this view so it's much faster; also useful for complex queries
Lambda UDF
Use custom functions in AWS Lambda inside SQL queries
Federated Queries
query and analyze across databases, warehouses, and lakes
works with Aurora, RDS, MySQL
incorporates live data in RDS into your Redshift queries
Visualization
QuickSight
Spice
Data sets are imported into SPICE •
Super-fast, Parallel, In-memory Calculation Engine
• Uses columnar storage, in-memory, machine code generation • Accelerates interactive queries on large datasets
Quicksight + Redshift: Security
• By default Quicksight can only access data stored IN THE SAME REGION as the one Quicksight is running within
• So if Quicksight is running in one region, and Redshift in another, that’s a problem
• A VPC configured to work across AWS regions won’t work!
• Solution:
create a new security group with an inbound rule authorizing access from the IP range of QuickSight servers in that region
Another solution: create a private subnet in a VPC and then use VPC peering; this requires the Enterprise edition
Another solution is to use a Transit Gateway to connect subnets, or use AWS PrivateLink, or VPC sharing
QuickSight Q
It's like a ChatGPT-style interface for your data: ML-powered, you can ask questions based on natural language processing
the datasets and their fields must be NLP-friendly
Security
Encryption in flight (TLS / SSL)
• Data is encrypted before sending and decrypted after receiving
• TLS certificates help with encryption (HTTPS)
• Encryption in flight ensures no MITM (man in the middle attack)can happen
Server-side encryption at rest
Data is encrypted after being received by the server
• Data is decrypted before being sent
• It is stored in an encrypted form thanks to a key (usually a data key)
• The encryption / decryption keys must be managed somewhere, and the server must have access to them
Client-side encryption
• Data is encrypted by the client and never decrypted by the server
• Data will be decrypted by a receiving client
• The server should not be able to decrypt the data
• Could leverage Envelope Encryption
S3 Encryption for Objects
• SSE-S3: encrypts S3 objects using keys handled & managed by AWS
• Object is encrypted server side
• AES-256 encryption type
• Must set header: "x-amz-server-side-encryption": "AES256"
• SSE-KMS: leverage AWS Key Management Service to manage encryption keys
• KMS Advantages: user control + audit trail
• Object is encrypted server side
• Must set header: "x-amz-server-side-encryption": "aws:kms"
• SSE-C: when you want to manage your own encryption keys
• SSE-C: server-side encryption using data keys fully managed by the customer outside of AWS
• Amazon S3 does not store the encryption key you provide
• HTTPS must be used
• Encryption key must be provided in HTTP headers, for every HTTP request made
• Client-Side Encryption
• Client library such as the Amazon S3 Encryption Client
• Clients must encrypt data themselves before sending to S3
• Clients must decrypt data themselves when retrieving from S3
• Customer fully manages the keys and encryption cycle (see the sketch below)
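A minimal boto3 sketch of the server-side options above (bucket, key, and KMS key are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# SSE-KMS: boto3 sets the x-amz-server-side-encryption header for you.
s3.put_object(
    Bucket="my-bucket",          # placeholder
    Key="report.csv",
    Body=b"col1,col2\n1,2\n",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-key",  # placeholder KMS key
)

# SSE-S3 instead: ServerSideEncryption="AES256" (no key ID needed).
```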
Encryption in transit (SSL/TLS)
It is supported, you can use it, and it's recommended; there is an HTTPS endpoint
AWS KMS (Key Management Service)
• Anytime you hear “encryption” for an AWS service, it’s most likely KMS
• AWS manages encryption keys for us
• Fully integrated with IAM for authorization
• Easy way to control access to your data
• Able to audit KMS Key usage using CloudTrail
• Seamlessly integrated into most AWS services (EBS, S3, RDS, SSM…)
• Never ever store your secrets in plaintext, especially in your code!
KMS Keys Types
Symmetric (AES-256 keys)
• Single encryption key that is used to Encrypt and Decrypt
• You never get access to the KMS Key unencrypted (must call KMS API to use)
Asymmetric (RSA & ECC key pairs)
• Public (Encrypt) and Private Key (Decrypt) pair
• Used for Encrypt/Decrypt, or Sign/Verify operations
• The public key is downloadable, but you can't access the Private Key unencrypted
• Use case: encryption outside of AWS by users who can't call the KMS API
Copying snapshots across regions
KMS is not multi-region, so you take a snapshot in region 1, then copy it to region 2, and there the KMS key is a different one that lives in that other region
CloudHSM: for provisioning dedicated encryption hardware
STILL MISSING: a security deep dive for each service covered!!
AWS STS - Security Token Service
these are the temporary tokens used day-to-day in SSO
Heavily used for federation
Also used to grant temporary access to IAM accounts: the user logs in and receives a token valid for 15 minutes to one hour to use that account
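A minimal boto3 sketch of requesting such a temporary token (the role ARN is a placeholder):

```python
import boto3

sts = boto3.client("sts")

# Temporary credentials for 15 minutes (the minimum DurationSeconds).
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/analytics-role",  # placeholder
    RoleSessionName="temp-session",
    DurationSeconds=900,
)["Credentials"]

# Use the temporary credentials with any AWS client:
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```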
Identity Federation
what is it?
• Federation lets users outside of AWS assume a temporary role for accessing AWS resources.
• These users assume an identity-provider-granted access role.
Kinesis Enhanced Fan Out
Each consumer gets 2 MB/s of provisioned throughput per shard
No more 2 MB/s limit shared across consumers
reduced latency (~70 ms)