DB Specialty
RDS Aurora
High Level Architecture
6 copies of our data are stored across 3 AZs, using the shared cluster storage with self-healing
Read nodes are promoted to the Write node if the Write node fails and the DNS records are automatically updated without you having to worry about any of it
When you create an Aurora cluster, you get two endpoints - one for the writer node and the other for the read replicas
You don't have to specify storage when you use Aurora. It has auto-scaling, self-healing cluster volume storage, which is automatically increased in 10GB increments to accommodate our demand
Auto-scales to handle I/O requirements
Extremely easy to use
Supports up to 15 read nodes
Getting Started
Aurora Serverless
On-Demand autoscaling, good for intermittent or unpredictable workloads, configurable capacity rated in ACUs (Aurora Capacity Units). 1 ACU = a combination of approximately 2 GB of memory, corresponding CPU, and networking
When the users connect, they are actually connecting to the Proxy Fleet, which then routes the requests to one of the Aurora instances.
As the workload changes, DB instances are added and returned to the designated DB Pool automatically
Aurora Global DB
Have 1 primary region for writes and up to 5 secondary read-only regions
Data is replicated automatically from the primary to the secondary region using Aurora's dedicated infrastructure
Low latency reads in other regions
The easiest way for SaaS applications to integrate with Aurora is via Aurora's Data API
Aurora Read Replicas
Adding reader nodes provides us with horizontal scalability. The latency is minimal since cluster storage is shared among the reader nodes, which means that a newly added node does not have to wait for the storage to be provisioned
Create Aurora Replica from RDS Source (e.g. RDS MySQL -> Aurora MySQL)
Read Replicas can be used as a near-zero downtime migration option. Not really used for ongoing replication.
An option with some downtime is to stop the writes to the primary RDS MySQL instance, take a snapshot and restore the snapshot to an Aurora instance
Replication from on-prem MySQL DB to Aurora MySQL (a sketch of the setup follows these steps)
Enable binary logging (binlog)
Export data from source instance
Import the dump to Aurora cluster
Configure replication between self-hosted source and Aurora target
When replication lag reaches 0, stop write activity to self-hosted db
Update endpoint in application for the cutover
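A minimal sketch of the replication setup against the Aurora target, assuming the dump has already been imported and that the RDS-provided stored procedures (mysql.rds_set_external_master, mysql.rds_start_replication) are available; the hostnames, credentials and binlog coordinates are placeholders.

```python
# Sketch: configure an Aurora MySQL cluster to replicate from a self-hosted source.
# Assumes pymysql is installed; endpoints, credentials and binlog coordinates are placeholders.
import pymysql

conn = pymysql.connect(
    host="my-aurora-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",  # hypothetical writer endpoint
    user="admin",
    password="********",
)

with conn.cursor() as cur:
    # Point Aurora at the self-hosted source using the binlog file/position
    # captured when the dump was taken (e.g. from SHOW MASTER STATUS).
    cur.execute(
        "CALL mysql.rds_set_external_master("
        "'onprem-mysql.example.com', 3306, "
        "'repl_user', 'repl_password', "
        "'mysql-bin-changelog.000123', 154, 0)"
    )
    # Start pulling changes from the source.
    cur.execute("CALL mysql.rds_start_replication")
conn.commit()

# Monitor replication lag (Seconds_Behind_Master); when it reaches 0,
# stop writes on the source and cut the application over to Aurora.
```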
Replication Lag
In highly concurrent or lock-heavy environments, you may find replication lag to be a concern.
Occurs in RDS to Aurora replication and cross-region Aurora replication
Can be used as failover targets in Aurora. Instances can be allocated across multiple AZs.
Configurable failover priority between tier 0 and tier 15, which assigns specific priority to replicas in the event of a failover. The highest priority replicas (e.g. tier 0) will get promoted to primary.
While Aurora MySQL uses binlog replication, Aurora PostgreSQL uses replication slots.
Replication in cluster is done at storage level and it's only subject to AuroraReplicaLag, which is generally less than 10ms
Managing an Aurora Cluster
Storage Cluster Metrics i.e. for the Cluster volumes (persistent data), not the ebs volumes of the Aurora nodes (temporary data)
VolumeBytesUsed
VolumeWriteIOPS
VolumeReadIOPS
DB Instance Metrics
Various Utilization Metrics (CPU, freeable memory, etc.)
AbortedClients
AuroraReplicaLag
Blocked Transactions
Commit latency
Free local storage
DML and DDL latency provided in Aurora MySQL
MaximumUsedTransactionIDs - important in PostgreSQL
When you see a Low Free Storage notification, it means that the logs or temp tables have consumed a lot of your temp storage (ebs volumes of the db nodes)
Fault Injection Queries - a unique feature in Aurora that allow us to crash the DB, overwhelm it or simulate failure of various components to test the resilience
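For illustration, a hedged sketch of issuing a fault injection query against an Aurora MySQL instance; the endpoint is a placeholder and the exact statement syntax should be checked against the Aurora documentation for your engine version.

```python
# Sketch: run an Aurora MySQL fault injection query to test application resilience.
# Assumes pymysql is installed; connection details are placeholders.
import pymysql

conn = pymysql.connect(
    host="my-aurora-instance.xxxxxxxx.us-east-1.rds.amazonaws.com",  # hypothetical instance endpoint
    user="admin",
    password="********",
)
with conn.cursor() as cur:
    # Crash the DB instance (it restarts automatically; the connection will drop).
    cur.execute("ALTER SYSTEM CRASH INSTANCE")
    # Or simulate a read replica failure for a short interval instead:
    # cur.execute("ALTER SYSTEM SIMULATE 100 PERCENT READ REPLICA FAILURE FOR INTERVAL 1 MINUTE")
```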
Backups are not taken while a cluster is stopped.
Multi-Master or Global Database clusters cannot be stopped
If we delete the last instance in the compute cluster from the console, the entire cluster is deleted. In contrast, if we do that via CLI or SDK, the compute cluster will still exist but there will be no instances allocated to it and the storage cluster will still remain. Think of this storage as cold storage that is not used, but we are getting billed for it. When we want to continue using it, we can just add a node to the compute cluster and have immediate access to our data in the storage cluster
Backup and Restore
Backup
Backup retention period is 1 to 35 days
Automatically taken backups
Unlike in RDS, the backups are taken continuously
These automated backups cannot be disabled
Backups are created from the Storage Cluster
PITR
Mostly works like RDS
Latest Restorable Time (LRT)
New cluster is created (instead of an existing cluster being rolled back)
Can be time intensive if we are restoring to a point with a ton of data
Backtrack
Feature that allows us to roll the cluster back to a previous state (rewinds the cluster storage in place, instead of a new cluster being created)
Faster than PITR
Can move both backward and forward. E.g. backtrack to 2hrs ago and then go forward by an hour
It interrupts the cluster operation
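A minimal boto3 sketch of rewinding a cluster with Backtrack; it assumes the cluster was created with a backtrack window enabled, and the identifier and timestamp are placeholders.

```python
# Sketch: backtrack an Aurora MySQL cluster to an earlier point in time (in place).
import boto3
from datetime import datetime, timezone

rds = boto3.client("rds")

rds.backtrack_db_cluster(
    DBClusterIdentifier="my-aurora-cluster",                        # placeholder cluster name
    BacktrackTo=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),   # target point in time
)

# Check the progress of the backtrack operation.
for bt in rds.describe_db_cluster_backtracks(
    DBClusterIdentifier="my-aurora-cluster"
)["DBClusterBacktracks"]:
    print(bt["BacktrackIdentifier"], bt["Status"])
```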
Cluster Cloning
Creates a new cluster. Useful for testing purposes
Faster creation than PITR
Connects to source storage cluster and uses copy on write protocol where both clusters share the cluster volume until the data starts to change
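A hedged boto3 sketch of creating a clone via the copy-on-write restore type; cluster names are placeholders, and a DB instance still has to be added to the new cluster before it can serve queries.

```python
# Sketch: clone an Aurora cluster using copy-on-write (much faster than a snapshot restore).
import boto3

rds = boto3.client("rds")

# Create the clone cluster that shares the source's storage volume.
rds.restore_db_cluster_to_point_in_time(
    SourceDBClusterIdentifier="prod-cluster",    # placeholder source
    DBClusterIdentifier="prod-cluster-clone",    # placeholder clone name
    RestoreType="copy-on-write",
    UseLatestRestorableTime=True,
)

# The clone has no compute yet - add at least one instance to query it.
rds.create_db_instance(
    DBInstanceIdentifier="prod-cluster-clone-1",
    DBClusterIdentifier="prod-cluster-clone",
    DBInstanceClass="db.r5.large",
    Engine="aurora-mysql",
)
```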
MySQL Tools
Native MySQL tools
Inefficient for data larger than 10 GB
Requires source DB interruption
We can use MySQL native mysqldump and mysqlimport to migrate data from any MySQL instance to Aurora
Percona XtraBackup
Uses innobackupex to create the backup
Backup can be loaded directly from S3
Intended for new cluster target
Load data from S3
Load flat files like CSV-formatted data from S3 (see the sketch after this list)
Must already have a running cluster to be able to run this command
Data can be unloaded from Aurora to S3 in a similar fashion
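A sketch of the Aurora MySQL LOAD DATA FROM S3 statement run through a client; it assumes the cluster already has an IAM role associated that allows reading from the bucket, and all names are placeholders.

```python
# Sketch: load a CSV flat file from S3 into an Aurora MySQL table.
# Assumes the cluster has an S3-read IAM role associated (e.g. via the
# aurora_load_from_s3_role / aws_default_s3_role cluster parameters).
import pymysql

conn = pymysql.connect(
    host="my-aurora-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",  # placeholder
    user="admin",
    password="********",
    database="sales",
)
with conn.cursor() as cur:
    cur.execute(
        """
        LOAD DATA FROM S3 's3://my-bucket/exports/orders.csv'
        INTO TABLE orders
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\\n'
        IGNORE 1 LINES
        """
    )
conn.commit()
```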
PostgreSQL Tools
Native PostgreSQL
pg_dump and pg_restore
Handles large datasets
Requires source DB interruption
Also has the ability to import csv-formatted flat files from S3
VPC
After DB is created, you cannot change the VPC. This is different from RDS, where we can perform instance modification to migrate resources to a different VPC
Read Replica VPC Migration
Create a new Aurora cluster in the VPC that we want to use
When the Replication Lag = 0, the replica has caught up with the source (in our old Cluster 1 hosted in the VPC we don't want to use anymore)
At that point, we drain the connections on the old cluster and update the DNS record our application uses so that it points to the new Aurora cluster. That's it - you migrated the Aurora cluster from one VPC to another with little to no downtime.
Endpoints
Cluster Endpoint - always resolves to the writer node
Reader Endpoint - always resolves to one of the read replicas
Custom Endpoints - allows us to define load balancing between the instances ourselves (e.g. when we use different instance types and we want to customize load distribution). You can create up to 5 custom endpoints.
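A boto3 sketch of defining a custom endpoint that groups specific reader instances (for example, the larger instances reserved for analytics); all identifiers are placeholders.

```python
# Sketch: create a custom Aurora endpoint that only routes to selected readers.
import boto3

rds = boto3.client("rds")

resp = rds.create_db_cluster_endpoint(
    DBClusterIdentifier="my-aurora-cluster",           # placeholder cluster
    DBClusterEndpointIdentifier="analytics-readers",   # placeholder endpoint name
    EndpointType="READER",
    StaticMembers=["my-aurora-reader-3", "my-aurora-reader-4"],  # the big reader instances
)

# The endpoint address to hand to the analytics application:
print(resp["Endpoint"])
```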
Aurora Global Database
Primary writer region and up to 5 secondary read-only regions
Dedicated fully managed infrastructure is used for replication
Low replication latency (typically less than a second)
RPO of 1s and RTO of 1min, a huge perk for DR planning
Data transfer fees for replication to the read-only secondary regions do not apply for Aurora Global Database
Write forwarding - when an application connects to the secondary region and submits write actions (INSERT, DELETE, UPDATE), the secondary region actually forwards these queries in the background to the primary region. These changes, when the primary region applies them, will then get replicated to the secondary region.
Limitations
Can't use cloning
Can't do Backtrack
No support for parallel query (specific to MySQL)
Can't use Aurora Serverless
Can't Stop/Start the Aurora DB
Aurora Serverless
When you connect to the Aurora Serverless endpoint, you are actually connecting to the Proxy Fleet, which then routes the request to the compute cluster whose capacity is measured in ACUs (Aurora Capacity Units)
Define min and max ACU for autoscaling purposes. Autoscaling uses this info to scale based on CPU Utilization and DB connections. The instances are added from the Warm DB pool.
There is no cooldown for scaling up, but there is cooldown for scale down operations
Scale down cooldown period after a Scale Up is 15mins. If you just scaled up and the utilization drops, you cannot scale down in the next 15mins. If the last operation was Scale Down, then we need to wait only 310 sec in order to scale down again.
Scaling Timeout - It happens when Aurora cannot decide on the right scaling point to meet the required capacity. This can happen due to a long running query or an open transaction. Solution: either wait and retry or activate "Force scaling" (can cause interruptions).
Pause compute capacity after # of consecutive minutes of inactivity. When this is activated, you only get charged for storage. A great solution for infrequent/sporadic workloads.
Created within a VPC and without a public IP address. If you need to connect to it via Lambda, you will need an elastic network interface to connect Lambda to your private VPC.
Security
We can force SSL to ensure it's used on every connection to DB
Can't disable or enable encryption on an existing instance. But, we can restore a snapshot and then enable encryption.
When sharing an encrypted snapshot, we need to have access to the KMS key used for encryption in the target region.
Replica always uses the same encryption config as the source. E.g. if the source is encrypted, the replica must be encrypted, as well.
Aurora Cloning
When we create a clone cluster, the clone shares all data that exists up to that point with the source cluster (they both have access to the same storage volumes). As soon as the source starts receiving data changes, a new storage volume is created for it to capture any changes that occur after cloning. Similarly, if the clone receives changes, it also gets its own storage volume to capture its post-cloning changes. That way, neither the source nor the clone has access to the other's changes made after the cloning operation.
Clones cannot be created cross-region and the clone has to be created in the same AZ as the source
We can create up to 15 clones in total
It is possible to clone a clone
Clones can be created in other VPCs
Because it uses the same storage volume as the source, provisioning a clone is much faster than snapshot restore operation where all the data has to be loaded into a new volume
Clones can be shared across accounts
Use cases
Test schema changes by cloning a cluster
Parameter Group Experimentation
Disruptive Workloads - if you need to run heavy queries, clone your prod cluster and run queries on the clone
Cross Account
Outside Data Access - Provide access to contractors using AWS Resource Access Management (RAM) so they can clone the prod cluster instead of accessing it directly.
Faster than snapshots - Use cloning to more quickly move data between AWS Accounts
Cloned clusters can be backtracked
Aurora Troubleshooting
Connection Remains Open (The application still shows a connection, but the DB has gone down) - Without a TCP keepalive configured, our application will continue waiting for the response for the submitted query.
Application Write Failures (Application continues to connect to the old primary after a failover) - Somewhere in our networking layer we are caching DNS or have excess TTL, which results in stale records. This can impact the time to complete failovers.
Poor Performance after Failover (While the failover is fast, there is a significant performance degradation after the instance begins accepting connections) - Possible stale buffer cache on newly promoted primary. The cache must be rebuilt to reduce queries retrieving data from disk. Enable Cluster Cache Management (CCM) in PostgreSQL.
Optimize JDBC connections for fast failover
Aurora OOM Crashes (Crashes happen when we try to import a mysqldump, and during peak hours) - There is a low amount of freeable memory and there is an increase in the "swap" metric.
You can share an Aurora DB cluster with another AWS account or AWS organization. By sharing this way, you can clone the DB cluster and access the clone from the other account or organization.
In MySQL Aurora, you can create a stored procedure that gets triggered on each new row insert and that calls mysql.lambda_async function to invoke a Lambda function. Lambda function can have Aurora as a trigger and it can send a notification to an SNS topic on each record creation.
Disaster Recovery
Objectives
RPO (Recovery Point Objective) - The maximum amount of data or time that we can lose without impacting the business operations
RTO (Recovery Time Objective) - The maximum amount of time that the recovery process can take, i.e. "For how long can we be down?"
Monitoring - How will you monitor for failures/problems?
Backups - How will you set up your backup to meet your RPO and RTO requirements?
Resiliency - How many AZs or regions do you use?
Reliability Design Principles
Automatically recover from failure
Scale Horizontally and use smaller resources, rather than huge monoliths
Identify Capacity Requirements to have a clear understanding of your system's requirements.
Perform changes through automation (e.g. via CloudFormation) to have change management, etc.
Test your recovery strategy
Migration
Methods
RDS Snapshot Migration - very common for homogeneous migrations
Managed point and click service available through the AWS Console
Best migration speed and ease
Can be used with binary log replication for near zero migration downtime
Percona XtraBackup - popular in migration from MariaDB to MySQL and On-prem MySQL to RDS
Managed backup ingestion from Percona XtraBackup files stored in an S3 bucket
High Performance and can be used with binary log replication for near zero migration downtime
Other self-managed Export/Import options
Schemas can be migrated as-is without conversion
Data migration can be performed manually using existing, well documented command line utilities
Can be used with binary log replication for near-zero migration downtime
DMS
Generally, you don't want to use DMS for homogeneous migrations as your first option since it's a bit more complex than the simple tools like snapshot migration and others listed above
Managed point and click data migration service
Schemas must be migrated separately (using SCT). The two steps in migration are Schema Migration and Data Migration.
Supports CDC replication for near-zero migration downtime
The best tool for heterogeneous migrations to Aurora (e.g. DB2 on prem to Aurora MySQL)
DMS
Can be used to consolidate multiple DBs into a single one
At least 1 endpoint in the migration must reside in AWS
Provides automatic failover for the replication server
Ensures data security at rest and in transit
Steps (a boto3 sketch follows this list)
Create a replication server
Create source and target endpoint that have your data stores
Create one or more migration tasks to migrate data between the source and target data store
Extra Connection Attributes (ECA) can be specified to customize the behavior of the endpoints (e.g. whether or not to enable supplemental logging)
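A condensed boto3 sketch of the steps above; the instance class, endpoint connection details and identifiers are placeholders, and in practice you would wait for each resource to become available before creating the task.

```python
# Sketch: create a DMS replication instance, source/target endpoints and a full-load + CDC task.
import json
import boto3

dms = boto3.client("dms")

instance = dms.create_replication_instance(
    ReplicationInstanceIdentifier="mig-instance",       # placeholder
    ReplicationInstanceClass="dms.c5.large",
    AllocatedStorage=100,
)["ReplicationInstance"]

source = dms.create_endpoint(
    EndpointIdentifier="onprem-mysql-source",
    EndpointType="source",
    EngineName="mysql",
    ServerName="onprem-mysql.example.com",              # placeholder
    Port=3306,
    Username="dms_user",
    Password="********",
)["Endpoint"]

target = dms.create_endpoint(
    EndpointIdentifier="aurora-target",
    EndpointType="target",
    EngineName="aurora",
    ServerName="my-aurora.cluster-xxxx.us-east-1.rds.amazonaws.com",  # placeholder
    Port=3306,
    Username="admin",
    Password="********",
)["Endpoint"]

# Migrate every table in every schema, then keep replicating changes (CDC).
table_mappings = {
    "rules": [{
        "rule-type": "selection", "rule-id": "1", "rule-name": "all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="onprem-to-aurora",
    SourceEndpointArn=source["EndpointArn"],
    TargetEndpointArn=target["EndpointArn"],
    ReplicationInstanceArn=instance["ReplicationInstanceArn"],
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)
```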
Migration Modes
Full load only
Full load followed by CDC
CDC replication only (e.g. if you've already restored a snapshot yourself)
Table Preparation mode - configured for the tasks to perform
Do nothing mode - assumes that the tables already exist in the target and that the task should just dump the data there
Drop tables on target mode - drops and recreates the tables in the target
Truncate mode - data deleted in the target table before migration begins
LOB (Large Objects) - If possible, you want to know what the largest object in the migration will be, so that you can optimize the migration by configuring Max LOB size
Full LOB mode - DMS migrates all LOBs from source to target regardless of size
Limited LOB mode - you set a maximum size LOB that AWS DMS should accept
Sources
On-prem: Oracle, SQL Server, MySQL, MariaDB, PostgreSQL, MongoDB, SAP Sybase ASE, IBM DB2 for LUW
Azure: Azure SQL Database
RDS and S3: Oracle, SQL Server, MySQL, MariaDB, PostgreSQL, Aurora with MySQL or PostgreSQL capability, S3
DocumentDB
Targets
On-prem: Oracle, SQL Server, MySQL, MariaDB, PostgreSQL, SAP Sybase ASE and several others
Redshift, DynamoDB, Kinesis Data Streams, Kafka, ElasticSearch, DocumentDB and Neptune
Data Validation
DMS has support for data validation to ensure your data was migrated accurately.
When enabled, validation begins immediately after a full load is done.
For CDC-enabled tasks, Data Validation feature compares the incremental changes.
During data validation, DMS compares each row in the source against the target, verifies same data is present and reports any mismatches.
There are a ton of data validation settings that we can configure (e.g. SkipLobColumns, FailureMaxCount, etc.)
Various metrics can be monitored in CloudWatch during Data Validation (e.g. The number of validated rows per minute, The number of failed validations per minute, etc.)
DMS creates a table on the target endpoint (awsdms_validation_failures_v1), where it writes diagnostic information about failures. Helpful table in troubleshooting.
Cross-Account Migration
Set up VPC peering
Homogeneous migration - Create an RDS snapshot, share it with the other account and restore it there. Then, enable CDC-only tasks in DMS to replicate ongoing changes.
Heterogeneous migration - Use SCT to convert schema and then create DMS task to perform both the Full Load and ongoing CDC replication
Optimizing DMS Tasks
Slow DMS Tasks
The task instance may be too small, especially in scenarios when we have multiple tasks running on the task instance
The load from the reads on the source db
The load from the writes on the target db. If you have Data Validation enabled, your target will also be hit with read requests
If we need to, we can break our large migration tasks into multiple tasks on multiple instances.
Recommendations
For an RDS DB instance, disable Multi-AZ for the target instance and transaction logging.
Turn off automatic backups
Use Provisioned IOPS if available
Make sure the task is optimized for LOB migration (use Limited LOB mode, which is the default, and ensure you correctly configure it)
DMS Network Issues
Make sure you use security groups that permit traffic on the required ports to and from the replication instance
If your target is in a VPC, but the source is somewhere else (on-prem, another VPC), then the route table for your replication instance needs to have a route to the source
Initial Load of a Schema Fails
Make sure the user account used by the DMS to connect to the source endpoint has the right permissions
FKs and secondary indexes missing
Note that DMS does not create secondary indexes, non PK constraints or data defaults. Use DB-native tools or SCT for that.
CDC stuck after Full Load
DMS settings can conflict with each other, which could cause slow or stuck replication.
If you haven't created PK on the target tables, full scans will occur, which are very resource heavy
SCT
Can convert from OLTP or data warehouse schema
Setup
Download SCT installer
Extract the installer for your OS
Run the installer
Install the jdbc drivers for the source and target DB engines
DB Migration Assessment Report - summarizes all the schema conversion tasks and details the action items for schema that can't be converted to the DB engine of your target instance (e.g. licensing evaluation, feature comparison, assess current hardware, makes recommendations for backups, etc.)
Usage
Create mapping rules in the SCT
Convert the schema using the SCT
Create migration assessment reports
Handle manual conversions in the SCT
Update and refresh the converted schema
Save and apply the converted schema
Data Extraction Agent
Useful in scenarios where the source and target are very different and require additional data transformations.
It's an external program that's integrated with SCT, but performs data transformation elsewhere (such as an EC2 instance or on Snowball Edge)
For very large DB migrations, you can use a SCT replication agent to copy data from your on-premises DB to S3 or Snowball Edge Device. The replication agent works in conjunction with DMS.
You can use an SCT data extraction agent to extract data from Apache Cassandra and migrate it to DynamoDB. The agent runs on an EC2 instance, where it extracts data from Cassandra, writes it to the local file system, and uploads it to an S3 bucket. You can then use AWS SCT to copy the data to DynamoDB.
AWS Workload Qualification Framework (WQF) is a module within SCT that helps us analyze the entire process of migrating customer enterprise infrastructures and makes recommendations on a migration strategy and proper migration tools
Glue
Billed on per second basis for crawlers and ETL jobs
Storage for the first million objects is free
Encryption in transit and at rest supported and can be configured manually
Snowball Edge
Move TBs or PBs of data without paying for network data transfer costs and delays
Use Cases
When network bandwidth is the limiting factor or there is no Internet access
Massive amount of data to be migrated (e.g. 100 TB)
The process
Use SCT to extract data locally and load it to Edge device
Ship the Edge device or devices back to AWS
At AWS, Edge device automatically loads data to an S3 bucket
DMS takes the flat files from S3 and migrates data to the target
Monitoring and Optimization
Trusted Advisor
Real-time guidance on saving money and following best practices.
Cost Optimization
Underutilized EBS volumes
RDS idle DB instances
Underutilized Redshift clusters
ElastiCache Reserved Node optimization
Redshift Reserved Node optimization
RDS Reserved Node Optimization
Fault Tolerance
Helps increase the availability and redundancy of an application, by taking advantage of auto scaling, health checks, multi AZ and backup capabilities
Reports on the existence of backups and checks on the automated backups of RDS DB instances. Backups are enabled with the retention period of 1 day by default.
Checks if encryption is enabled on S3 bucket, if security groups are too wide open, whether MFA is enabled on root account, etc.
You can subscribe to Trusted Advisor's notifications, which are sent out weekly and include a report for your environment over the past 7 days
CloudWatch Application Insights
Powered by ML (SageMaker), lets you monitor for problems with your resources by detecting anomalies and generating dashboards
Monitors .NET and SQL Server applications: App discovery and config, data preprocessing, intelligent problem detection.
Supported DB engines: SQL Server, MySQL and DynamoDB
Advanced Audit Logging
Records database events such as connections, disconnections, tables queried or types of queries on Aurora MySQL DB cluster, RDS MySQL or RDS MariaDB
Audit log files are comma-delimited, and include the following info in rows: timestamp, serverhost, username, host, connectionid, queryid, operation, database, object.
Notable Parameters - you can set these params in the parameter group used by your DB cluster to configure Advanced Auditing
server_audit_logging - enable/disable Advanced Auditing
server_audit_events - specify what events to log
server_audit_excl_users and server_audit_incl_users - specify who gets audited
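A boto3 sketch of setting these parameters in a custom Aurora MySQL cluster parameter group; the group name, event list and user list are placeholders.

```python
# Sketch: enable Advanced Auditing on an Aurora MySQL cluster via its parameter group.
import boto3

rds = boto3.client("rds")

rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="my-aurora-audit-params",   # custom (non-default) group, placeholder name
    Parameters=[
        {"ParameterName": "server_audit_logging",
         "ParameterValue": "1", "ApplyMethod": "immediate"},
        {"ParameterName": "server_audit_events",
         "ParameterValue": "CONNECT,QUERY_DDL,QUERY_DML", "ApplyMethod": "immediate"},
        {"ParameterName": "server_audit_excl_users",
         "ParameterValue": "rdsadmin", "ApplyMethod": "immediate"},
    ],
)
```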
CloudWatch Contributor Insights
Analyzes log data and determines top contributors influencing system performance
Helpful in determining frequently accessed attribute keys in a DynamoDB table
Direct Connect does not encrypt the data in transit and it doesn't offer an option to do it. You have to encrypt it yourself.
Graph
Pros
Easily relate sets of data
Very flexible schema
Great for social networking applications
Cons
Performs poorly on high-volume transactions
Design
Data is broken down into nodes (like records in RDBMS), edges (links that connect the nodes) and properties (related information attached to the nodes).
Services
Neptune
Supports Gremlin and SPARQL (uses RDF data model)
Uses cluster volume, similar to Aurora
Supports Multi-AZ and up to 15 replica
Uses self-healing, fault-tolerant cluster volume
Most efficient method of loading data is by using Neptune's Loader to load data from S3, which can be invoked via HTTP call & supports loading data for both Gremlin & SPARQL (rdf format)
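A sketch of invoking the Neptune bulk loader over HTTP from inside the VPC; the cluster endpoint, S3 prefix and IAM role ARN are placeholders, and the request must originate from a host that can reach the cluster.

```python
# Sketch: kick off a Neptune bulk load from S3 and poll its status.
# Must run from within the VPC (Neptune has no public endpoint). Uses the requests library.
import requests

LOADER = "https://my-neptune.cluster-xxxxxxxx.us-east-1.neptune.amazonaws.com:8182/loader"  # placeholder

resp = requests.post(LOADER, json={
    "source": "s3://my-bucket/graph-data/",                             # placeholder prefix
    "format": "csv",                                                    # Gremlin CSV; use ntriples/turtle/rdfxml for SPARQL
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",   # placeholder role
    "region": "us-east-1",
    "failOnError": "FALSE",
})
load_id = resp.json()["payload"]["loadId"]

# Poll the load status.
status = requests.get(f"{LOADER}/{load_id}").json()
print(status["payload"]["overallStatus"]["status"])
```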
Relational
Pros
Best with structured data
Atomic operations (completes successfully or fails, it's all or nothing)
Cons
Bad for semi-structured or sparse data
Services
RDS
AWS manages failure detection, backups, recovery
Engines
Oracle
PostgreSQL
SQL Server
MariaDB
MySQL
Storage Engines
InnoDB - fully supported in RDS. Point-In-Time restore and snapshot restore features require a recoverable storage engine and are supported for the InnoDB storage engine only
MyISAM - does not support reliable recovery and can result in lost or corrupt data when MySQL is restarted after a recovery
Federated Storage Engine is currently not supported by Amazon RDS for MySQL.
Aurora
Automated backups
Automatically scales in 10GB increments
Uses cluster storage, rather than the EBS volumes. This is one of the key differentiators between Aurora and RDS. Cluster volume is shared between the Primary DB and the Read Replicas, so we are not limited by logical replication.
Supports Multi-Master
Supports up to 15 read replicas
Multi-AZ - standby instance in another AZ
We can convert an existing single-AZ instance to a multi-AZ instance. When we kick off this option, RDS starts to take a snapshot of the primary instance and, when done, copies the snapshot to another AZ, where it starts to spin up a secondary active instance. However, note that there could be some performance hits during the process of taking the snapshot of the primary instance.
Unlike in read replicas where the changes are made asynchronously, with a delay, in Multi-AZ mode, since both instances are considered active, the changes are made synchronously. The data must be replicated to the secondary active instance before the primary instance considers the transaction complete.
Note that Multi-AZ gives us availability, while Read replicas give us scalability.
When a disruption of primary instance happens, RDS automatically promotes the secondary instance to be the primary and sets the endpoint to point to the new primary instance.
Recovery for single-AZ failure can take 4-5mins. Recovery from Multi-AZ failure is typically achieved in 60s or less.
The customer is responsible for managing performance bottlenecks with queries. RDS automates recovery from Storage Failure, Network Interruptions, Loss of Availability (e.g. hardware failure)
You can choose to convert an instance to Multi-AZ either immediately or schedule it to happen during the next maintenance period.
When you enable Multi-AZ, you get charged for 2 instances since they are both active.
Read-replicas - can be promoted to a primary instance as part of the disaster recovery
SQL Server and Oracle don't support manual snapshots
You can have up to 5 read replicas for each database instance
SQL Server requires you to enable multi-AZ with always on availability groups on primary instance. Also, it only supports read replicas in its Enterprise edition.
Oracle also requires you to purchase enterprise edition in order to enable read replicas
Global read replicas are supported (up to 5 per instance), where you can create replicas in different regions. Only SQL Server doesn't support global read replicas.
You can't use a read replica as a source for you cross-region replica
The read replicas communicate with the primary instance via an encrypted channel fully managed by RDS.
How replication is done varies engine-to-engine. MySQL and MariaDB both use Logical replication, while the other engines use physical replication.
Replica lag can be risk in some applications since there is a delay between the primary instance writing data to its db and the time that the same data gets replicated to the secondary read replicas.
Asynchronous replication used to replicate changes from the master to the replicas
For MariaDB, MySQL and Oracle RDS, when a DB is deleted, all the read replicas are promoted. For PostgreSQL, when a DB is deleted, only the read replicas in the same region are promoted, while the cross-region read replicas get a replication status of "terminated".
Backups
Automated
Enabled by default. You can disable them, but it's not advisable. You disable automated backups by setting the retention period to 0
Occur in 30min backup windows
Configurable backup retention period 1-35 days (default is 7 days)
Contain system-wide snapshots (the entire volume, not just the DB) as well as transaction logs. Snapshots are stored in S3.
Automated backups don't capture any information about the parameter or option groups. If you restore an automated snapshot, you have to specify these yourself.
When you delete an instance, by default, automated snapshots also get deleted. You can change this to retain them.
If you want to share a snapshot, you must first copy it to a new Snapshot and then share it
Up to 40 automated backups can be retained per region
Point In Time Recovery (PITR) - restore to any specific time within the configured backup retention period
LRT (Latest Restorable Time) - generally within 5mins of current time since transaction logs are sent to S3 every 5mins.
When sharing snapshots, don't delete the source snapshot before the transfer completes.
Snapshots can be copied to other regions, but this process may be slow.
If a snapshot is encrypted and shared with your account, you cannot directly launch an instance from this snapshot. Instead, you have to copy it and that copy also needs to be encrypted.
Oracle and SQL Server snapshots that use Transparent Data Encryption (TDE) can't be shared.
Manual
Stored in S3 indefinitely. There until we delete them.
First snapshot will take the snapshot of the entire DB. The later ones will be incremental. Therefore, the first snapshot may have a strong hit on I/O latency on single AZ instances
The more data you intend to restore, the longer the restoration process will take.
Allows you to restore to a different storage type
When restoring from manual snapshot, the default parameter and option groups will be automatically set to the same as the ones used by the instance from which the manual snapshot was taken. During restoration, you can override these with different values.
You can share snapshots as-is. You can't share a snapshot with other accounts if it was encrypted with the default KMS key. If you used a customer-managed KMS key (CMK), you can share the snapshot.
Monitoring
Events and Event notifications
SNS notification includes SourceType (such as db-instance) and SourceIdentifier (such as MyExampleDB)
SNS topic get request on failovers, config changes (like parameter groups or option groups), etc. Use SNS notification to trigger Lambda and automate any actions.
RDS Event Notification is a native feature on RDS that enables you to receive notifications on various DB events, such as when the master password has changed.
Enhanced Monitoring
Access to real-time metrics from the underlying OS, including Free Memory. Collection granularity can be adjusted down to a 1 sec refresh, as opposed to the standard CW metrics that refresh only once every minute.
CW retrieves its metric from the hypervisor, rather than the underlying OS. So, our view is limited to the metrics that are exposed to the hypervisor. This is why enhanced monitoring is helpful, because it gives us view into the OS metrics, as well. Enhanced Monitoring Agent, running on the managed EC2 instance hosting the DB collects the OS metrics and sends them to CW.
CW shows us Freeable Memory (how much memory could be freed up, currently used by cache and buffer pools), but in order to see the Free Memory metric, we need Enhanced Monitoring. Important to track in production as we don't want any spills to disk that slow us down.
Performance Insights - Provides helpful insights into workload performance such as db load, top queries, etc. Helpful to teams that don't have a dedicated DBA.
Generates visualizations of metrics
Can be enabled on creation or during the modification process. No downtime, reboot nor failover will be required.
The PI agent is lightweight and consumes limited CPU and memory on the DB host.
If DB load is high, the PI agent collects data less frequently
PI Dashboard - contains visualizations of various load metrics and you can drill in based on a particular wait state, SQL Query, host or user.
Automatically publishes insights to CloudWatch, via metrics like DBLoad, DBLoadCPU, DBLoadNonCPU
PI has its own set of APIs that you can use to retrieve metrics
Important Dashboard Metrics
Counter Metrics - Can be viewed via Performance Insights. Monitor specific performance metrics. Varies depending on which DB engine is used. e.g. Memory, SwapSpace, AbortedConnections, etc.
DB Load - compares DB load (measured in average active sessions i.e. opened connection that send a request to DB and are waiting on a response) to the maximum instance capacity. If AAS is near 1, the DB is fully utilized. If AAS is near 0, the DB is idle.
Top Items - Slice the metrics by top waits, SQL, hosts and users.
Instance Statuses
failed, restore-error and incompatible statuses (there are 3 - network, parameters, restore) can result in severe outages. You are not billed when an instance is in incompatible or failed state.
10 statuses in total
If you get incompatible-parameters state, copy your parameter group, make changes and then apply that new parameter group.
restore-error status indicates that PITR has failed.
Logs
MySQL & MariaDB
Binary logs used to track query activity. Configurable retention using rds_set_configuration stored procedure
Types
Error
General_log - granular record of all activity in the DB
Slow_query - capture any query that takes longer than, say, 5s to execute
Audit - requires MariaDB Audit plugin
Logs Output
FILE
Required setting if you want to interact with the logs via console
by default, logs older than 24 hours are deleted
If the size of the logs is greater than 2% of all storage, then RDS will continually remove the oldest logs until they take up less than 2%.
TABLE
stored and viewable in the engine
records older than 24hrs are moved to the backup table
The rotation occurs if the size of the logs is more than 20% of the storage or if the size exceeds 10GB.
PostgreSQL
postgres engine combines most of the logs into the postgres.log file
a feature to export postgres logs to CW logs
The logs are centralized in the postgres log. To enable additional logging, you just need to make changes in the parameter group
log_statement - defines what query activity we want to log in the DB
log_min_duration - threshold for when the query is logged.
log_retention_period - configured to retain the logs for up to seven days. Measured in mins. Be careful with how much you log as it could impact DB's I/O.
Oracle
Types
Audit
Alert
Trace - diagnose or resolve operational issues
Listener - track connections made to the DB
Supplemental Logging and Online Log Files - might be required when using log miner or when you want to track changes across multiple DBs.
Force Logging - used to log all changes except those in temporary tablespaces
Most Oracle logs are retained for a default of 7 days
SQL Server
Types
Error
Trace
Agent
Dump
Retained for 7 days
Error and Agent logs can be exported to CloudWatch Logs and then viewed from the RDS Console
Can be viewed via console, CLI/API or via stored procedure executed on the DB engine
Security
Access resource using Roles
Assume Role - User assuming cross-account role must be identified as a principal in the role's trust policy
STS GetFederationToken used to get temporary credentials
Service-Linked Roles - RDS uses AWSServiceRoleForRDS role that allows it to call other AWS services. It's created for you when you create a DB instance.
AWS Managed Policies
AmazonRDSReadOnlyAccess
AmazonRDSFullAccess
External Authentication
Support for Kerberos and Microsoft AD
Supported only by MySQL, SQL Server, Oracle and PostgreSQL
You can use AWS Directory Service for Microsoft AD or your on-prem AD. You can connect your existing AD domain to your RDS SQL Server DB using AWS Directory Service.
IAM Authentication
Supported by only MySQL and PostgreSQL
Can be enabled during instance creation or modification
MySQL limited to 200 new connections per second
PostgreSQL does not have limits, you just have to set SSL parameter to 1
Use a generated authentication token. The tokens expire after 15mins
In order for IAM Authentication to work, an IAM policy allowing the rds-db:connect action needs to be attached to the user, and a matching user needs to be created in the DB.
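A hedged sketch of connecting to RDS MySQL with an IAM authentication token instead of a password; the host, user, region and CA bundle path are placeholders, and the DB user is assumed to have been created for IAM authentication.

```python
# Sketch: connect to an RDS MySQL instance using IAM database authentication.
import boto3
import pymysql

HOST = "mydb.xxxxxxxx.us-east-1.rds.amazonaws.com"   # placeholder endpoint
USER = "iam_app_user"                                # placeholder DB user created for IAM auth

rds = boto3.client("rds", region_name="us-east-1")

# Generate a short-lived (15 min) authentication token to use as the password.
token = rds.generate_db_auth_token(DBHostname=HOST, Port=3306, DBUsername=USER)

conn = pymysql.connect(
    host=HOST,
    user=USER,
    password=token,
    ssl={"ca": "/path/to/rds-ca-bundle.pem"},        # SSL is required for IAM auth; placeholder path
)
```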
KMS
You can't enable or disable encryption for a preexisting DB instance. Encryption must be enabled at the time of creation.
Can't create an encrypted read replica if the source is not encrypted. Both the primary and all replicas must either be encrypted or unencrypted.
You can't restore an encrypted snapshot to the unencrypted DB instance.
Keys are region-specific. If you want to restore an encrypted snapshot from one region to another, you must have a valid KMS key in that other region that can be used.
SSL/TLS Support
Set rds.force_ssl parameter to 1 to force all users to use SSL in order to connect to the DB.
To get a root certificate for all AWS Regions, download it from
https://s3.amazonaws.com/rds-downloads/rds-ca-2019-root.pem
RDS Subnet Groups
Grouping of subnets from multiple AZs. Logical grouping that we create in our VPC.
Subnet group must contain at least 2 AZs
Cannot be a mix of public and private. They must all be public or all private.
You can modify the VPC of the database instance by modifying the subnet group.
Security Groups - 3 types of security groups are used in RDS: VPC Security Groups, DB Security Groups and EC2-Classic security groups. Only the first one is relevant these days as the latter 2 only apply to EC2-Classic
Managing RDS resources
Modifying storage
When storage optimization begins, the DB instance is placed in storage-optimization status. After it completes, you cannot do any other storage modifications for 6 hours.
When you are adding storage, you have to add at least 10% of the existing storage
You can't reduce the amount of storage using this method because RDS service doesn't know how the data is spread across the disk. So, to reduce the volume, you'd have to migrate to a DB instance with smaller volumes.
Storage mods that result in outage:
Any change of volume types from GP2 and IO1 to Magnetic or from Magnetic to GP2 and IO1
Change from IO1 to GP2 or GP2 to IO1 in a single-AZ DB instance, but only if custom parameter group was used.
No outage happens when you have Multi-AZ
You can use RDS autoscaling to automatically scale your storage up. Note that you cannot scale down. Also, you should configure the maximum storage to which your db will scale to avoid having it scale to astronomical sizes.
Instance Modifications
Changes to instance class using "Apply Immediately" option result in outage
Without Apply Immediately option, the changes are applied during the next maintenance window
In Multi-AZ, the instance change is first performed on the standby instance, drastically reducing downtime. After the standby mod is complete, RDS performs a failover to promote the standby to Primary. Then it applies the mods to the original Primary instance.
Changing Public setting is done immediately and does not result in any downtime
Parameter Groups
Controls engine config values
Static and dynamic parameters
Cannot modify default parameter group. You have to create a custom parameter group and associate it with the instance.
When you modify your instance to apply a new Parameter Group, the Parameter Group status will change to "pending-reboot" status. Know that this reboot will not automatically happen during the next scheduled maintenance window. You have to do the reboot manually yourself.
Option Groups
Used to enable additional features, such as Maria DB audit plugin for MySQL
Persistent options can't be removed from an option group while DB instances are associated with the option group.
Links to VPC rather than DB instance. Cannot be migrated between VPCs.
Note that some extra features are not available in Options Groups. For instance, Option Groups do not support SQL Server Reporting Services feature. If you want to use this feature, you have to host SQL Server on an EC2 instance.
Permanent options, such as the TDE option for Oracle Advanced Security TDE, can never be removed from an option group.
Maintenance
Status Types
Required - must be applied. You can defer it, but not indefinitely since the RDS determined that the update is necessary.
Available - update available, but not required to be applied immediately. You could also do nothing.
Next Window - Upcoming maintenance will take place in the next window.
In Progress - maintenance tasks are already being applied to the resource
Maintenance Windows
A 30min window in an 8-hour time block.
If the maintenance tasks take longer than 30mins, they will still run until they finish (there is no hard cutoff).
The 8hr block varies by region. If you don't specify a 30min window, it is selected at random.
DB Engine Maintenance will result in downtime, even if you have Multi-AZ, since the change has to be applied to both instances at the same time. However, with Hardware and OS maintenance, Multi-AZ will help since those changes can be applied to standby instance first.
Minor Version Engine Upgrades - backward compatible changes that can be automatically applied if you enable automatic version upgrades feature.
Major Version Engine Upgrades - typically not backward compatible. Have to apply it via instance modification process. Per best practice, always take a backup before applying any major version upgrades.
Troubleshooting
Parameter Changes Not Taking Effect - verify that the status of the instance is "pending-reboot". Perform the reboot of the resource via CLI and verify that the parameter group change has taken effect.
Instance in Storage-Full State - Anticipate the increased requirements of storage (e.g. upcoming migration) and perform a database instance modification to increase storage allocation ahead of time. You can also configure a CW Alarm to monitor the FreeStorageSpace metric.
Replication is Stopped - It can happen if the master and the replicas are using different parameter and option groups (e.g. you could see max_allowed_packet error). To fix this, make sure you set the same values in both master and replica. If the replication has lagged a lot, just create a new read replica and take down the first one.
Insufficient Capacity Error - select a different DB instance class or try to create the resource in another AZ.
Too Many Connections - Increase max_connections parameter value
Pricing Models in RDS
DB Instance Pricing (instance type and size)
Storage Type
General purpose (gp2) - max IOPS is 16,000
Provisioned-IOPS (io1) - max IOPS is 64,000
Magnetic - super cheap AND super slow
Usage Type
On-Demand
Reserved
DB Instance Storage
I/O (Magnetic only)
Backup Storage - Billed on automated and manual backup snapshots. Billed per GB.
Data Transfer
Tiered based on the amount transferred
If you copy DB snapshots across regions, it will cost you.
Traffic within the AZ is free
Traffic between RDS and EC2 in different AZ accrues EC2 regional data transfer fees.
Traffic replicated across AZs for Multi-AZ is free
Licensing - customers already using Oracle, can use their license on RDS
Reserved Instance pricing - license included or BYOL
On-demand pricing - only BYOL supported
You can stop your RDS DB instance for up to 7 days to save on cost. After several days, the DB instance is automatically started to perform maintenance. You can stop your RDS instance in either single AZ or Multi-AZ. One limitation is SQL Server, which can be stopped only when in Multi-AZ.
Protection against the man in the middle attack
Aurora MySQL: ssl-mode=require only enforces encryption. Set to "verify-full", to force encryption and verify the certificate
Aurora PostgreSQL: Set ssl-mode to "verify-full", which forces encryption and verifies the certificate. Note that ssl-mode=require would only enforce encryption
SQLServer: encrypt=true; trustServerCertificate=false
Oracle: Set ssl_server_dn_match to true
RDS Proxy - allow your applications to pool and share database connections to improve their ability to scale. RDS Proxy makes applications more resilient to database failures by automatically connecting to a standby DB instance while preserving application connections. RDS Proxy also enables you to enforce AWS Identity and Access Management (IAM) authentication for databases, and securely store credentials in AWS Secrets Manager. Fully compatible with only MySQL and PostgreSQL.
You can load XML data from S3 to your table in RDS by running LOAD XML FROM S3 SQL statement
Redshift
Columnar type
Leader Node vs. Worker Node
Data is stored in slices
Dense Compute vs. Dense Storage vs. RA3
EVEN vs. KEY vs. ALL vs. AUTO distribution style
Audit logging is not turned on by default in Redshift. When you turn it on, the logs are stored in S3.
Key-Value
Cons
You have to know the access pattern from the beginning as it will have a huge impact on your design and experience
It is not great for analytics
Pros
Low latency
Extremely scalable
High throughput
Flexible schema
Great for storing sessions of web applications, handling surge in orders with great scale and very fast response times
DynamoDB
DynamoDB Architecture
Tables are of unlimited size. Table names must be unique per region. Case sensitive, can include _, - and . symbols
Partition Key - Primary Key. Select a column that has many different values, ideally evenly distributed. Also called hash key.
Composite Primary Key: a combination of partition key and sort key. The sort key is also called the range key
Each item can be up to 400 KB in size. Nested values in an item can be 32 levels deep.
Data Types
Scalar - number, string, binary, boolean and null. Apps must encode binary values in base64-encoded format before sending them to DynamoDB
Document - complex structure with nested attributes (e.g. json)
List - ordered collection of values (don't have to be of the same data type)
Map - unordered collection of name-value pairs (similar to json)
Set - multiple scalar values of the same type - string set, number set, binary set
Item Collection - any group of items that have the same partition key value in a table and all of its local secondary indexes. The maximum size of any item collection is 10 GB.
Performance
On-Demand Capacity
Good for
New tables with unknown workloads
Applications with unpredictable traffic
Prefer to pay as you go
Characteristics - scales with demand
Provisioned Capacity
Good for
Applications with predictable traffic
Applications whose traffic is consistent or ramps gradually
Capacity requirements can be forecasted, helping to control costs
Characteristics
consistent and predictable performance
Specify RCUs and WCUs
Cheaper per request than On-Demand mode
If you exceed the provisioned throughput, your table starts to get throttled (apps will start to get exceptions)
Limit for both capacity modes is 40k RCUs and 40k WCUs. You can switch between modes only once per 24 hours.
You can use Parallel Scans, where a multithreaded application sends parallel scan requests, with each thread specifying the Segment and TotalSegments arguments
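A short sketch of such a parallel scan using threads; the table name and segment count are placeholders.

```python
# Sketch: scan a DynamoDB table in parallel using Segment / TotalSegments.
import threading
import boto3

TABLE = "Orders"        # placeholder table
TOTAL_SEGMENTS = 4      # one thread per segment

def scan_segment(segment: int) -> None:
    table = boto3.resource("dynamodb").Table(TABLE)
    kwargs = {"Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
    while True:
        page = table.scan(**kwargs)
        print(f"segment {segment}: {page['Count']} items")
        if "LastEvaluatedKey" not in page:                          # finished this segment
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]      # paginate past the 1 MB limit

threads = [threading.Thread(target=scan_segment, args=(s,)) for s in range(TOTAL_SEGMENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```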
Note that GetItem will always be faster than Query since it takes us straight to the physical partition that has the PK that we want to pull. The most performant method when we want to pull a specific PK.
Overview
DynamoDB Streams for Real-time data processing
Storage autoscaling - set min and max provisioned capacity
DAX - in-memory caching feature
Supports Multi-region Multi-master mode (Global tables)
Metrics
ConsumedReadCapacityUnits / ConsumedWriteCapacityUnits - number of units consumed over a given time period. Use these to see how much of your provisioned capacity has been used.
ProvisionedReadCapacityUnits / ProvisionedWriteCapacityUnits - The total number of provisioned units for a table or an index.
ReadThrottleEvents - read requests to the DynamoDB table that exceed the provisioned read capacity units
SuccessfulRequestLatency - the lapse time for a successful request and the number of successful requests.
SystemErrors - requests to DDB that generate the HTTP 500 error code
ThrottledRequests - any event within the request exceeds the provisioned throughput limit
UserErrors - requests to DDB that result in HTTP 400 error
WriteThrottleEvents - when write throughput capacity exceeded for a table or an index
Alarms
States
INSUFFICIENT_DATA
ALARM
OK
Key Components
Metric
Threshold
Period
Action
Errors
Components
HTTP Status code
Exception name
Error Message
Exponential Backoff - built into the sdk and automatically applied
Batch operations - tolerate single record failures within the batch. The most common error is due to throttling. Any throttled records within the batch are retried using exponential backoff
When provision throughput is exceeded and the requests are throttled, the client will receive a 400-level HTTP code
Partition Keys
Used as input to DDB's internal hash function that selects a partition
Each partition holds up to 10GB of data and gets 3,000 RCU or 1,000 WCU
Consistent Hashing Algorithm is used to determine which partition should go to which node
Sort key cannot be used on its own. We always have to use it in conjunction with the partition key.
RCUs & WCUs
One RCU represents one strongly consistent read request per second, or 2 eventually consistent read requests, for an item up to 4KB in size. Transactional read requests require 2 RCUs for items up to 4KB in size. Eventually consistent model is the implied default method if "strong" is not specifically mentioned.
Even if you read a very small item, you are still consuming at least 1 RCU.
One WCU represents one write per second for an item up to 1 KB in size. Transactional write requests require 2 WCUs for items up to 1KB in size.
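A small worked example of the capacity math described above, using arbitrary item sizes.

```python
# Worked example: capacity units needed per request for a given item size.
import math

def rcu_per_read(item_size_kb: float, consistency: str = "eventual") -> float:
    units = math.ceil(item_size_kb / 4)              # reads are billed in 4 KB chunks
    if consistency == "eventual":
        return units * 0.5                           # 2 eventually consistent reads per RCU
    if consistency == "strong":
        return units * 1.0
    if consistency == "transactional":
        return units * 2.0
    raise ValueError(consistency)

def wcu_per_write(item_size_kb: float, transactional: bool = False) -> float:
    units = math.ceil(item_size_kb / 1)              # writes are billed in 1 KB chunks
    return units * (2.0 if transactional else 1.0)

print(rcu_per_read(7.5, "strong"))        # 2 RCUs  (7.5 KB rounds up to two 4 KB chunks)
print(rcu_per_read(7.5, "eventual"))      # 1 RCU
print(wcu_per_write(2.5))                 # 3 WCUs  (2.5 KB rounds up to three 1 KB chunks)
```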
Scan vs. Query
Scan
Returns all items and attributes in a given table
Filtering results does not reduce the RCU consumption, it simply discards non-matching items after the whole table has already been scanned
Eventually consistent by default, but you can request strongly consistent scans with the ConsistentRead parameter
Use "limit" parameter to limit the number of items scanned and reduce the capacity required
A single request returns results that fit within 1MB, but we can use pagination to retrieve more than 1MB
Query
Finds items based on a PK value (required attribute) and returns all items with that PK value. An SK condition is optional and can be added to narrow the results.
Query limited to PK, PK + SK, or secondary indexes
Filtering results does not reduce RCU consumption (same concept as in filters applied on scans)
Eventually consistent by default
Querying a partition only scans that one partition
Limit and pagination are used in the same way as with scans
BatchOperations - improve the performance and cost of the read and write requests
BatchGetItem
Returns attributes for multiple items from multiple tables
Request using PK
Returns up to 16MB of data, up to 100 items
Get unprocessed items exceeding limits via UnprocessedKeys
Retrieves items in parallel to minimize latency
BatchWriteItems
Writes up to 16MB of data, up to 25 put or delete requests
Get unprocessed items exceeding limits via UnprocessedItems
Conditions are not supported for performance reasons
Threading may be used to write items in parallel
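A boto3 sketch of both batch operations; the table and key names are placeholders, and unprocessed keys should be retried with backoff as noted above.

```python
# Sketch: BatchWriteItem (via batch_writer) and BatchGetItem with boto3.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Orders")                      # placeholder table with "order_id" as PK

# batch_writer buffers puts/deletes into BatchWriteItem calls (max 25 per request)
# and automatically retries unprocessed items for us.
with table.batch_writer() as batch:
    for i in range(100):
        batch.put_item(Item={"order_id": f"order-{i}", "status": "NEW"})

# BatchGetItem: up to 100 keys / 16 MB per request; items are fetched in parallel.
resp = dynamodb.batch_get_item(
    RequestItems={"Orders": {"Keys": [{"order_id": f"order-{i}"} for i in range(10)]}}
)
items = resp["Responses"]["Orders"]

# Anything the service could not process must be retried (ideally with backoff).
unprocessed = resp.get("UnprocessedKeys", {})
```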
Provisioned vs. On-demand Capacity
Provisioned
Need to specify minimum and maximum capacity
Subject to throttling
Autoscaling available
Lower cost per API call
The first 25GB of storage is free
On-demand
Idle tables not charged for read/write, but only for storage and backups
No need to plan and specify capacity
use this mode for new product launches and then convert to provisioned capacity once your application reaches predictable steady state.
Even though you should not see throttling in on-demand mode, if you double your previous peak of traffic within 30mins, you may still see throttling even in on-demand mode, as dynamodb attempts to catch up with the requests.
You can switch between Provisioned and On-demand mode only once every 24 hours
Indexes - a subset of attributes from a table, along with alternate key to be used in queries. A table can have multiple secondary indexes.
Global Secondary Index (GSI) - An index with a partition key and a sort key that can be different from those on the base table. The primary key of a GSI can be either simple (partition key only) or composite (PK + SK). You can create GSIs any time you'd like.
Local Secondary Index (LSI) - An index that has the same partition key as the base table, but a different sort key. The primary key of a LSI must be composite (PK and SK). Must be created at the table creation. A table with local secondary indexes can store any number of items, as long as the total size for any one partition key value does not exceed 10 GB.
Index Projections - a set of attributes copied from a table into a secondary index. The PK and SK are always projected into the index.
ALL - All the table attributes are projected into the index
KEYS_ONLY - Only the index and primary keys are projected onto the index
INCLUDE - Only the specified table attributes are projected into the index
Sparse Indexes - If the SK doesn't appear in every item in a table, the index is considered to be sparse since DDB writes to the index only items that have the SK present. GSIs are sparse by default since SK is optional to be specified when GSI is created. By creating sparse indexes, you can provision GSIs with lower write throughput than that of the parent table, which can save a lot of money that would otherwise be spent on storage (for no additional benefits).
In general, you should use GSIs rather than LSIs. The exception is when you need strong consistency, which only a LSI can provide (GSI queries only support eventual consistency)
Backups
Point In Time Recovery (PITR)
Continuous backup that protects you from the accidental updates or deletes
You can return to any point in time in the past 35 days.
DDB maintains incremental backups of your table
The latest restorable timestamp is typically 5mins in the past
Not enabled by default
After restoring a table, you manually have to set up auto scaling policies, IAM policies, Cloudwatch metrics and alarms, Tags, Streams settings, TTL settings and PITR settings. None of this is automatically set up by DDB for you, so you have to do it.
On Demand
Takes a full snapshot of the complete table
Can be restored at any time without any impact to the table performance
Consistent within seconds across thousands of partitions and retained until manually deleted.
A good fit for archival and compliance purposes
Operate within the same region as the source table
Transactions
perform atomic writes and isolated reads across multiple items and tables
TransactWriteItems
transact up to 25 items or 4 MB
Evaluate conditions and if all conditions are simultaneously true, perform write operations
TransactGetItems
Transact up to 25 items or 4MB
return a consistent, isolated snapshot of all items
In terms of capacity throughput, DDB performs 2 underlying reads and writes of every item in the transaction - one to prepare the transaction and one to commit it.
Apply transactions to items in
Same PK or across PKs
Same table or across tables
Same region only
Same account only
DDB tables only (e.g. can't transact with DDB and RDS SQL DB)
Scale like the rest of DDB
Reasons for transactions failure: Precondition failure, insufficient capacity, transactional conflicts, transaction still in progress, service error, malformed request, permissions.
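A sketch of TransactWriteItems performing a conditional transfer between two items; the table, keys and amounts are placeholders.

```python
# Sketch: atomic, conditional transfer across two items with TransactWriteItems.
import boto3
from botocore.exceptions import ClientError

ddb = boto3.client("dynamodb")

try:
    ddb.transact_write_items(TransactItems=[
        {   # debit account A only if it has enough balance
            "Update": {
                "TableName": "Accounts",                       # placeholder table
                "Key": {"account_id": {"S": "acct-A"}},
                "UpdateExpression": "SET balance = balance - :amt",
                "ConditionExpression": "balance >= :amt",
                "ExpressionAttributeValues": {":amt": {"N": "100"}},
            }
        },
        {   # credit account B in the same transaction
            "Update": {
                "TableName": "Accounts",
                "Key": {"account_id": {"S": "acct-B"}},
                "UpdateExpression": "SET balance = balance + :amt",
                "ExpressionAttributeValues": {":amt": {"N": "100"}},
            }
        },
    ])
except ClientError as e:
    # A TransactionCanceledException carries per-item cancellation reasons
    # (condition failure, throttling, conflict, etc.).
    print(e.response["Error"]["Code"])
```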
TTL
No cost of using TTL
Does not consume provisioned throughput
Use cases: session data, Events logs, etc.
Expired items are typically deleted within 48hrs of expiration
Items are removed from LSI and GSI automatically using an eventually consistent delete operation
Due to the difference in time between expiration and deletion, you may sometimes get expired items. Use a filter expression to return only items where TTL expiration > current_time
The TTL value must be specified as a number and in unix epoch format
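A sketch that enables TTL on a table and filters out items that have expired but not yet been deleted; the table and attribute names are placeholders.

```python
# Sketch: enable TTL and filter out already-expired (but not yet deleted) items.
import time
import boto3
from boto3.dynamodb.conditions import Key, Attr

client = boto3.client("dynamodb")
client.update_time_to_live(
    TableName="Sessions",                                            # placeholder table
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

table = boto3.resource("dynamodb").Table("Sessions")
now = int(time.time())

# Write an item that expires in one hour (TTL values are epoch seconds).
table.put_item(Item={"session_id": "abc123", "user": "alice", "expires_at": now + 3600})

# Expired items can linger for up to ~48h, so exclude them explicitly on read.
resp = table.query(
    KeyConditionExpression=Key("session_id").eq("abc123"),
    FilterExpression=Attr("expires_at").gt(now),
)
```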
DAX
Fully managed, highly available, in-memory cache for DDB that delivers up to 10x improvement
Response reduced from ms to microseconds
Runs within VPC with no exposure to the Internet
DAX Cluster is provisioned in multiple AZs for high availability. When a primary node fails, DAX fails over to a read replica and promotes it to a new primary.
API-compatible with DDB. Exposes the same data-plane APIs, such as GetItem, BatchGetItem, etc. Control plane APIs are not exposed (e.g. CreateTable, DeleteTable, etc.)
Two types of cache available - an item cache and a query cache
Global Tables
You can convert your existing DynamoDB tables to global tables or specify your table to be a global table when you are creating it.
Replication latency under 1s
Encryption
In transit - HTTPS
At rest - KMS used to encrypt all table data including PK, LSIs and GSIs, Streams, Global Tables, Backups and DAX Clusters
When creating a new table, you can either choose an AWS owned CMK (a key owned by DDB, which is the default option and has no charges), or an AWS managed CMK (a key stored in your AWS account and managed by KMS)
DDB Encryption Client - Another encryption option where you can use Java and Python libraries that let you encrypt your data prior to sending it to DDB.
DynamoDB does not have an operation equivalent to TRUNCATE TABLE in SQL. In order to truncate a DynamoDB table, you have to perform a scan to read all the PKs and then issue delete commands. Use BatchWriteItem with delete requests to optimize the deletes.
Auditing DynamoDB table
CloudTrail - captures only control-plane API requests (used to manage the tables)
For data-plane operations (e.g. GetItem, PutItem, etc.), use DynamoDB Streams as a trigger for Lambda to have it synchronously apply your logic to the changed records
Document
Pros
Great for json-like documents
Flexible schema, where document models can change frequently to meet the application's needs over time
Great for blog platforms, storing catalog info, etc.
Cons
not a good fit for highly relational data
Services
DocumentDB
MongoDB compatible
Query json documents and aggregate data across many json documents
Leverages cluster-style architecture, similar to Neptune
Fully managed and supports PITR
Storage autoscales in increments of 10GB
Common Commands
Create a user: db.createUser({...})
Assign a role to the user: db.grantRolesToUser({...})
Revoke a role: db.revokeRolesFromUser({...})
If the primary instance exhibits high CPU utilization, but the replica instances don't, distribute read traffic across replicas via client read preference settings (for example, secondaryPreferred)
Use DocumentDB Profiler to examine execution times of operations performed on the cluster
currentOp MongoDB command can be used to list all queries that are currently executing or are blocked
In-Memory
Pros
Extremely fast, with microsecond response time. Data not stored on disk, but fully managed in memory.
Great as caching mechanisms
Cons
Not a good fit for use cases where the data needs to be persisted over a longer period of time.
Services
Elasticache
A node is a fixed-size chunk of network-attached RAM and it's the smallest building block of ElastiCache.
Each node runs Memcached or Redis and has its own DNS name and port. You can select the node type with varying memory.
Redis shard is a subset of the cluster's keyspace, that can include a primary node and zero or more read replicas (similar concept to RDS read replicas)
The shards add up to form a cluster
Memcached
Designed for basic use cases where cache is required (it's simpler than Redis)
Automatic detection and recovery from cache node failures
Automatic discovery of nodes within a cluster
Flexible AZ placement of nodes and clusters
Integration with other AWS services
Redis
Automatic recovery from cache node failures
Multi-AZ with automatic failover from a failed primary to a read replica
Supports partitioning your data across up to 90 shards
More complex and feature-packed than Memcached
Global Datastore for Redis feature enables read replicas to be created across multiple regions
Redis Sorted Sets feature moves the computational complexity of leaderboards from your app to your Redis cluster. Each time a new element is added to the sorted set, it's reranked in real time.
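A small redis-py sketch of a leaderboard backed by a sorted set; the cluster endpoint is a placeholder.

```python
# Sketch: leaderboard with a Redis sorted set (ElastiCache for Redis).
import redis

r = redis.Redis(
    host="my-redis.xxxxxx.ng.0001.use1.cache.amazonaws.com",  # placeholder primary endpoint
    port=6379,
)

# Scores are re-ranked by Redis on every update.
r.zadd("leaderboard", {"alice": 1200, "bob": 950, "carol": 1750})
r.zincrby("leaderboard", 300, "bob")          # bob scores 300 more points

# Top 3 players, highest score first.
print(r.zrevrange("leaderboard", 0, 2, withscores=True))
```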
Replication Mode
Cluster Mode Disabled - always has a single shard with up to 5 read replica nodes. Data not partitioned. Scaling approach. Needs large instances as nodes. A good solution when you have a read-heavy app performing frequent reads and having an unpredictable load.
Cluster Mode Enabled - it can have up to 250 shards with 1 to 5 read replica nodes in each shard. Data is partitioned. Partitioning approach. A good solution when you have a write-heavy application performing frequent writes
You can create a cluster with higher number of shards and lower number of replicas totaling up to 90 nodes per cluster. This cluster configuration can range from 90 shards and 0 replicas to 15 shards and 5 replicas, which is the maximum number of replicas allowed.
The replication structure is contained within a shard (called node group in the API/CLI) which is contained within a Redis cluster.
Custom Parameter Group can be created and assigned to the cluster. For instance, if you want to increase the amount of reserved memory on your cluster, create a custom parameter group with the reserved-memory-percent parameter set to 50 and apply it to the cluster.
To alleviate the memory load, you can create backups from a read replica
Caching strategies
Lazy loading - loads data into the cache only when necessary (only on cache miss)
Write-through - adds data or updates data in the cache whenever data is written to the database.
Adding TTL - an integer value that specifies the number of seconds until the key expires.
To manually promote a read-replica to a primary node in cluster-mode disabled ElastiCache cluster, Multi-AZ with Automatic Failover must be disabled first
Ledger
Pros
Highly scalable
Data stored in immutable form (once written, no changes are allowed)
Transparent in showing the states of data
Verifiable source of data, where all the aspects of data can easily be validated (e.g. insurance claims, finance applications, etc.)
Cons
Services
QLDB
Immutable and verifiable history (can't be deleted or modified)
Serverless, auto-scales
Amazon Ion document model is supported and the documents stored in Amazon Ion format can be queried using a SQL-like language called PartiQL.
Just like in Aurora, data is replicated 6 times, across 3 different AZs
Time Series
Pros
Great for time-ordered data
Great for monitoring, storing large amounts of sensor data, etc.
Cons
Supports only inserts since the data is expected to arrive ordered by time
Not a good fit for storing data that is not ordered by time