HA, DR and Scalability
Launch Templates
- Specifies all of the needed settings that go into building out an EC2 instance.
- It is a collection of settings that you can configure so you do not have to walk through the EC2 wizard over and over
- It includes: AMI, EC2 Instance Type, SG, user data, potentially network
- If we include network information, we cannot use the Launch Template with an Auto Scaling group
Create a template:
- from scratch
- new version of an existing template
- copy the parameters from a launch configuration, running instance, or other template
- you can configure the same parameters you set at EC2 launch
Auto Scaling Groups
Key Elements
ELB Configuration: EC2 instances can be registered with an ELB (existing or created with the Auto Scaling group). The Auto Scaling group can be set to use the ELB health check to terminate/replace unhealthy instances; you need to request this explicitly
Set Scaling Policies: Minimum, Maximum and Desired capacity need to be set to ensure you do not have too few/too many instances
Auto Scaling Policies
Type of Scaling
Dynamic Scaling: measures load and determines if more capacity is needed (reactive). The following policies are supported:
- Target tracking scaling
- Step Scaling
- Simple Scaling
Step Scaling
Warm-Up (stops instances from being placed behind the ELB, failing the health check, and getting terminated). During warm-up, scale-in is blocked. A warming-up instance is not counted toward the aggregated metrics of the ASG
Increase or decrease the current capacity of the group (by an absolute number, by a percentage, or by setting an exact size) based on a set of scaling adjustments, known as step adjustments, that vary based on the size of the alarm breach.
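The step-adjustment lookup can be sketched in Python; the breach bands and adjustment sizes below are illustrative examples, not AWS defaults:

```python
# Illustrative step adjustments for a scale-out policy on a CPU alarm
# with threshold 70%: each tuple is (lower_bound, upper_bound, adjustment),
# where the bounds are relative to the alarm threshold.
STEP_ADJUSTMENTS = [
    (0, 10, 1),    # 70% <= CPU < 80%  -> add 1 instance
    (10, 20, 2),   # 80% <= CPU < 90%  -> add 2 instances
    (20, None, 3), # CPU >= 90%        -> add 3 instances
]

def step_scale(current_capacity, metric, threshold,
               steps=STEP_ADJUSTMENTS, max_size=10):
    """Pick the step adjustment whose bounds contain the alarm breach."""
    breach = metric - threshold
    if breach < 0:
        return current_capacity  # alarm not breached, no scaling
    for lower, upper, adjustment in steps:
        if breach >= lower and (upper is None or breach < upper):
            return min(current_capacity + adjustment, max_size)
    return current_capacity
```

For example, with a 70% threshold a CPU reading of 85% is a breach of 15, falling in the second band, so two instances are added.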
Simple Scaling
Cooldown pauses Auto Scaling for a set amount of time. Helps to avoid runaway scaling events. Default is 5 minutes
Increase and decrease the current capacity of the group based on a single scaling adjustment, with a cooldown period between each scaling activity.
Target Tracking Scaling
Increase and decrease the current capacity of the group based on an Amazon CloudWatch metric and a target value. It works similar to the way that your thermostat maintains the temperature of your home—you select a temperature and the thermostat does the rest.
Metrics that decrease when capacity increases (and vice versa) can be used to proportionally scale the number of instances out or in using target tracking (e.g. a custom metric .. per instance)
Includes Warm-up parameters. You can set default instance warmup to avoid setting it in the scaling policy (target tracking, step and instance refresh)
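The thermostat intuition can be sketched as a proportional rule in Python (a simplification of what the service does; the function name and bounds are illustrative):

```python
import math

def target_tracking_desired(current_capacity, metric_value, target_value,
                            min_size=1, max_size=10):
    """Proportional rule: if the metric is above target, capacity grows in
    roughly the same ratio; if below target, it shrinks. The result is
    clamped to the group's min/max size."""
    desired = math.ceil(current_capacity * metric_value / target_value)
    return max(min_size, min(desired, max_size))
```

With a target of 50 and a current metric of 80 across 4 instances, the rule asks for ceil(4 * 80/50) = 7 instances.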
Scheduled Scaling: if you have a predictable workload, create a scaling event to get your resources ready before they are actually needed.
- Min/Desired/Max
- Recurrence (once/cron/every...)
- Start Date Time
- (optional) End Date Time
- Predictive Scaling
- AI/ML to determine when you need to scale
- This is re-evaluated every 24 hours to create a forecast for the next 48 hours
- Needs at least 2 weeks of data
Manual Scaling: At any time, you can change the size of an existing Auto Scaling group manually. You can either update the desired capacity of the Auto Scaling group, or update the instances that are attached to it. Automatic scaling is not always needed
Steady state Auto Scaling Group: this is when Max, Min and Desired are all set to 1; use it when you cannot have multiple copies of an EC2 instance but still need HA
- Bear in mind cost considerations when dealing with Auto Scaling groups
- You will be given scenarios where you'll need to know the cost implications and reasons why you might not want to change Auto Scaling group parameters
Best Practices
Keep an eye on provisioning time, configure AMIs to minimize it
Lifecycle Hooks
Let you create solutions that are aware of events in the Auto Scaling instance lifecycle, and then perform a custom action on instances when the corresponding lifecycle event occurs
Use Cases
- in a scale-in event, pause the instance termination for a certain amount of time to allow the EC2 instance to upload all data/logs before it is completely terminated
- control when instances are registered with Elastic Load Balancing. By adding a launch lifecycle hook to your Auto Scaling group, you can ensure that your bootstrap scripts have completed successfully and the applications on the instances are ready to accept traffic before they are registered to the load balancer at the end of the lifecycle hook
- Complete the lifecycle action with result = CONTINUE to finish before the timeout expires. If you don't complete the lifecycle action, the hook goes to the status specified for Default result after the timeout period ends
- use aws autoscaling complete-lifecycle-action --lifecycle-action-result CONTINUE to either manually or automatically complete the lifecycle action
- By default, when you add a lifecycle hook in the console, Amazon EC2 Auto Scaling sends lifecycle event notifications to Amazon EventBridge
- Using EventBridge or a user data script is a recommended best practice
- To create a lifecycle hook that sends notifications directly to Amazon SNS or Amazon SQS, use the AWS CLI, AWS CloudFormation, or an SDK to add the lifecycle hook
Amazon EC2 Auto Scaling honors cooldown periods when using simple scaling policies, but not when using other scaling policies or scheduled scaling. A default cooldown period automatically applies to any scaling activities for simple scaling policies, and you can optionally request to have it apply to your manual scaling activities
- Auto Scaling Group only for EC2
- A collection of instances that are treated as a collective group for purposes of scaling and management
- Auto Scaling is vital to creating a highly available app; opt for an architecture that spreads resources across multiple AZs and uses ELBs
- Availability Zone distribution
- Balanced best effort: If launches fail in an AZ, the ASG will attempt to launch in another healthy AZ
- Balanced only: If launches fail in an AZ, the ASG will continue to attempt to launch in the unhealthy AZ to preserve balanced distribution
- You can manually attach a running instance to an Auto Scaling group
- You can manually detach an InService instance from an Auto Scaling group
Default Termination Policy:
- Determine which AZs have the most instances, and then identify at least one instance that is not scale-in protected
- Determine whether any of the instances eligible for termination use the oldest launch template or launch configuration (terminates instances that use a launch configuration before those with a launch template)
- After applying the preceding criteria, if there are multiple unprotected instances to terminate, determine which instances are closest to the next billing hour. If there are multiple unprotected instances closest to the next billing hour, terminate one of these instances at random
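The selection order above can be sketched in Python. This is a simplified model: the `template_age` field stands in for launch configuration/template age, and the real policy has more fallbacks than shown here:

```python
import random
from collections import Counter

def choose_instance_to_terminate(instances):
    """instances: list of dicts with keys 'az', 'protected',
    'template_age' (higher = older) and 'secs_to_billing_hour'.
    Illustrative sketch of the default termination policy order."""
    # 1. candidates come from the AZ(s) with the most instances,
    #    excluding scale-in protected instances
    counts = Counter(i["az"] for i in instances)
    busiest = max(counts.values())
    pool = [i for i in instances
            if counts[i["az"]] == busiest and not i["protected"]]
    if not pool:
        return None
    # 2. prefer instances using the oldest launch configuration/template
    oldest = max(i["template_age"] for i in pool)
    pool = [i for i in pool if i["template_age"] == oldest]
    # 3. prefer instances closest to the next billing hour
    closest = min(i["secs_to_billing_hour"] for i in pool)
    pool = [i for i in pool if i["secs_to_billing_hour"] == closest]
    # 4. tie-break at random
    return random.choice(pool)
```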
If the Availability Zones have an equal number of instances, Amazon EC2 Auto Scaling checks for the oldest launch configuration
Warm Pool
Instances in the warm pool are kept in one of three states: Stopped, Running, or Hibernated
Scale-in Protection
- prevents instances from being terminated during scale-in events
- if you enable instance (scale-in) protection on an existing Auto Scaling group, all new instances launched will have instance scale-in protection
- you can disable the scale-in protection setting on individual instances
- ASG Instance Protection is a specific instruction to the Auto Scaling service, not a lock on the instance itself
- does not guarantee that instances won't be terminated in the event of a human error (e.g. someone terminates the instance with the terminate-instance-in-auto-scaling-group command)
- EC2 termination protection doesn't prevent termination due to scale-in event, it only protects your instance from accidental termination
- even with termination protection and instance scale-in protection enabled, data saved to instance storage can be lost if a health check fails
- if you only enable termination protection this does not prevent Amazon EC2 Auto Scaling from terminating instances
Health Checks
Health check type
- Amazon EC2 status checks and scheduled events
- Default in ASG
- Checks that the instance is running
- EC2 instance status checks and system status checks
- Checks for underlying hardware or software issues that might impair the instance
- Turn on EBS health checks (EBS monitors whether the EC2 root volume / an attached volume stalls; if an alarm fires, the ASG replaces the instance)
- If an instance is affected by a scheduled event, the ASG considers the instance to be unhealthy and replaces it according to the timestamp of the event
- ELB health checks
- Checks whether the load balancer reports the instance as healthy, confirming whether the instance is available to handle requests
- To run this health check type, you must enable it for your ASG
- If connection draining (deregistration delay) is enabled, ASG waits for either in-flight requests to complete or the max timeout to expire before it terminates unhealthy instances
- VPC Lattice health checks
- Checks whether VPC Lattice reports the instance as healthy, confirming whether the instance is available to handle requests
- To run this health check type, you must enable it for your ASG
- Custom health checks
- Checks for any other problems that might indicate instance health issues, according to your custom health checks
- The health status of an ASG instance indicates whether it's healthy or unhealthy
- All instances in your ASG start with a Healthy status
- Instances are assumed to be healthy unless ASG receives notification that they are unhealthy
- This notification can come from sources such as Amazon EC2, Elastic Load Balancing, VPC Lattice, or custom health checks
- When ASG detects an unhealthy instance, it terminates it and launches a new one
Health check grace period (ELB Target Group) - This time period delays the first health check until your instances finish initializing. It doesn't prevent an instance from terminating when placed into a non-running state
Rebalance
AZ Rebalance
- EC2 Auto Scaling automatically rebalances the Auto Scaling group. It does this by launching instances in the enabled AZ with the fewest instances and terminating instances elsewhere
- The following actions can lead to rebalancing activity:
- You change the AZ associated with your Auto Scaling group
- You explicitly terminate or detach instances or place instances in standby, and then the group becomes unbalanced
- An AZ that previously had insufficient capacity recovers and now has additional capacity
- An AZ that previously had a Spot price above your maximum price now has a Spot price below your maximum price
- When rebalancing, new instances are launched before others are terminated (so performance and availability are not compromised)
- Being at or near the specified maximum capacity could impede or completely halt rebalancing activities
- To avoid this problem, the system can temporarily exceed the specified maximum capacity of a group during a rebalancing activity (by the greater of 10% or one instance)
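The temporary headroom rule can be sketched as follows (rounding the 10% up to a whole instance is an assumption here; AWS does not document the rounding):

```python
import math

def rebalance_capacity_limit(max_size):
    """Temporary capacity ceiling during rebalancing: the group may exceed
    its configured maximum by the greater of 10% or one instance."""
    headroom = max(math.ceil(max_size * 0.10), 1)
    return max_size + headroom
```

So a group with a maximum of 4 may briefly reach 5 instances, and a group with a maximum of 20 may briefly reach 22.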
Suspend AZ Rebalance
- When a scale-out or scale-in event occurs, the scaling process still tries to balance the Availability Zones. For example, during scale-out, it launches the instance in the Availability Zone with the fewest instances
- If you suspend the Launch process, AZ Rebalance neither launches new instances nor terminates existing instances. This is because AZRebalance terminates instances only after launching the replacement instances
- If you suspend the Terminate process, your Auto Scaling group can grow up to 10% larger than its maximum size because this is allowed temporarily during rebalancing activities. If the scaling process cannot terminate instances, your Auto Scaling group could remain above its maximum size until you resume the Terminate process
Capacity Rebalance
- When using Spot Instances you can turn on Capacity Rebalancing
- Attempts to launch a new Spot Instance whenever EC2 reports that a running Spot Instance is at an elevated risk of interruption
- After launching the new instance, it then terminates the earlier instance
Suspend and resume a process for an Auto Scaling group
Types of processes
- Launch—Adds instances to ASG when the group scales out, or when EC2 Auto Scaling chooses to launch instances for other reasons, such as when it adds instances to a warm pool
- Terminate—Removes instances from the ASG when the group scales in, or when EC2 Auto Scaling chooses to terminate instances for other reasons, such as when an instance is terminated for exceeding its maximum lifetime duration or failing a health check
- AddToLoadBalancer—Adds instances to the attached load balancer target group or Classic Load Balancer when they are launched
- AlarmNotification—Accepts notifications from CloudWatch alarms that are associated with dynamic scaling policies
- AZRebalance—Balances the number of EC2 instances in the group evenly across all of the specified Availability Zones when the group becomes unbalanced, for example, when a previously unavailable Availability Zone returns to a healthy state
- HealthCheck—Checks the health of the instances and marks an instance as unhealthy if Amazon EC2 or Elastic Load Balancing tells Amazon EC2 Auto Scaling that the instance is unhealthy. This process can override the health status of an instance that you set manually
- InstanceRefresh—Terminates and replaces instances using the instance refresh feature
- ReplaceUnhealthy—Terminates instances that are marked as unhealthy and then creates new instances to replace them
- ScheduledActions—Performs the scheduled scaling actions that you create or that are created for you when you create an AWS Auto Scaling scaling plan and turn on predictive scaling
Considerations
- You can suspend and resume individual processes or all processes
- Suspending a process affects all instances in your Auto Scaling group
- Suspending AlarmNotification allows you to temporarily stop the group's target tracking, step, and simple scaling policies without deleting the scaling policies or their associated CloudWatch alarms
- If you suspend the Launch and Terminate processes, or AZRebalance, and then you make changes to your Auto Scaling group (e.g. detaching instances or changing the AZs) your group can become unbalanced between Availability Zones. If that happens, after you resume the suspended processes, Amazon EC2 Auto Scaling gradually redistributes instances evenly between AZs
- Suspending the Terminate process doesn't prevent the successful termination of instances using the force delete option with the delete-auto-scaling-group command
Scaling DynamoDB Options
Table Capacity Modes
Provisioned
Requires effort to review past usage and set up upper/lower scaling limits; better to enable auto scaling mode
- Minimum capacity units
- Maximum capacity units
- Target Utilization %
- Initial provisioned units
You must specify table provisioned throughput capacity: amount of read and write activity that the table can support. DynamoDB uses this information to reserve system resources to meet your throughput needs
Optionally enable auto scaling to manage your table's throughput capacity. You still must provide initial settings for read and write capacity when you create the table. Auto scaling uses initial settings as a starting point, and then adjusts them dynamically
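The target-utilization adjustment can be sketched as follows (illustrative; the real service also applies cooldowns and scaling rate limits not shown here):

```python
import math

def scaled_capacity(consumed_units, target_utilization, min_units, max_units):
    """Provision enough capacity so that consumption sits at roughly the
    target utilization (e.g. 0.70 for 70%), clamped to the min/max
    capacity units configured for auto scaling."""
    desired = math.ceil(consumed_units / target_utilization)
    return max(min_units, min(desired, max_units))
```

For example, at a 50% target utilization, consuming 50 units drives the provisioned capacity toward 100 units.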
- DynamoDB tables require a partition key. Avoid hot keys, i.e. keys with similar values, which lead to hot partitions
- Always using the same value for the partition key will impact performance
4 Questions in Mind
Which is appropriate in this scenario: horizontal or vertical? Generally favour horizontal, which also brings HA by spreading across AZs, but in some scenarios vertical still applies
Is the scaling cost effective? Keep this in mind even if the question does not refer to cost-effectiveness. Always balance cost in the architecture
Is it highly available? Keep this in mind even if the question does not specifically mention HA; disregard HA only if the question rules it out explicitly
Would switching DB fix the problem? In real life switching DB is painful, but in the exam you can easily go with this option
Disaster Recovery
Your workload must perform its intended function correctly and consistently. To achieve this, you must architect for resiliency, that is, the ability of a workload to:
- recover from infrastructure, service, or application disruptions
- dynamically acquire computing resources to meet demand, and mitigate disruptions, such as misconfigurations or transient network issues
Disaster Recovery (DR) is an important part of your resiliency strategy and concerns how your workload responds when a disaster strikes. This response must be based on your organization's business objectives which specify your workload's strategy for avoiding loss of data, Recovery Point Objective (RPO), and reducing downtime, Recovery Time Objective (RTO).
DR can be compared to availability, which is another important component of your resiliency strategy. Whereas disaster recovery measures objectives for one-time events, availability objectives measure mean values over a period of time:
- Resiliency
- DR
- RTO - How quickly must you recover? What is the cost of downtime?
- RPO - How much data can you afford to recreate or lose?
- Availability
Availability = MTBF / (MTBF + MTTR) or Successful Responses / Valid Requests
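Both formulas as a quick sketch:

```python
def availability_mtbf(mtbf_hours, mttr_hours):
    """Availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def availability_requests(successful, valid):
    """Availability as the fraction of valid requests served successfully."""
    return successful / valid
```

For example, an MTBF of 999 hours with an MTTR of 1 hour gives 0.999, i.e. "three nines" of availability.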
High availability is not disaster recovery. DR has different objectives from Availability, measuring time to recovery after the larger scale events that qualify as disasters. You should first ensure your workload meets your availability objectives
Single AWS Region
- for a disaster event based on disruption or loss of one physical data center, implementing a highly available workload in multiple AZs within a single AWS Region helps mitigate against natural and technical disasters
- for extra assurance with your single-Region deployment, you can back up data and configuration (including infrastructure definition) to another Region
- this strategy reduces the scope of your disaster recovery plan to only include data backup and restoration
- in case of regulatory data residency requirements, then in addition to designing multi-AZ workloads for high availability as discussed above, you can also use the AZs within that Region as discrete locations
Multiple AWS Regions
Pilot Light (RPO/RTO: 10s of minutes)
- Data live (e.g. database, S3, EFS, ...) & services/apps switched off
- Provision core AWS resources and scale after event
- €€
Warm Standby (RPO/RTO: minutes)
- Business Critical
- Always running, but smaller
- Promote DB to primary and scale resources/instances after event
- €€€
Backup and Restore (RPO/RTO: hours)
- Low priority use cases
- Provision all AWS resources after event
- Restore backups after event
- €
Multi-site Active/Active (RPO/RTO: real-time/seconds)
- Mission Critical Services
- Zero downtime
- Near zero data loss (data disaster e.g. corruption, deletion, or obfuscation will always have RTO>0 and RPO some point before the disaster)
- Cost €€€€
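The trade-off between the four strategies can be sketched as a lookup; the RPO/RTO tiers below are rough illustrations of the ranges above, not official figures:

```python
# Rough RPO/RTO tiers for the strategies above (illustrative, in minutes),
# ordered from most to least expensive.
STRATEGIES = [
    ("Multi-site Active/Active", 1, "€€€€"),
    ("Warm Standby", 15, "€€€"),
    ("Pilot Light", 60, "€€"),
    ("Backup and Restore", 24 * 60, "€"),
]

def cheapest_strategy(required_rto_minutes):
    """Pick the least expensive strategy whose typical RTO still meets the
    requirement (tiers are rough, for illustration only)."""
    for name, max_rto_minutes, cost in reversed(STRATEGIES):
        if max_rto_minutes <= required_rto_minutes:
            return name
    return "Multi-site Active/Active"
```

A 30-minute RTO requirement, for instance, rules out Backup and Restore and Pilot Light, leaving Warm Standby as the cheapest fit.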