AWS Well-Architected Framework
5 Pillars
Operational Excellence
The ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures.
- Monitor Systems
- Proactively Detect Issues
- Document Procedures
- Small Changes & Test
- Automate Process
- Never Stop Improving
Design Principles
- Perform Operations as Code
- Annotate Documentation
- Make frequent small reversible changes
- Refine Operations Procedures Frequently
- Anticipate Failure
- Learn from all Operational Failures
Best Practices
- Prepare
- Operate
- Evolve
Prepare
- CloudFormation + CloudWatch Metrics & Alarms
- Standardise, test and approve AMIs & Resources
- Game days, break things, test procedures, Train staff
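A minimal sketch of "operations as code" for the Prepare step: a CloudFormation template with one CloudWatch alarm, built as a Python dict and dumped to JSON. The function name `build_alarm_template`, the logical ID `HighCpuAlarm`, the instance ID and the 80% threshold are all illustrative assumptions, not values from these notes.

```python
import json


def build_alarm_template(instance_id: str, threshold: float = 80.0) -> dict:
    """Return a CloudFormation template (as a dict) holding one CPU alarm."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Hypothetical example: CPU alarm defined as code",
        "Resources": {
            # "HighCpuAlarm" is an assumed logical ID, chosen for illustration.
            "HighCpuAlarm": {
                "Type": "AWS::CloudWatch::Alarm",
                "Properties": {
                    "AlarmDescription": "Average CPU above threshold for 10 minutes",
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                    "Statistic": "Average",
                    "Period": 300,            # seconds per datapoint
                    "EvaluationPeriods": 2,   # 2 x 300 s = 10 minutes
                    "Threshold": threshold,
                    "ComparisonOperator": "GreaterThanThreshold",
                },
            }
        },
    }


if __name__ == "__main__":
    # The instance ID below is a placeholder.
    print(json.dumps(build_alarm_template("i-0123456789abcdef0"), indent=2))
```

Keeping the alarm in a template means it can be reviewed, versioned and rolled back like any other small, reversible change.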
Operate
- CloudWatch
- CloudTrail
- VPC Flow Logs
- X-Ray
Evolve
- ElasticSearch (Non AWS)
- DevOps
- CodeCommit, CodeBuild/Test, CodeDeploy, CodePipeline, CodeStar
- X-Ray
Reliability Pillar
The ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues.
Reliability Design Principles
- Test Recovery Procedures (RTO, RPO)
- Automatically Recover from Failures (Self Healing)
- Scale Horizontally
- Stop Guessing Capacity
- Automate Change
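A rough sketch of what testing recovery procedures against RTO and RPO can look like. It assumes worst-case data loss equals the backup interval and that the restore time comes from a drill; `RecoveryTest` and `meets_objectives` are hypothetical names, not part of any AWS tooling.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTest:
    backup_interval_min: float   # time between backups = worst-case data loss
    restore_duration_min: float  # time measured to restore service in a drill


def meets_objectives(test: RecoveryTest, rpo_min: float, rto_min: float) -> dict:
    """Compare one drill result against the agreed RPO and RTO (in minutes)."""
    return {
        "rpo_met": test.backup_interval_min <= rpo_min,
        "rto_met": test.restore_duration_min <= rto_min,
    }


if __name__ == "__main__":
    drill = RecoveryTest(backup_interval_min=60, restore_duration_min=95)
    print(meets_objectives(drill, rpo_min=60, rto_min=120))
```

RPO bounds how much data may be lost; RTO bounds how long the service may be down. A drill that misses either one is the signal to change the backup schedule or the DR pattern.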
Reliability Best Practices
- Foundations (Security & Limits)
- Change Management (Monitoring & Scaling)
- Failure Management (Automate + Durable + Key)
Foundations (Security & Limits)
- Limit Access
- Isolate Resources
- Safeguard Applications
Change Management (Monitoring & Scaling)
- Monitor AWS APIs
- Automatically Scale
- Monitor Key Metrics
Failure Management (Automate + Durable + Key)
- Disaster Recovery Strategy
- Maintain Backups
- Test Recovery Procedures (RTO, RPO)
- Automatically Recover from Failures (Self Healing)
- Scale Horizontally
- Stop Guessing Capacity
- Automate Change
Disaster Recovery Design Patterns
- Backup & Restore
- Pilot Light
- Low Capacity Standby
- Multi-Site Active-Active
Backup & Restore
- CloudFormation Template & DR Region
- Restore data from backup (RTO)
- Modify DNS
Pilot Light
- Instances in DR region but powered off
- DB read-only replica on small instance
- Spin up Instances & Replication Lag (RTO)
Low Capacity Standby
- Some instances running (not all)
- Multi-master DB Replication (Synchronous)
- Autoscaling Group for instances (RTO)
- Route 53: Weighted: 95% | 5%
Multi-Site Active-Active
- 2x regions with capacity to take on 100%
- Route 53 weighted: 50% | 50%
- 50% or less utilisation in any region
- Route 53 detects failure 0% | 100% (RTO)
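The weighted splits above (95% | 5%, 50% | 50%, then 0% | 100% on failure) can be modelled with a small sketch. This is a simplified model, not Route 53 itself: it assumes unhealthy regions are simply dropped from the weighted pool, and that regions have equal capacity. The function names and region names are illustrative.

```python
from typing import Dict, Set


def traffic_split(weights: Dict[str, int], healthy: Set[str]) -> Dict[str, float]:
    """Share of traffic per region, counting only healthy regions' weights."""
    live = {r: w for r, w in weights.items() if r in healthy and w > 0}
    total = sum(live.values())
    if total == 0:
        # Simplification: this model sends traffic nowhere if nothing is healthy.
        return {r: 0.0 for r in weights}
    return {r: live.get(r, 0) / total for r in weights}


def survives_failover(utilisation_pct: Dict[str, float], failed: str) -> bool:
    """True if the survivors can absorb the failed region's load.

    Assumes equally sized regions, with the failed region's load
    spread evenly across the remaining ones.
    """
    survivors = [r for r in utilisation_pct if r != failed]
    if not survivors:
        return False
    extra = utilisation_pct[failed] / len(survivors)
    return all(utilisation_pct[r] + extra <= 100.0 for r in survivors)


if __name__ == "__main__":
    weights = {"eu-west-1": 50, "us-east-1": 50}
    print(traffic_split(weights, {"eu-west-1", "us-east-1"}))
    print(traffic_split(weights, {"us-east-1"}))
    print(survives_failover({"eu-west-1": 50.0, "us-east-1": 50.0}, "eu-west-1"))
```

The second function shows why Multi-Site Active-Active keeps each region at 50% utilisation or less: one region must be able to take 100% of the load.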
Security Pillar
The ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies.
Security Design Principles
- Implement Strong Identity Foundation
- Enable Traceability
- Apply Security at Every Layer
- Automate Security
- Protect Data in Transit & at Rest
- Prepare for Security Events
Implement Strong Identity Foundation
- Nobody uses Root Account
- Every user has their own IAM account
- Principle of Least Privilege
- Access keys never shared or posted on GitHub
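To make least privilege concrete, here is a small sketch that scans an IAM policy document (as a dict) for wildcard grants. The checker `find_wildcards`, the bucket name and the rule "flag `*` or anything ending in `:*`" are assumptions for illustration; real reviews need more than this.

```python
from typing import List, Tuple


def find_wildcards(policy: dict) -> List[Tuple[int, str, str]]:
    """Return (statement index, key, value) for each wildcard in Allow statements."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for i, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        for key in ("Action", "Resource"):
            values = stmt.get(key, [])
            if isinstance(values, str):
                values = [values]
            for value in values:
                if value == "*" or value.endswith(":*"):
                    findings.append((i, key, value))
    return findings


if __name__ == "__main__":
    too_broad = {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
    }
    print(find_wildcards(too_broad))
```

A policy that grants only `s3:GetObject` on one bucket's objects passes this check; one that grants `s3:*` on `*` does not.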
Enable Traceability
- Enable CloudTrail and see API calls users make
Apply Security at Every Layer
- SGs, NACLs, WAF, Instance Firewall
- Front with AWS Services (CloudFront + ALB)
- Use VPC controls and private subnets
Automate Security
- CloudFormation templates should include security
- AMIs hardened and approved in CFN template
- Prevent other AMIs from launching
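One way to automate the AMI rule is a pre-deployment check on the template. The sketch below assumes the template is already loaded as a dict and that only `AWS::EC2::Instance` and `AWS::AutoScaling::LaunchConfiguration` resources matter; the function name and AMI IDs are placeholders.

```python
from typing import Dict, Set

# Resource types whose "ImageId" property is checked in this sketch.
CHECKED_TYPES = ("AWS::EC2::Instance", "AWS::AutoScaling::LaunchConfiguration")


def unapproved_amis(template: dict, approved: Set[str]) -> Dict[str, object]:
    """Map logical resource ID -> ImageId for resources not using an approved AMI."""
    bad = {}
    for name, resource in template.get("Resources", {}).items():
        if resource.get("Type") not in CHECKED_TYPES:
            continue
        image = resource.get("Properties", {}).get("ImageId")
        # Anything that is not a literal approved ID (missing, a Ref, etc.) is flagged.
        if not isinstance(image, str) or image not in approved:
            bad[name] = image
    return bad


if __name__ == "__main__":
    template = {
        "Resources": {
            "Web": {"Type": "AWS::EC2::Instance", "Properties": {"ImageId": "ami-11111111"}},
            "Batch": {"Type": "AWS::EC2::Instance", "Properties": {"ImageId": "ami-99999999"}},
        }
    }
    print(unapproved_amis(template, approved={"ami-11111111"}))
```

Run in a pipeline, a non-empty result would fail the build before an unhardened image reaches an environment.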
Protect Data in Transit & at Rest
- Use SSL/TLS
- Server Side encryption for S3, Kinesis, DynamoDB, SQS, etc
Prepare for Security Events
- CloudWatch Events
- Notify Stakeholders
- Trigger Lambda Function to lock down environment or load new environment and failover to it
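A sketch of the Lambda side of this flow, under stated assumptions: the event is a CloudTrail-derived CloudWatch Events payload carrying `detail.userIdentity.type`, and root activity is treated as the security event. The handler only returns a decision; the actual lock-down or failover calls are left out, and the action names are hypothetical.

```python
def lambda_handler(event, context):
    """Decide how to respond to an incoming event (decision only, no AWS calls)."""
    detail = event.get("detail", {})
    identity_type = detail.get("userIdentity", {}).get("type")

    if identity_type == "Root":
        # Assumed policy: any root account activity triggers the response plan.
        return {
            "action": "notify_and_lock_down",  # hypothetical action name
            "reason": "root account activity: %s" % detail.get("eventName", "unknown"),
        }
    return {"action": "none", "reason": "no rule matched"}


if __name__ == "__main__":
    sample = {"detail": {"eventName": "ConsoleLogin", "userIdentity": {"type": "Root"}}}
    print(lambda_handler(sample, None))
```

In a real function the returned action would be replaced by calls that notify stakeholders and isolate or replace the environment.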
Security Best Practices
- Identity & Access Management
- Detective Controls
- Infrastructure Protection
- Data Protection
- Incident Response
Identity & Access Management
- IAM (Access Control)
- AWS Organizations (Centrally Manage Accounts)
- MFA (Identity Authentication)
- Security Token Service (STS) (Limited Life Credentials)
Detective Controls
- CloudTrail (API Access Logs)
- AWS Config (Resource Inventory)
- CloudWatch (Logs, Metric Filters)
- GuardDuty (Threat Detection)
Infrastructure Protection
- VPC (Isolated Virtual Networks)
- Inspector (Vulnerability Detection)
- Shield (DDoS Mitigation)
- WAF (Application Firewall)
Data Protection
- Macie (Data Security Automation)
- S3 (Object Encryption)
- EBS (Block Encryption)
- KMS (Encryption Key Management)
Incident Response
- CloudFormation (Infrastructure as Code)
- IAM (Response team authorisation)
Performance Efficiency
The ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve.
Performance Efficiency Design Principles
- Democratise Advanced Technologies
- Go Global in Minutes
- Use Serverless Architectures
- Experiment More Often
- Mechanical Sympathy
Performance Efficiency Best Practices
- Selection
- Review
- Monitor
- Tradeoffs
Democratise Advanced Technologies
- Before you DIY, check if there is a managed service
- PAYG, Highly Available, Fault Tolerant, Already built
Ex. Don't build your own NoSQL DB when DynamoDB already exists
Go Global in Minutes
- Leverage AWS Global Reach (Regions around the world)
- Optimise Experience for Users where they are in the World
Use Serverless Architectures
- No upfront costs, Only pay for what you use
- Don't pay for idle servers, maintenance and licences
Experiment More Often
- Try new features, new services and new instance types that become available from AWS
- New advancements can optimise existing or old architectures
Mechanical Sympathy
- Being aware of the different options available to accomplish something
- Know where they excel and where they don't
Ex. Storage (S3, EBS, EFS, Glacier): which one meets your objectives?
Selection
- Compute (Autoscaling)
- Storage (EBS, S3)
- Databases (DynamoDB, RDS)
- Network (VPC, Route 53, DX)
Review
- AWS Blog & What's New
Monitor
- CloudWatch (Metrics, Alarms, Notifications)
- Lambda (Automated Actions)
Tradeoffs
- CloudFront (Global Caching)
- ElastiCache (Request Offloading)
- Snowball (Data Migration)
- RDS (Read Replicas)
Cost Optimisation
The ability to run systems to deliver business value at the lowest price point.
Cost Optimisation Design Principles
- Adopt a Consumption Model
- Measure Overall Efficiency
- Stop spending money on Data Centres
- Analyse and Attribute Expenditure
- Use Managed Services
Cost Optimisation Best Practices
- Cost Effective Resources
- Matching Supply & Demand
- Expenditure Awareness
- Optimising Over Time
Adopt a Consumption Model
- Only pay for what you use
- Check for under-utilised or low traffic instances
- Write scripts to start and stop instances when needed
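A start/stop script needs a rule for when an instance should be up. The sketch below covers only that decision and assumes office hours of 08:00 to 18:00 on weekdays; the function name and defaults are illustrative, and the actual start/stop API calls are left out.

```python
from datetime import datetime


def should_be_running(now: datetime, start_hour: int = 8, stop_hour: int = 18,
                      weekdays_only: bool = True) -> bool:
    """True if an instance on this schedule should be running at `now`."""
    if weekdays_only and now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    return start_hour <= now.hour < stop_hour


if __name__ == "__main__":
    print(should_be_running(datetime(2024, 1, 1, 9, 0)))   # a Monday morning
    print(should_be_running(datetime(2024, 1, 6, 9, 0)))   # a Saturday morning
```

A scheduled job could call this for each tagged instance and start or stop it when the answer differs from its current state.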
Measure Overall Efficiency
- Use CloudWatch to identify underutilised or unused resources in your account.
Ex. < 2% CPU utilisation over 2 days.
Ex. < 2% connections to database over 2 days
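The CPU example above can be written as a simple check. This sketch assumes hourly CPU averages have already been fetched from CloudWatch, and it reads "< 2% over 2 days" as "every hourly value in the last 48 hours is below 2%", which is one possible interpretation.

```python
from typing import Sequence


def is_underutilised(hourly_cpu_pct: Sequence[float], threshold_pct: float = 2.0,
                     window_hours: int = 48) -> bool:
    """True if every sample in the most recent window is below the threshold."""
    if len(hourly_cpu_pct) < window_hours:
        return False  # not enough data to judge
    recent = hourly_cpu_pct[-window_hours:]
    return max(recent) < threshold_pct


if __name__ == "__main__":
    idle = [0.7] * 48
    busy_once = [0.7] * 47 + [35.0]
    print(is_underutilised(idle), is_underutilised(busy_once))
```

Using the maximum rather than the mean avoids flagging an instance that is idle most of the time but does real work in short bursts.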
Stop Spending Money on Data Centres
- Expand onto the cloud instead of buying more hardware
Analyse and Attribute Expenditure
- Use Tags to attribute expenditure to projects & departments
- Use cost reports to charge back projects, departments or customers
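Attributing spend by tag comes down to a group-by. The sketch below assumes cost line items are already available as dicts with `cost` and `tags` fields; that shape, the tag key `project` and the `untagged` bucket are assumptions for illustration, not the format of any AWS report.

```python
from collections import defaultdict
from typing import Dict, Iterable


def cost_by_tag(line_items: Iterable[dict], tag_key: str) -> Dict[str, float]:
    """Sum cost per value of `tag_key`; items without the tag go to 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


if __name__ == "__main__":
    items = [
        {"cost": 10.0, "tags": {"project": "website"}},
        {"cost": 5.0, "tags": {"project": "analytics"}},
        {"cost": 2.5, "tags": {"project": "website"}},
        {"cost": 1.0, "tags": {}},
    ]
    print(cost_by_tag(items, "project"))
```

A large `untagged` total is itself useful: it shows how much spend cannot yet be charged back to anyone.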
Use Managed Services
- Why set up your own email server when a managed service is more cost effective?
Cost Effective Resources
- Instances (Reserved & Spot)
- Tags (Cost allocation tags)
Matching Supply & Demand
- CloudWatch > AutoScaling (Scale in when demand drops)
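A sketch of how a metric can drive capacity in both directions. It uses a simple proportional rule (scale the instance count by metric / target, then clamp), which is an assumption for illustration and not a statement of how AWS Auto Scaling computes capacity internally.

```python
import math


def desired_capacity(current: int, metric_value: float, target_value: float,
                     min_size: int = 1, max_size: int = 10) -> int:
    """Proportional scaling: keep metric_value near target_value per instance."""
    if current <= 0 or target_value <= 0:
        return min_size
    wanted = math.ceil(current * metric_value / target_value)
    return max(min_size, min(max_size, wanted))


if __name__ == "__main__":
    # 4 instances at 25% CPU with a 50% target: scale in.
    print(desired_capacity(4, metric_value=25.0, target_value=50.0))
    # 4 instances at 90% CPU with a 50% target: scale out.
    print(desired_capacity(4, metric_value=90.0, target_value=50.0))
```

Scaling in when demand drops is the cost side of the same rule that scales out under load.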
Expenditure Awareness
- CloudWatch > SNS (Notification when costs exceed budget)
- Tags (Cost allocation tags)
Optimising Over Time
- Trusted Advisor (Weekly email report)
- AWS Blog & What's New
What is Amazon Inspector?
- Agent based security scans
- Rules Knowledge base
- Common Vulnerabilities
- Best Practices
- Center for Internet Security (CIS) Benchmarks
- Runtime behaviour analysis
- Insecure ports and protocols
- Example Findings
- Remote root login enabled
- Vulnerable software versions
What is Amazon Macie?
- Machine learning based data classification
- Data stored in S3 (more datastores coming)
- Risky & Suspicious Activity
- Alerts
- High risk data events
- Credentials in source code
- Unencrypted backups
- Signs of potential attack