Well-Architected Framework
5 Pillars

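Operational Excellence
The ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures.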

  1. Monitor Systems
  2. Proactively Detect Issues
  3. Document Procedures
  4. Small Changes & Test
  5. Automate Process
  6. Never Stop Improving

Design Principles

  1. Perform Operations as Code
  2. Documentation (Annotation)
  3. Make frequent small reversible changes
  4. Refine Operations Procedures Frequently
  5. Anticipate Failure
  6. Learn from all Operational Failures

Best Practices

  1. Prepare
  2. Operate
  3. Evolve

Prepare

  1. CloudFormation + CloudWatch Metrics & Alarms
  2. Standardise, test and approve AMIs & Resources
  3. Game days: break things, test procedures, train staff

Operate

  1. CloudWatch
  2. CloudTrail
  3. VPC Flow Logs
  4. X-Ray

Evolve

  1. ElasticSearch (Non AWS)
  2. DevOps
    • CodeCommit, CodeBuild/Test, CodeDeploy, CodePipeline, CodeStar
  3. X-Ray

Reliability Pillar
The ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues.

Reliability Design Principles

  1. Test Recovery Procedures (RTO, RPO)
  2. Automatically Recover from Failures (Self Healing)
  3. Scale Horizontally
  4. Stop Guessing Capacity
  5. Automate Change

Reliability Best Practices

  1. Foundations (Security & Limits)
  2. Change Management (Monitoring & Scaling)
  3. Failure Management (Automate + Durable + Key)

Foundations (Security & Limits)

  1. Limit Access
  2. Isolate Resources
  3. Safeguard Applications

Change Management (Monitoring & Scaling)

  1. Monitor AWS APIs
  2. Automatically Scale
  3. Monitor Key Metrics

Failure Management (Automate + Durable + Key)

  1. Disaster Recovery Strategy
  2. Maintain Backups
  3. Test Recovery Procedures (RTO, RPO)
  4. Automatically Recover from Failures (Self Healing)
  5. Scale Horizontally
  6. Stop Guessing Capacity
  7. Automate Change

Reliability Best Practices
Disaster Recovery Design Patterns

  1. Backup & Restore
  2. Pilot Light
  3. Low Capacity Standby
  4. Multi-Site Active-Active

Backup & Restore

  1. CloudFormation Template & DR Region
  2. Restore data from backup (RTO)
  3. Modify DNS
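
The Backup & Restore steps above can be checked against recovery objectives with simple arithmetic. The following is a minimal sketch, not an AWS tool: the function names and the step durations are hypothetical figures chosen for illustration. It assumes the worst-case data loss equals the backup interval and that the recovery time is the sum of the restore steps.

```python
from datetime import timedelta
from typing import Mapping


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    # A failure just before the next backup loses everything since the last one.
    return backup_interval


def estimated_recovery_time(steps: Mapping[str, timedelta]) -> timedelta:
    # Steps are assumed to run one after another, so their durations add up.
    return sum(steps.values(), timedelta())


def meets_objectives(backup_interval: timedelta,
                     steps: Mapping[str, timedelta],
                     rpo: timedelta,
                     rto: timedelta) -> bool:
    return (worst_case_data_loss(backup_interval) <= rpo
            and estimated_recovery_time(steps) <= rto)


# Hypothetical durations for the three steps listed above.
steps = {
    "deploy CloudFormation template in DR region": timedelta(minutes=30),
    "restore data from backup": timedelta(hours=2),
    "modify DNS": timedelta(minutes=10),
}
```

With nightly backups this plan would satisfy a 24 hour RPO and a 4 hour RTO, but not a 1 hour RPO, which is what pushes a design towards Pilot Light or a standby pattern.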

Pilot Light

  1. Instances in DR region but powered off
  2. DB read-only replica on small instance
  3. Spin up Instances & Replication Lag (RTO)

Low Capacity Standby

  1. Some instances running (not all)
  2. Multi-master DB Replication (Synchronous)
  3. Autoscaling Group for instances (RTO)
  4. Route 53: Weighted: 95% | 5%

Multi-Site Active-Active

  1. 2x regions with capacity to take on 100%
  2. Route 53 weighted: 50% | 50%
  3. 50% or less utilisation in any region
  4. Route 53 detects failure 0% | 100% (RTO)
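
The weighted splits above (95% | 5% and 50% | 50%) and the "50% or less utilisation" rule can be illustrated with a small model. This is a simplified sketch, not Route 53 itself: the function names are hypothetical, each region's share is taken as its weight divided by the sum of the weights of healthy regions, and the case where no region is healthy is deliberately not modelled.

```python
from typing import Dict, Set


def traffic_share(weights: Dict[str, int], healthy: Set[str]) -> Dict[str, float]:
    """Fraction of traffic each region receives under weighted routing."""
    live = {region: w for region, w in weights.items() if region in healthy}
    total = sum(live.values())
    if total == 0:
        # Real DNS services have their own rules here; this sketch does not model them.
        raise ValueError("no healthy region with a non-zero weight")
    return {region: live.get(region, 0) / total for region in weights}


def utilisation_after_failover(utilisation_before: float,
                               share_before: float,
                               share_after: float) -> float:
    """Scale a region's utilisation in proportion to its change in traffic share."""
    return utilisation_before * share_after / share_before


active_active = {"region-a": 50, "region-b": 50}
normal = traffic_share(active_active, {"region-a", "region-b"})
failed = traffic_share(active_active, {"region-b"})
```

A region running at 50% utilisation with half the traffic lands at 100% when it takes all of it, which is why each site should stay at or below 50% in normal operation.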

Security Pillar
The ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies.

Security Design Principles

  1. Implement Strong Identity Foundation
  2. Enable Traceability
  3. Apply Security at Every Layer
  4. Automate Security
  5. Protect Data in Transit & at Rest
  6. Prepare for Security Events

Implement Strong Identity Foundation

  1. Nobody uses Root Account
  2. Every user has their own IAM account
  3. Principle of Least Privilege
  4. Access keys never shared or posted on GitHub
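
Least privilege is easier to see in a concrete policy. The sketch below builds an IAM policy document as a Python dictionary and flags statements that allow wildcard actions or resources. The bucket name and the helper `overly_broad_statements` are hypothetical, and the check is a rough illustration, not a replacement for a proper policy analyser.

```python
import json
from typing import Any, Dict, List


def _as_list(value: Any) -> List[str]:
    # IAM accepts either a single string or a list for Action and Resource.
    return [value] if isinstance(value, str) else list(value)


def overly_broad_statements(policy: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return Allow statements that use a wildcard action or resource."""
    findings = []
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action", []))
        resources = _as_list(statement.get("Resource", []))
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        if wildcard_action or "*" in resources:
            findings.append(statement)
    return findings


# Read-only access to the objects of one (hypothetical) bucket.
read_reports = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-reports-bucket/*",
        }
    ],
}

# Administrator-style policy: everything on everything.
allow_all = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
}

policy_json = json.dumps(read_reports, indent=2)
```

The first policy passes the check because it names one action and one bucket; the second is flagged because both fields are wildcards.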

Enable Traceability

  1. Enable CloudTrail and see API calls users make

Apply Security at Every Layer

  1. SGs, NACLs, WAF, Instance Firewall
  2. Front with AWS Services (CloudFront + ALB)
  3. Use VPC controls and private subnets

Automate Security

  1. CloudFormation templates should include security
  2. AMIs hardened and approved in CFN template
  3. Prevent other AMIs from launching

Protect Data in Transit & at Rest

  1. Use SSL/TLS
  2. Server Side encryption for S3, Kinesis, DynamoDB, SQS, etc

Prepare for Security Events

  1. CloudWatch Events
  2. Notify Stakeholders
  3. Trigger Lambda Function to lock down environment or load new environment and failover to it

Security Best Practices

  1. Identity & Access Management
  2. Detective Controls
  3. Infrastructure Protection
  4. Data Protection
  5. Incident Response

Identity & Access Management

  1. IAM (Access Control)
  2. AWS Organizations (Centrally Manage Accounts)
  3. MFA (Identity Authentication)
  4. Security Token Service (Limited Life Credentials)

Detective Controls

  1. CloudTrail (API Access Logs)
  2. AWS Config (Resource Inventory)
  3. CloudWatch (Logs, Metric Filters)
  4. GuardDuty (Threat Detection)

Infrastructure Protection

  1. VPC (Isolated Virtual Networks)
  2. Inspector (Vulnerability Detection)
  3. Shield (DDoS Mitigation)
  4. WAF (Application Firewall)

Data Protection

  1. Macie (Data Security Automation)
  2. S3 (Object Encryption)
  3. EBS (Block Encryption)
  4. KMS (Key Encryption Management)

Incident Response

  1. CloudFormation (Infrastructure as Code)
  2. IAM (Response team authorisation)

Performance Efficiency
The ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve.

Performance Efficiency Design Principles

  1. Democratise Advanced Technologies
  2. Go Global in Minutes
  3. Use Serverless Architectures
  4. Experiment more Often
  5. Mechanical Sympathy

Performance Efficiency Best Practices

  1. Selection
  2. Review
  3. Monitor
  4. Tradeoffs

Democratise Advanced Technologies

  1. Before you DIY, check if there is a managed service
  2. PAYG, Highly Available, Fault Tolerant, Already built
    Ex. Your own NoSQL DB when there is already DynamoDB

Go Global in Minutes

  1. Leverage AWS Global Reach (Regions around the world)
  2. Optimise Experience for Users where they are in the World

Use Serverless Architectures

  1. No upfront costs, Only pay for what you use
  2. Don't pay for idle servers, maintenance and licences

Experiment More Often

  1. Try new features, new services and new instance types that become available from AWS
  2. New advancements can optimise existing or old architectures

Mechanical Sympathy

  1. Being aware of the different options available to accomplish something
  2. Know where they excel and where they don't
    Ex. Storage (S3, EBS, EFS, Glacier): which meets your objectives?

Selection

  1. Compute (Autoscaling)
  2. Storage (EBS, S3)
  3. Databases (DynamoDB, RDS)
  4. Network (VPC, Route 53, DX)

Review

  1. AWS Blog & What's New

Monitor

  1. CloudWatch (Metrics, Alarms, Notifications)
  2. Lambda (Automated Actions)

Tradeoffs

  1. CloudFront (Global Caching)
  2. ElastiCache (Request Offloading)
  3. Snowball (Data Migration)
  4. RDS (Read Replicas)

Cost Optimisation
The ability to run systems to deliver business value at the lowest price point.

Cost Optimisation Design Principles

  1. Adopt a Consumption Model
  2. Measure Overall Efficiency
  3. Stop spending money on Data Centres
  4. Analyse and Attribute Expenditure
  5. Use Managed Services

Cost Optimisation Best Practices

  1. Cost Effective Resources
  2. Matching Supply & Demand
  3. Expenditure Awareness
  4. Optimising Over Time

Adopt a Consumption Model

  1. Only pay for what you use
  2. Check for under-utilised or low traffic instances
  3. Write scripts to start and stop instances when needed
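
A start/stop script mostly comes down to deciding what state each instance should be in at a given time. The sketch below shows only that decision logic, under assumed office hours of 08:00 to 18:00 on weekdays; the function names and instance IDs are hypothetical. In a real script the resulting lists would be passed to the EC2 API (for example boto3's `start_instances` and `stop_instances`), which is left out here so the example runs without AWS credentials.

```python
from datetime import datetime
from typing import Dict, List


def desired_state(now: datetime, start_hour: int = 8, stop_hour: int = 18) -> str:
    """Return 'running' during weekday office hours, otherwise 'stopped'."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = start_hour <= now.hour < stop_hour
    return "running" if is_weekday and in_hours else "stopped"


def plan_actions(instances: Dict[str, str], now: datetime) -> Dict[str, List[str]]:
    """Work out which instances to start or stop to reach the desired state."""
    target = desired_state(now)
    plan: Dict[str, List[str]] = {"start": [], "stop": []}
    for instance_id, state in sorted(instances.items()):
        if target == "running" and state == "stopped":
            plan["start"].append(instance_id)
        elif target == "stopped" and state == "running":
            plan["stop"].append(instance_id)
    return plan


# Hypothetical development fleet: instance ID -> current state.
fleet = {"i-dev-1": "running", "i-dev-2": "stopped"}
monday_morning = plan_actions(fleet, datetime(2024, 1, 8, 9, 0))
saturday_morning = plan_actions(fleet, datetime(2024, 1, 6, 9, 0))
```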

Measure Overall Efficiency

  1. Use CloudWatch to identify underutilised or unused resources in your account.
    Ex. < 2% CPU utilisation over 2 days.
    Ex. < 2% connections to database over 2 days
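
The "< 2% over 2 days" rule can be expressed as a filter over metric samples. This is a minimal sketch that assumes hourly CPU averages have already been fetched (for instance from CloudWatch's `CPUUtilization` metric) into plain lists; the function name, instance IDs and the choice of 48 hourly samples are assumptions made for illustration.

```python
from typing import Dict, List


def find_underutilised(cpu_by_instance: Dict[str, List[float]],
                       threshold: float = 2.0,
                       min_samples: int = 48) -> List[str]:
    """Return instances whose CPU stayed below the threshold for the whole window.

    Instances with fewer than min_samples data points are skipped, since a
    newly launched instance has not been observed for the full two days.
    """
    flagged = []
    for instance_id, samples in sorted(cpu_by_instance.items()):
        window = samples[-min_samples:]
        if len(window) >= min_samples and max(window) < threshold:
            flagged.append(instance_id)
    return flagged


# Hypothetical hourly CPU percentages.
metrics = {
    "i-idle": [0.5] * 48,
    "i-busy": [0.5] * 47 + [35.0],
    "i-new": [0.1] * 10,
}
candidates = find_underutilised(metrics)
```

Using the maximum instead of the average keeps an instance with a short daily burst of real work off the list.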

Stop Spending Money on Data Centres

  1. Expand onto the cloud instead of buying more hardware

Analyse and Attribute Expenditure

  1. Use Tags to attribute expenditure to projects & departments
  2. Use cost reports to charge back projects, departments or customers
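
Attributing spend by tag is a group-by over billing line items. The sketch below assumes a simplified, hypothetical record shape (a `cost` and a `tags` dictionary per item) instead of the real cost report format, and uses `Decimal` so that currency totals are exact.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Any, Dict, List


def cost_by_tag(line_items: List[Dict[str, Any]],
                tag_key: str,
                untagged: str = "(untagged)") -> Dict[str, Decimal]:
    """Sum costs per value of the given tag; untagged items get their own bucket."""
    totals: Dict[str, Decimal] = defaultdict(Decimal)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, untagged)
        totals[owner] += item["cost"]
    return dict(totals)


# Hypothetical line items for one month.
items = [
    {"cost": Decimal("100.25"), "tags": {"project": "checkout"}},
    {"cost": Decimal("50.00"), "tags": {"project": "checkout"}},
    {"cost": Decimal("40.00"), "tags": {"project": "search"}},
    {"cost": Decimal("9.75"), "tags": {}},
]
report = cost_by_tag(items, "project")
```

The untagged bucket is worth watching: spend that cannot be attributed cannot be charged back.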

Use Managed Services
Why set up your own email server when there is a managed service that is more cost effective?

Cost Effective Resources

  1. Instances (Reserved & Spot)
  2. Tags (Cost allocation tags)

Matching Supply & Demand
CloudWatch > AutoScaling (Scale in when demand drops)
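
Matching supply to demand can be sketched as a proportional rule: size the fleet so that the observed metric would return to its target. This is a simplified illustration and not the actual Auto Scaling algorithm; the function name, the CPU target and the capacity limits are assumptions.

```python
import math


def desired_capacity(current_capacity: int,
                     current_metric: float,
                     target_metric: float,
                     min_size: int = 1,
                     max_size: int = 20) -> int:
    """Scale capacity in proportion to metric / target, within fleet limits."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_capacity * current_metric / target_metric)
    return max(min_size, min(max_size, raw))


# Demand drops: 10 instances at 20% CPU against a 50% target -> scale in.
scale_in = desired_capacity(10, 20.0, 50.0, min_size=2)
# Demand rises: 4 instances at 90% CPU against a 50% target -> scale out.
scale_out = desired_capacity(4, 90.0, 50.0, min_size=2)
```

Rounding up and enforcing a minimum size both err on the side of keeping slightly more capacity than the raw ratio suggests.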

Expenditure Awareness
CloudWatch > SNS (Notification when costs exceed budget)
Tags (Cost allocation tags)

Optimising Over Time
Trusted Advisor (Weekly email report)
AWS Blog & What's New

What is Amazon Inspector?

  1. Agent based security scans
  2. Rules Knowledge base
    • Common Vulnerabilities
    • Best Practices
    • Center for Internet Security (CIS) Benchmarks
    • Runtime behaviour analysis
    • Insecure ports and protocols
  3. Example Findings
    • Remote root login enabled
    • Vulnerable software versions

What is Amazon Macie?

  1. Machine learning based data classification
  2. Data stored in S3 (more datastores coming)
  3. Risky & Suspicious Activity
  4. Alerts
    • High risk data events
    • Credentials in source code
    • Unencrypted backups
    • Signs of potential attack