Azure Well-Architected Framework

Reliability

Principles

Design for scale out

Design for failure

Design for self-healing

Observe application health

Design for business requirements (for example, a 99.99% SLA)

Drive automation

Patterns

High Availability
(percentage of uptime)

Deployment Stamps

Deploy multiple independent copies of application components, including data stores

Geodes

Deploy backend services into a set of geographical nodes, each of which can service any client request in any region

Health Endpoint Monitoring

Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals
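A minimal sketch of the check logic behind such an endpoint, in Python. The dependency names and the `health_check` helper are illustrative; a real implementation would expose the result over HTTP (for example, at `GET /health`) for an external monitor to poll:

```python
def health_check(dependencies):
    """Run each functional check and summarize overall status,
    as an external monitoring tool polling the endpoint would see it."""
    results = {name: check() for name, check in dependencies.items()}
    status = "healthy" if all(results.values()) else "degraded"
    return {"status": status, "checks": results}

# Hypothetical checks: real ones would ping the database, cache, etc.
checks = {"database": lambda: True, "blob_storage": lambda: False}
report = health_check(checks)
```

Returning per-dependency results (not just an overall flag) lets the monitor distinguish a degraded dependency from a total outage.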

Queue-Based Load Leveling

Use a queue that acts as a buffer between a task and a service that it invokes, to smooth intermittent heavy loads
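A minimal in-process sketch of this idea using Python's standard library: producers enqueue a burst of work, while a single worker (the "service") drains the queue at its own pace. A production system would use a durable message queue rather than an in-memory one:

```python
import queue
import threading

tasks = queue.Queue(maxsize=100)   # the buffer between producers and the service
processed = []

def worker():
    """The 'service': drains the queue at its own steady pace."""
    while True:
        item = tasks.get()
        if item is None:           # sentinel value signals shutdown
            break
        processed.append(item * 2) # stand-in for real work

t = threading.Thread(target=worker)
t.start()
for i in range(5):                 # a burst of requests is absorbed by the queue
    tasks.put(i)
tasks.put(None)
t.join()
```

The bounded `maxsize` also provides back-pressure: producers block rather than overwhelming the service.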

Throttling

Control the consumption of resources

Bulkhead

Isolate elements of an application into pools so that if one fails, the others will continue to function
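A minimal sketch of a bulkhead using separate thread pools, one per downstream dependency (the pool names are illustrative). Exhausting one pool cannot starve the other:

```python
from concurrent.futures import ThreadPoolExecutor

# Each downstream dependency gets its own fixed-size pool, so a slow or
# failing payments service cannot consume the threads used for reports.
payments_pool = ThreadPoolExecutor(max_workers=2)
reports_pool = ThreadPoolExecutor(max_workers=2)

def call_in_pool(pool, fn, *args):
    """Route a call through the bulkhead assigned to its dependency."""
    return pool.submit(fn, *args).result()
```

The same isolation can be achieved with semaphores, connection-pool limits, or separate processes; the key is a fixed resource budget per dependency.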

Circuit Breaker

Handle faults that might take a variable amount of time to fix when connecting to a remote service or resource
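A minimal circuit-breaker sketch: after a threshold of consecutive failures the circuit "opens" and calls fail fast until a cooldown elapses. The thresholds and the half-open behavior are simplified assumptions; production breakers track more state:

```python
import time

class CircuitBreaker:
    """Opens after N consecutive failures, then rejects calls
    until reset_timeout seconds have passed (half-open trial after that)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0              # success closes the circuit
        return result
```

Failing fast protects both the caller (no wasted waits) and the struggling remote service (reduced load while it recovers).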

Resilience
(ability to gracefully handle and recover from failures)

Bulkhead

Isolate elements of an application into pools so that if one fails, the others will continue to function

Circuit Breaker

Handle faults that might take a variable amount of time to fix when connecting to a remote service or resource

Compensating Transaction (related to the Saga pattern)

Undo the work performed by a series of steps, which together define an eventually consistent operation
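A minimal saga-style sketch: each step pairs an action with a compensating action, and on failure the completed steps are undone in reverse. The booking scenario and step names are illustrative:

```python
def run_saga(steps):
    """Execute (do, undo) steps in order; on failure, run the undo
    actions of the completed steps in reverse to compensate."""
    completed = []
    try:
        for do, undo in steps:
            do()
            completed.append(undo)
    except Exception:
        for undo in reversed(completed):
            undo()
        raise

log = []

def charge_card():
    raise RuntimeError("card declined")   # the step that fails

steps = [
    (lambda: log.append("book flight"), lambda: log.append("cancel flight")),
    (lambda: log.append("book hotel"),  lambda: log.append("cancel hotel")),
    (charge_card,                       lambda: log.append("refund")),
]
```

Note the result is eventually consistent, not atomic: other parties may observe the intermediate bookings before the compensation runs.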

Health Endpoint Monitoring

Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals

Leader Election

Coordinate the actions performed by a collection of collaborating task instances in a distributed application by electing one instance as the leader that assumes responsibility for managing the other instances
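A deliberately simplified election sketch: every healthy instance applies the same deterministic rule (lowest ID wins), so all of them agree on the leader. Real systems typically acquire a lease on shared storage or use a consensus service instead; this only illustrates the agreement idea:

```python
def elect_leader(instances):
    """Deterministic election among healthy instances: the lowest ID
    wins, so every instance computes the same leader independently."""
    healthy = [i for i in instances if i["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy instance available to lead")
    return min(healthy, key=lambda i: i["id"])["id"]
```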

Queue-Based Load Leveling

Use a queue that acts as a buffer between a task and a service that it invokes, to smooth intermittent heavy loads

Retry

Enable an application to handle anticipated, temporary failures when it tries to connect to a service or network resource by transparently retrying an operation that's previously failed.
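A minimal retry sketch with exponential backoff and jitter; `TransientError` and the `flaky` operation are stand-ins for a real transient network failure:

```python
import random
import time

class TransientError(Exception):
    """Stands in for a temporary network or service failure."""

def retry(operation, max_attempts=4, base_delay=0.01):
    """Retry a failing operation with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise                  # give up after the final attempt
            # Jittered exponential backoff spreads retries out in time.
            time.sleep(base_delay * (2 ** (attempt - 1)) * random.random())

calls = {"count": 0}

def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporarily unavailable")
    return "ok"
```

Only retry failures that are plausibly transient; retrying a permanent error (for example, a 403) just adds load and latency.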

Scheduler Agent Supervisor

Coordinate a set of actions across a distributed set of services and other remote resources

Performance Efficiency

Principles

Design for horizontal scaling

Define a capacity model according to the business requirements

  • Test the limits for predicted and random spikes and fluctuations in load

Use PaaS offerings

Choose the right resources and right-size

Apply strategies in your design early

  • Strive for stateless applications
  • Store state externally in a database or distributed cache
  • Use caching where possible

Shift-left on performance testing

Run load and stress tests

Establish performance baselines

Run the test in the continuous integration (CI) build pipeline

Continuously monitor for performance in production

Monitor the health of the entire solution

Reevaluate the needs of the workload continuously

Checklist

Application design

Design for scaling

Scale as a unit

Take advantage of platform autoscaling features

Partition the workload

Avoid client affinity (sticky sessions); keep instances stateless

Offload CPU-intensive and I/O-intensive tasks as background tasks

Data management

Use data partitioning

Design for eventual consistency

Reduce chatty interactions between components and services

  • Where possible, combine several related operations into a single request
  • Use stored procedures in databases to encapsulate complex logic, and reduce the number of round trips and resource locking

Use queues to level the load for high velocity data writes

  • Use a queue that acts as a buffer between a task and a service that it invokes
  • This can smooth intermittent heavy loads that may otherwise cause the service to fail or the task to time out

Minimize the load on the data store

  • The data store is commonly a processing bottleneck, a costly resource, and often not easy to scale out
  • Typically, it's much easier to scale out the application than the data store, so you should attempt to do as much of the compute-intensive processing as possible within the application

Minimize the volume of data retrieved

Aggressively use caching

Handle data growth and retention

Optimize Data Transfer Objects (DTOs) using an efficient binary format

  • DTOs are passed between the layers of an application many times

Enable client side caching

Consider denormalizing data

  • Consider if some additional storage volume and duplication is acceptable in order to reduce the load on the data store

Implementation

Use asynchronous calls

Carry out performance profiling and load testing

Compress highly compressible data

Minimize the number of connections required

Send requests in batches to optimize network use

Avoid a requirement to store server-side session state where possible

Use lightweight frameworks and libraries

Patterns

Throttling

Control the consumption of resources used by an instance of an application, an individual tenant, or an entire service.
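One common way to implement throttling is a token bucket, sketched below: each request consumes a token, tokens refill at a fixed rate, and the bucket capacity bounds bursts. The rate and capacity values are illustrative:

```python
import time

class TokenBucket:
    """Allow at most `rate` operations per second on average,
    with bursts of up to `capacity` operations."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens earned since the last check, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False               # caller should reject or queue the request
```

Rejected requests can be failed with a "retry later" response or diverted to a queue, depending on the workload.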

Static Content Hosting

Deploy static content to a cloud-based storage service that can deliver it directly to the client.

Sharding

Divide a data store into a set of horizontal partitions or shards.
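A minimal sketch of hash-based shard routing: a stable hash of the partition key selects one shard, so the same key always lands on the same partition. The shard names are illustrative, and real systems often use consistent hashing to ease rebalancing:

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(key: str) -> str:
    """Map a partition key to a shard. hashlib gives a hash that is
    stable across processes, unlike Python's built-in hash()."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```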

Queue-Based Load Leveling

Use a queue that acts as a buffer between a task and a service that it invokes in order to smooth intermittent heavy loads.

Priority Queue

Prioritize requests sent to services so that requests with a higher priority are received and processed more quickly than those with a lower priority.
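A minimal in-process sketch using a binary heap; lower numbers mean higher priority, and a counter keeps dequeue order stable for equal priorities. A distributed system would instead use separate queues or a broker with native priority support:

```python
import heapq

class PriorityQueue:
    """Lower priority number = served first; ties keep insertion order."""

    def __init__(self):
        self._heap = []
        self._counter = 0          # tie-breaker for equal priorities

    def put(self, priority, item):
        heapq.heappush(self._heap, (priority, self._counter, item))
        self._counter += 1

    def get(self):
        return heapq.heappop(self._heap)[2]
```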

Materialized View

Generate prepopulated views over the data in one or more data stores when the data isn't ideally formatted for required query operations.

Index Table

Create indexes over the fields in data stores that are frequently referenced by queries.

Event Sourcing

Use an append-only store to record the full series of events that describe actions taken on data in a domain.
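A minimal event-sourcing sketch with a hypothetical account ledger: the append-only event log is the source of truth, and current state is derived by replaying it:

```python
def replay(events):
    """Derive current state by folding over the append-only event log;
    the log, not the computed balance, is the source of truth."""
    balance = 0
    for kind, amount in events:
        if kind == "deposited":
            balance += amount
        elif kind == "withdrew":
            balance -= amount
    return balance

# Illustrative account history: events are only ever appended, never edited.
ledger = [("deposited", 100), ("withdrew", 30), ("deposited", 5)]
```

Because the log is immutable, it doubles as an audit trail, and snapshots can cache the fold to avoid replaying from the beginning.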

CQRS

Segregate operations that read data from operations that update data by using separate interfaces

Choreography

Have each component of the system participate in the decision-making process about the workflow of a business transaction, instead of relying on a central point of control.

Cache Aside

Load data on demand into a cache from a data store.
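A minimal cache-aside sketch: reads check the cache first and populate it on a miss; writes update the store and invalidate the cached copy. The in-memory dicts stand in for a distributed cache and a database:

```python
cache = {}  # stand-in for a distributed cache such as Redis

def get_product(product_id, db):
    """Cache-aside read: serve from cache, loading from the store on a miss."""
    if product_id in cache:
        return cache[product_id]
    value = db[product_id]         # the authoritative data store
    cache[product_id] = value
    return value

def update_product(product_id, value, db):
    """On writes, update the store and invalidate the cached copy."""
    db[product_id] = value
    cache.pop(product_id, None)
```

Note the cache can serve stale data if the store is changed without invalidation, so entries usually also carry a time-to-live.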

Scalability
(ability of a system to handle increased load)

Application design

Always design stateless services

Always design for horizontal scaling

Prefer asynchronous inter-service communication, such as messages or events

Use caching where possible to avoid overloading the database

Always use a CDN for static resources

Infrastructure

Use Autoscaling to manage load increases and decreases

Plan for growth, add scale units

Every component in the infrastructure must be scalable

Database

DB Sharding

DB Replicas: one writer, multiple read replicas

DB Caching

Security

Principles

Plan security readiness
Strive to adopt and implement security practices in architectural design decisions and operations with minimal friction.

Design to protect confidentiality
Prevent exposure to privacy, regulatory, application, and proprietary information through access restrictions and obfuscation techniques.

Design to protect integrity
Prevent corruption of design, implementation, operations, and data to avoid disruptions that can stop the system from delivering its intended utility or cause it to operate outside the prescribed limits. The system should provide information assurance throughout the workload lifecycle.

Design to protect availability
Prevent or minimize system and workload downtime and degradation in the event of a security incident by using strong security controls. You must maintain data integrity during the incident and after the system recovers.

Sustain and evolve your security posture
Incorporate continuous improvement and apply vigilance to stay ahead of attackers who are continuously evolving their attack strategies.

Checklist

Establish a security baseline that's aligned to compliance requirements, industry standards, and platform recommendations. Regularly measure your workload architecture and operations against the baseline to sustain or improve your security posture over time.

Maintain a secure development lifecycle by using a hardened, mostly automated, and auditable software supply chain. Incorporate a secure design by using threat modeling to safeguard against security-defeating implementations.

Classify and consistently apply sensitivity and information type labels on all workload data and systems involved in data processing.

Create intentional segmentation and perimeters in your architecture design and in the workload's footprint on the platform. The segmentation strategy must include networks, roles and responsibilities, workload identities, and resource organization.

Implement strict, conditional, and auditable identity and access management (IAM) across all workload users, team members, and system components.

Isolate, filter, and control network traffic across both ingress and egress flows.

Encrypt data by using modern, industry-standard methods to guard confidentiality and integrity.

Protect application secrets by hardening their storage and restricting access and manipulation and by auditing those actions.

Implement a holistic monitoring strategy that relies on modern threat detection mechanisms.

Establish a comprehensive testing regimen that combines approaches to prevent security issues, validate threat prevention implementations, and test threat detection mechanisms.

Define and test effective incident response procedures.

Tradeoffs

Security tradeoffs with Reliability

Tradeoff: Increased complexity

Tradeoff: Increased critical dependencies

Tradeoff: Increased complexity of disaster recovery

Tradeoff: Increased rate of change

Security tradeoffs with Cost Optimization

Tradeoff: Additional infrastructure

Tradeoff: Increased demand on infrastructure

Tradeoff: Increased process and operational costs

Security tradeoffs with Operational Excellence

Tradeoff: Complications in observability and serviceability

Tradeoff: Decreased agility and increased complexity

Tradeoff: Increased coordination efforts

Security tradeoffs with Performance Efficiency

Tradeoff: Increased latency and overhead

Tradeoff: Increased chance of misconfiguration

Operational Excellence

Principles

Embrace DevOps culture

Establish development standards

Evolve operations with observability

Deploy with confidence
Reach the desired state of deployment with predictability.

Automate for efficiency

Adopt safe deployment practices
Implement guardrails in the deployment process to minimize the effect of errors or unexpected conditions.