Site Reliability Engineering
SRE is a way of identifying system weaknesses, testing production environments, and solving those issues before they become major incidents. SRE, as part of a DevOps-focused team, improves the reliability of technical services through deeper collaboration and proactive optimization of redundancy, monitoring, and alerting practices.
We want systems that are automatic, not just automated.
"Hope is not a strategy" -
SRE@Google
unofficial motto
Architecture
Microservices
Microservice per endpoint
High availability (end user requests)
Low availability (archiving)
Functional lane diagram
Happy flows
Unhappy flows
Infra diagram
Components
CIs
Firewalls
Proxies
Load balancing
State (Active/Cold/Hot/Standby)
Certificates
API Calls
Connection Types
Outgoing Ports
Incoming Ports
Authentication Types
Feature toggles (switches)
Web Interface
Authorisation
Authentication
Configuration file
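A minimal sketch of the feature-toggle idea above, read from a configuration file (the file format and toggle names are assumptions for illustration, not the actual web-interface-backed implementation):

```python
import json

# Toggle states as they might appear in a JSON configuration file
# (names are illustrative only).
CONFIG = json.loads('{"new_checkout_flow": true, "legacy_archiving": false}')

def is_enabled(name: str, default: bool = False) -> bool:
    """Return the state of a single feature toggle, with a safe default."""
    return bool(CONFIG.get(name, default))

if __name__ == "__main__":
    if is_enabled("new_checkout_flow"):
        print("new checkout flow active")   # new code path
    else:
        print("falling back to old flow")   # old code path
```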
Touch Point Architecture
Merak
API Marketplace
Finagle (networking library)
API Marketplace
RIAF (BE/NL)
Migration to Merak
Configuration Mngt
Service Offering
Support Offering
Business applications
DTAP
Config Item (CI)
Assignment Groups
Infrastructure Portal
portal.ing.net
Function description:
What is the main purpose of the system/appl?
What are the priority functions of the system/appl?
Cost
Divide your infra depending on latency requirements
low-latency cluster
high throughput clusters
different service guarantee per service
quantities of resources per cluster
degree of redundancy of the cluster
geographical provisioning of the cluster
criticality of the cluster
infra software configuration
(f.ex. less memory)
Testing
Test Container Platform
Global Selenium Grid
Test Data Mngt
Data Masking
Chaos Engineering
Chaos as a Service
Performance Testing
Load testing
(Volume estimates, capacity, and performance)
Determines the number of resources required. Determine the maximum load on each component separately (the maximum number of queries per component). In combination with performance testing, determine the peak and maximum load.
number of instances required = (total max load / max queries per component) + 2 (to cover instance failure during an update); see the sizing sketch after this list
Storage capacity
HTTP traffic and bandwidth estimates, launch "spike", traffic mix, 6 months out
Capacity per datacenter at max latency
cfr. risk management for the organic and inorganic demand
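As referenced above, a minimal sizing sketch for the instance-count rule (the numbers and the headroom of 2 are illustrative):

```python
import math

def required_instances(total_peak_qps: float, max_qps_per_instance: float,
                       failure_headroom: int = 2) -> int:
    """Instances needed to serve the peak load, plus headroom for
    instances that are down or draining during a rolling update."""
    return math.ceil(total_peak_qps / max_qps_per_instance) + failure_headroom

# Example: 12,000 QPS at peak, each instance handles at most 1,500 QPS
print(required_instances(12_000, 1_500))  # -> 8 + 2 = 10 instances
```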
End-to-end testing
PMD Performance Container
Test-driven development (TDD)
Behavior-Driven development (BDD)
Test Strategy
Happy Flows
Unhappy Flows
Coverage percentage
A/B testing
Test Automation
enhances delivery speed
Early detection
After each commit/build
enhances testing accuracy
enhances quality
significantly decreases costs
Regression testing
Confidence checking
Use Docker containers to run automated confidence checks during deployments to production. It is the only way to safely deploy these applications to production.
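A minimal sketch of what one such confidence check could look like (the endpoint and expected response are illustrative; in practice the checks run inside Docker containers in the deployment pipeline):

```python
import sys
import urllib.request

# Each entry pairs a URL to probe with a string expected in the response body.
CHECKS = [
    ("https://service.example.internal/health", "OK"),
]

def run_checks() -> bool:
    for url, expected in CHECKS:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                body = resp.read().decode()
                status = resp.status
        except OSError as exc:
            print(f"FAIL {url}: {exc}")
            return False
        if status != 200 or expected not in body:
            print(f"FAIL {url}: status={status}")
            return False
        print(f"PASS {url}")
    return True

if __name__ == "__main__":
    sys.exit(0 if run_checks() else 1)  # non-zero exit blocks the rollout
```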
Penetration Testing
Risk Mngt
Rather than simply maximizing uptime, SRE seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness - with features, service, and performance - is optimized
OCD green
SDT-tool
RCEC/RCIC
pwd in PWVault
SEM-I/SEM-A
Certificate Management
Certificates
Automated deployment/implementation
Mutual Authentication
Static Code scan
Fortify > Checkmarx
100% green
integrated in all branches
Development and production versions
no separate version in Fortify
Reworked until no comments remain
OWASP
Always green
Sonar
integrated and automated
100% green
Demand forecasting &
Capacity planning
Organic demand forecast
natural product adoption and usage by customers
Incorporate inorganic demand
results from events like feature launches, marketing campaigns or business-driven changes
Load testing (set the organic and inorganic threshold)
Data Leakage
Availability
SREs should be people who hate manual work and who have the skill set necessary to automate said work through engineering.
Availability is measured primarily at the side of the customer.
Maximum change velocity
In general, for any software service or system, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and 99.999% available.
Cost and complexity increase (the minimum increase in complexity is a factor of 5) the closer you get to 100%
Service error budget
one minus the availability target. A service that is 99.99% available is 0.01% unavailable. That permitted 0.01% unavailability is the service’s error budget.
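A small worked example of the error-budget arithmetic above (the 30-day period is an assumption; use whatever window the SLO is agreed over):

```python
def error_budget_minutes(availability_target: float, period_days: int = 30) -> float:
    """Allowed downtime (in minutes) for a given availability target."""
    return (1.0 - availability_target) * period_days * 24 * 60

print(error_budget_minutes(0.9999))  # 99.99% over 30 days -> ~4.32 minutes
print(error_budget_minutes(0.999))   # 99.9%  over 30 days -> ~43.2 minutes
```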
Availability target
What percentage of availability needs to be achieved?
Note that 100% availability is impossible, since all the components between the client and the service do not reach 100% either (f.ex. laptop, network, firewall, provider, ...).
Service Level Indicator (SLI)
A quantitative measure of some aspect of the level of service that is provided, like latency, error rate, availability, etc.
A Service Level Indicator is measured at any moment in time using percentiles on...
What is the SLI defined for the team to reach?
User-facing serving systems
Availability (# of requests succeeded)
99.95% is a good start
Request Latency (duration of a request)
System throughput (requests per second)
Failure per request (type of failure/severity)
Error rate (# requests failed / # requests received)
Storage systems
Availability (# of requests succeeded)
99.95% is a good start
Request Latency (time on read/write data)
Durability (data there when we need it)
Big data systems
Throughput (how much data is processed)
End-to-end latency (how long it takes data to progress from ingestion to completion)
Correctness (the right answer returned, the right data retrieved, the right analysis done)
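A minimal sketch of computing two common SLIs over a measurement window using percentiles, as described above (the sample data and the nearest-rank percentile are illustrative):

```python
def percentile(values, p):
    """Nearest-rank percentile, good enough for a sketch."""
    ordered = sorted(values)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

window = [  # (latency in ms, request succeeded?)
    (120, True), (95, True), (480, False), (210, True),
    (88, True), (1350, True), (110, True), (70, True),
]

availability_sli = sum(ok for _, ok in window) / len(window)
latency_p95_sli = percentile([ms for ms, _ in window], 95)

print(f"availability: {availability_sli:.2%}")  # share of requests that succeeded
print(f"p95 latency:  {latency_p95_sli} ms")    # 95th-percentile request duration
```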
Service Level Objective (SLO)
A binding target level of availability for a collection of SLIs at a given service level. Service Level Objectives (f.ex. uptime) are set on a quarterly or yearly basis and agreed with Product Owners. SLOs do not have to be beaten, but they do have to be rightly defined!
(lower bound <= SLI <= upper bound)
If the SLI is higher than the SLO, you slow down the release of features.
If the SLI is lower than the SLO, you break the expectations your customers are used to and they will stop using your services.
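A minimal sketch of the bound check described above (the numbers are illustrative; the action taken when the SLI falls outside the bounds follows the rules above):

```python
def within_slo(sli: float, lower: float, upper: float) -> bool:
    """True when the measured SLI sits inside the agreed SLO bounds."""
    return lower <= sli <= upper

# Example: availability SLO bounds agreed with the Product Owner
print(within_slo(sli=0.9992, lower=0.999, upper=0.9999))  # True: no action needed
print(within_slo(sli=0.9985, lower=0.999, upper=0.9999))  # False: act per the rules above
```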
SLO examples
Business/Functional target
Per component defined
Per service defined
HTTP QPS (queries per second)
Latency (duration of a query)
99.99% vs 100%
SLO Strategy: selecting SLOs
Don't pick a target based on current performance
Keep it simple
Avoid absolutes
do not try to scale "infinitely" or be "always" available
Have as few SLOs as possible
Perfection can wait
SLO expectations
Keep a safety margin
Don't overachieve
Service Level Agreement (SLA)
An explicit or implicit contract with users that includes consequences of meeting, or missing, the SLOs it contains
For which period and over the services?
Technical target
Emergency response
Mean Time To Repair (MTTR)
The most relevant metric for evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health
3x improvement in MTTR with the use of playbooks and automation
Mean Time To Failure (MTTF)
Lightweight health check on server
(gRPC)
Every server has an HTTP server on a specific port that provides diagnostics and statistics for a given task
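A minimal sketch of such a diagnostics endpoint, here over plain HTTP (the port and payload are illustrative; a gRPC health check would fill the same role in practice):

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the diagnostics path is served; everything else is a 404.
        if self.path != "/healthz":
            self.send_error(404)
            return
        stats = {"status": "ok", "uptime_s": round(time.time() - START_TIME, 1)}
        body = json.dumps(stats).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Served on its own port next to the main task.
    HTTPServer(("0.0.0.0", 8081), HealthHandler).serve_forever()
```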
Standby policy
Active load balancing
Multiple hosts are available and the load is distributed dynamically
Spot on Spin Up
Containers
Microservices
Kubernetes
Functions
VM's
Hot Standby
Automated switch
Criteria defined and implemented
alerting of switching
Cold standby
Functions
Serverless (0 to N instances)
Manual switch policy and strategy
Support instructions and technical documentation
Master Control Room (MCR)
MCR Confluence
Help Desk
On Call
Squad dedicated information
Confluence
Mattermost
Runbook
Probing
Dummy injection and looping: dummy data is injected every minute and must return the expected result; if the expected result is not received, an alert is created depending on the severity of the service
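A minimal probe sketch for the dummy-injection loop described above (the endpoint, expected payload, and alert hook are illustrative):

```python
import time
import urllib.request

PROBE_URL = "https://service.example.internal/probe"
EXPECTED = b"DUMMY-OK"

def send_alert(message: str) -> None:
    print(f"ALERT: {message}")  # in practice: page/notify via the alerting stack

def probe_once() -> None:
    try:
        with urllib.request.urlopen(PROBE_URL, timeout=10) as resp:
            body = resp.read()
        if body != EXPECTED:
            send_alert(f"unexpected probe result: {body[:80]!r}")
    except OSError as exc:
        send_alert(f"probe failed: {exc}")

if __name__ == "__main__":
    while True:
        probe_once()
        time.sleep(60)  # one dummy injection per minute
```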
Reliability
Defining and measuring service levels can help to determine if your system is reliable and healthy. They also help in understanding customer needs and setting expectations.
Disaster Recovery Plan (DRP) Compliance
DRaaS (Disaster Recovery as a Service)
Fully automated/scripted
Entering DRP mode
Returning to normal
Switching from DC to DC
Documented
Versioned
Failover
Load balancing
Load Balancing strategy
Active/Passive
Switching Strategy defined
Session tracking
Stateful
Stateless
Active/Active
Make systems idempotent to withstand retry storms
Self-Healing system
What happens when a machine dies, a rack fails, or a cluster goes offline?
What happens when the network fails between two datacenters?
For each type of server that talks to other servers (its backends)
How to detect when backends die, and what to do when they die
How to terminate or restart without affecting clients or users
Rate limiting
Error handling
internally
externally
Backup/restore
procedure described
procedure tested and verified regularly
Frequency of test determined?
restore procedure tested on cold standby
restore procedure tested and integrated in operational mode
Automated?
Strategy?
Incident Response
alerting
Signifies that an action needs to be taken immediately in response to something that is either happening (reactive > too late) or about to happen (proactive > good)
alert fatigue
When exposed to a large number of alerts, the people alerted become desensitized and no longer react within the expected response times, or even miss important alerts.
alerting strategy
pro-active monitoring
Artificial intelligence (LOOM)
response monitoring
records (anomalies without impact)
pages (immediate response)
audible alerting
texting
audible signal (ping or robot)
Notifications (awareness needed)
visual alerting (like Grafana)
Incident Mngt (SNOW)
Post-Mortem
Postmortems should be written for all significant incidents, regardless of whether or not they paged; postmortems for incidents that did not trigger a page are even more valuable, as they likely point to clear monitoring gaps. This investigation should establish what happened in detail, find the root causes of the event, and assign actions to correct the problem or improve how it is addressed next time. SRE teams should operate under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.
EMIR/MIM
A template containing the following items
Description of problem statement and actions taken in timeline format
Root cause analysis
Durations (incident occurred, major impact since, incident status acquired, end of impact, impact duration)
Incident resolution (quick mitigation, workaround, structural solution, the involved parties,...)
Findings & Recoomendations/Solutions
Timeframe
As soon as possible after the incident has been resolved
blameless
The postmortem should focus on contributing causes, not on blaming individuals
Follow-Up
Solution should be part of the backlog and solved within the code baseline and lead to adapted tests
Automation
Humans add latency. A system that can avoid emergencies that require human intervention will have higher availability.
cdportal.ing.net / theforge.ing.net
Continuous integration (CI)
Branch Strategy
Branch per feature, impediment, improvement related to the SNOW tickets
Release branches (master branch only used for deployment into the various environments)
Version branches (merge of feature branches with incidents branch, impediment branch, etc.)
Version Strategy
Frequency of building, testing, scanning
Build is automated
Mavenised
Major version
Minor version
Update version
Continuous Deployment
Deployment Pipelining
Makes deployments easier, lowers the delivery cycle time, and reduces toil
Deployment Pipelines
IPC
Azure DevOps/Portal
Fruitloops (Front-End)
One Pipeline (TFS)
Deployment strategy
Config as code
(config of the application has a separate release lifecycle, version strategy, repository and branching)
Deploy as code
(deployment of artifacts is separated from the config, security, environment and policy/security pipeline)
Policy as code
(the security of an application is separated from the infra as code, config as code, deploy as code)
Infrastructure as Code / Config of Environment (Ansible)
roles: the role that a machine takes
playbooks: construct the operational world of the app
Change Management
Strategy
Progressive rollouts
Automated validation
Automated rollback
See load balancing strategy
Massive rollout (Big Bang)
downtime required; service windows defined
see switch toggling
DCAB/PO Approvals
(from whom do we need approvals per environment?)
Release Management
Each time a developer submits a change, tests run on all software that may depend on that list of changes, either directly or indirectly. If the framework determines that the change likely broke other parts of the system, it notifies the owner of the submitted change. Some projects use a push-on-green system, where a new version is automatically pushed to production after passing the tests.
Release Policy
Monthly releases (2x/yr 100% technical releases)
Daily/Weekly releases (1xmnth 100% technical release)
Release Calendar
External events
Release Window?
(Ideal days and hours)
Canary releases
+ automated rollback:
Test a new release on a small subset of a typical workload
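A minimal sketch of a deterministic canary split plus a simple automated-rollback trigger (the percentage and threshold are illustrative):

```python
import zlib

CANARY_PERCENT = 5

def routes_to_canary(user_id: str) -> bool:
    """Stable hash so the same user always lands on the same version."""
    return zlib.crc32(user_id.encode()) % 100 < CANARY_PERCENT

def should_roll_back(canary_error_rate: float, baseline_error_rate: float,
                     tolerance: float = 0.01) -> bool:
    """Roll back when the canary clearly does worse than the baseline."""
    return canary_error_rate > baseline_error_rate + tolerance

print(routes_to_canary("customer-42"))
print(should_roll_back(canary_error_rate=0.035, baseline_error_rate=0.004))  # True
```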
Release Notes
Linked to features, incidents, problems, improvements
Linked to change management
(ServiceNow)
Automated validation checks (iValidate)
ServiceNow
Automated Evidence Delivery
Published and communicated to stakeholders (The Forge)
Problem Mngt
Reliability Toolkit (RTK)
On IPC only!
Prometheus
Alert Manager
Grafana
Model Builder
Ansible Controller
Toil removal
Manual toil
hands-on time a human spends on a task
Repetitive toil
Work that you need to do over and over again.
Not to be confused with overhead like OCD
Automatable toil
a task that could be designed away or handed over to a machine
Tactical toil
tasks that are interruptive and reactive
No endurance value toil
Tasks that do not change the state of your application
Service growth toil
tasks that grow linearly with traffic or service size
Scalability
Containerization
Load shifting
Provisioning
(riskier than load shifting)
Provisioning combines both change management and capacity planning. Provisioning must be conducted quickly and only when necessary.
comprehensive monitoring
Monitoring is one of the primary means by which service owners keep track of a system's health and availability. If you start monitoring, start from the user perspective; otherwise you might be fixing problems that are not impacting users.
If you can’t monitor a service, you don’t know what’s happening, and if you’re blind to what’s happening, you can’t be reliable
when do I know when my system is healthy?
when do I know when my system is not healthy?
each incident should be reflected in the monitoring
Monitor only the top layer and the services that directly impact the customer
White box monitoring
What is broken? Monitoring based on metrics exposed by the internals of your system (e.g. usage, http request totals). Typically covered by metric, log and distributed trace monitoring.
metric monitoring
Discover what went wrong by retrieving metrics about your application via the Four Golden Signals: traffic, saturation, latency, and errors (see the instrumentation sketch below).
Four Golden Signals
Traffic
Saturation
Latency
Errors
Type of errors
Severity of errors
Number of errors
Prometheus
Graphite
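A minimal sketch of instrumenting the Four Golden Signals with the Python prometheus_client library, as referenced above (the metric names and the simulated work are illustrative):

```python
import random
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Traffic: total requests received")
ERRORS = Counter("app_errors_total", "Errors: failed requests", ["type"])
LATENCY = Histogram("app_request_latency_seconds", "Latency: request duration")
IN_FLIGHT = Gauge("app_in_flight_requests", "Saturation: requests in progress")

def handle_request() -> None:
    REQUESTS.inc()
    IN_FLIGHT.inc()
    start = time.time()
    try:
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        if random.random() < 0.05:
            raise RuntimeError("backend timeout")
    except RuntimeError:
        ERRORS.labels(type="backend").inc()
    finally:
        LATENCY.observe(time.time() - start)
        IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes /metrics on this port
    while True:
        handle_request()
```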
Dashboarding (Grafana)
For each graph
Threshold (why)
When do we need to start acting before it degrades? Be aware that monitoring is always reactive and has a delay. Wrong thresholds will leave your customers out in the cold before you get a notification
What is expected normal behavior?
What is the impact of abnormal behavior?
What is the graph measuring?
log monitoring
Tries to find an answer on how to fix the issue; used for deep-dive diagnostics.
Deep-dive diagnostics
ELK
Logstash
Dashboard (Kibana)
ElasticSearch (search engine)
Trace of a happy flow for a specific request from entry till exit
Trace of an unhappy flow for a specific request from entry till exit
Distributed trace monitoring
Allows you to follow a message through a chain of applications and figure out where something broke
TraceING
eReporter
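A minimal sketch of the core idea behind distributed tracing: every hop forwards the same trace ID so the chain can be reconstructed afterwards (the header name and downstream services are illustrative; TraceING/eReporter provide this in practice):

```python
import uuid

TRACE_HEADER = "X-Trace-Id"

def incoming_request(headers: dict) -> dict:
    """Reuse the caller's trace ID, or start a new trace at the edge."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return {TRACE_HEADER: trace_id}

def call_downstream(service: str, trace_headers: dict) -> None:
    # The trace ID travels with every outgoing call and every log line,
    # so the full chain can be stitched together afterwards.
    print(f"[trace={trace_headers[TRACE_HEADER]}] calling {service}")

if __name__ == "__main__":
    ctx = incoming_request({})          # edge: no trace ID yet, create one
    call_downstream("payments-api", ctx)
    call_downstream("ledger-api", ctx)  # same trace ID on every hop
```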
Black box monitoring
Is my system working from a customer perspective? Monitoring based on extremely visible behavior (e.g. timeouts or delays, 404 pages). Typically achieved via synthetic or real user monitoring.
synthetic monitoring
Replicate customer requests to generate load.
(WTSS-BE / Rigor-NL)
DynaTrace
real-user monitoring
Listening in on real customer traffic
Organisation
Backlog Mngt
Impediments List
Features List
Agile
QBR defined
Epics
Features
Stories
Sprint (scrum)
Velocity
Kanban (Ops)
MVP overview
Releases
Strategy
80% functionality
20% technicality
(improvements, incidents, problems - everything that keeps your risk mngt under control)
Team
ACE/CDPortal registered
ace.ing.net / cdportal.ing.net
All members up-to-date
Dev & Ops Eng.
Product Owner / Team Owner
Delegates
SRE Engineer
SREs should be people who hate manual work and who have the skill set necessary to automate said work through engineering.
External stakeholders
Internal clients
(ACE teams, spocs, usage, contact info, etc.)
SBO
MCR
Training
Documentation available
Training Plan available
Aim and purpose of the team
In general, an SRE team is responsible for the availability, latency (~delays), performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
Eliminating toil is one of SRE’s most important tasks. We define toil as mundane, repetitive operational work providing no enduring value, which scales linearly with service growth.
Simplicity is a key principle of any effective software engineering, not only reliability-oriented engineering; it is a quality that, once lost, can be extraordinarily difficult to recapture. Nevertheless, a complex system always evolves from a simple system that worked.