Site Reliability Engineering
SRE is a way of identifying system weaknesses, testing production environments, and solving those issues before they become major incidents. SRE, as part of a DevOps-focused team, improves the reliability of technical services through deeper collaboration and proactive optimization of redundancy, monitoring, and alerting practices.
We want systems that are automatic, not just automated.
"Hope is not a strategy" -
SRE@Google
unofficial motto
Architecture
Microservices
Microservice per endpoint
High availability (end user requests)
Low availability (archiving)
Functional lane diagram
Happy flows
Unhappy flows
Infra diagram
Components
CIs
Firewalls
Proxies
Load balancing
State (Active/Cold/Hot/Standby)
Certificates
API Calls
Connection Types
Outgoing Ports
Incoming Ports
Authentication Types
Feature toggles (switches)
Web Interface
Authorisation
Authentication
Configuration file
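A minimal sketch of the feature-toggle idea above, read from a configuration file (the file format and toggle names are assumptions for illustration, not the actual web-interface-backed implementation):

```python
import json

# Toggle states as they might appear in a JSON configuration file
# (names are illustrative only).
CONFIG = json.loads('{"new_checkout_flow": true, "legacy_archiving": false}')

def is_enabled(name: str, default: bool = False) -> bool:
    """Return the state of a single feature toggle, with a safe default."""
    return bool(CONFIG.get(name, default))

if __name__ == "__main__":
    if is_enabled("new_checkout_flow"):
        print("new checkout flow active")   # new code path
    else:
        print("falling back to old flow")   # old code path
```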
Touch Point Architecture
Merak
API Marketplace
Finagle (networking library)
API Marketplace
RIAF (BE/NL)
Migration to Merak
Configuration Mngt
Service Offering
Support Offering
Business applications
DTAP
Config Item (CI)
Assignment Groups
Infrastructure Portal
portal.ing.net
Function description:
What is the main purpose of the system/appl?
What are the priority functions of the system/appl?
Cost
Divide your infra depending on latency requirements
low-latency cluster
high throughput clusters
different service guarantee per service
quantities of resources per cluster
degree of redundancy of the cluster
geographical provisioning of the cluster
criticality of the cluster
infra software configuration
(f.ex. less memory)
Testing
Test Container Platform
Global Selenium Grid
Test Data Mngt
Data Masking
Chaos Engineering
Chaos as a Service
Performance Testing
Load testing
(Volume estimates, capacity, and performance)
Determines the number of resources required. Determine the maximum load on each component separately (the maximum number of queries per component). In combination with performance testing, determine the peak and maximum load.
number of instances required = (total max load / max queries per component) + 2 (to cover instance failure during an update); see the sizing sketch after this list
Storage capacity
HTTP traffic and bandwidth estimates, launch "spike", traffic mix, 6 months out
Capacity per datacenter at max latency
cfr. risk management for the organic and inorganic demand
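As referenced above, a minimal sizing sketch for the instance-count rule (the numbers and the headroom of 2 are illustrative):

```python
import math

def required_instances(total_peak_qps: float, max_qps_per_instance: float,
                       failure_headroom: int = 2) -> int:
    """Instances needed to serve the peak load, plus headroom for
    instances that are down or draining during a rolling update."""
    return math.ceil(total_peak_qps / max_qps_per_instance) + failure_headroom

# Example: 12,000 QPS at peak, each instance handles at most 1,500 QPS
print(required_instances(12_000, 1_500))  # -> 8 + 2 = 10 instances
```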
End-to-end testing
PMD Performance Container
Test-driven development (TDD)
Behavior-Driven development (BDD)
Test Strategy
Happy Flows
Unhappy Flows
Coverage percentage
A/B testing
Test Automation
enhances delivery speed
Early detection
After each commit/build
enhances testing accuracy
enhances quality
significantly decreases costs
Regression testing
Confidence checking
Use Docker containers to run automated confidence checks during deployments to production. It is the only way to safely deploy these applications to production.
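A minimal sketch of what one such confidence check could look like (the endpoint and expected response are illustrative; in practice the checks run inside Docker containers in the deployment pipeline):

```python
import sys
import urllib.request

# Each entry pairs a URL to probe with a string expected in the response body.
CHECKS = [
    ("https://service.example.internal/health", "OK"),
]

def run_checks() -> bool:
    for url, expected in CHECKS:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                body = resp.read().decode()
                status = resp.status
        except OSError as exc:
            print(f"FAIL {url}: {exc}")
            return False
        if status != 200 or expected not in body:
            print(f"FAIL {url}: status={status}")
            return False
        print(f"PASS {url}")
    return True

if __name__ == "__main__":
    sys.exit(0 if run_checks() else 1)  # non-zero exit blocks the rollout
```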
Penetration Testing
Risk Mngt
Rather than simply maximizing uptime, SRE seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness - with features, service, and performance - is optimized
OCD green
SDT-tool
RCEC/RCIC
pwd in PWVault
SEM-I/SEM-A
Certificate Management
Certificates
Automated deployment/implementation
Mutual Authentication
Static Code scan
Fortify > Checkmarx
100% green
integrated in all branches
Development and production versions
no separate version in Fortify
Reworked until no comments remain
OWASP
Always green
Sonar
integrated and automated
100% green
Demand forecasting &
Capacity planning
Organic demand forecast
natural product adoption and usage by customers
Incorporate inorganic demand
results from events like feature launches, marketing campaigns or business-driven changes
Load testing (set the organic and inorganic threshold)
Data Leakage
Availability
SREs should be people who hate manual work and who have the skill set necessary to automate said work through engineering.
Availability is measured primarily at the side of the customer.
Maximum change velocity
In general, for any software service or system, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and 99.999% available.
Cost and complexity increase (the minimum increase in complexity is a factor of 5) the closer you get to 100%
Service error budget
one minus the availability target. A service that is 99.99% available is 0.01% unavailable. That permitted 0.01% unavailability is the service’s error budget.
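A small worked example of the error-budget arithmetic above (the 30-day period is an assumption; use whatever window the SLO is agreed over):

```python
def error_budget_minutes(availability_target: float, period_days: int = 30) -> float:
    """Allowed downtime (in minutes) for a given availability target."""
    return (1.0 - availability_target) * period_days * 24 * 60

print(error_budget_minutes(0.9999))  # 99.99% over 30 days -> ~4.32 minutes
print(error_budget_minutes(0.999))   # 99.9%  over 30 days -> ~43.2 minutes
```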
Availability target
What percentage of availability needs to be achieved?
Note that 100% availability is impossible, since all the components between the client and the service do not reach 100% either (f.ex. laptop, network, firewall, provider, ...).
Service Level Indicator (SLI)
A quantitative measure of some aspect of the level of service that is provided, like latency, error rate, availability, etc.
A Service Level Indicator is measured at any moment in time using percentiles on...
What is the SLI defined for the team to reach?
User-facing serving systems
Availability (# of requests succeeded)
99.95% is a good start
Request Latency (duration of a request)
System throughput (requests per second)
Failure per request (type of failure/severity)
Error rate (# requests failed / # requests received)
Storage systems
Availability (# of requests succeeded)
99.95% is a good start
Request Latency (time on read/write data)
Durability (data there when we need it)
Big data systems
Throughput (how much data is processed)
End-to-end latency (how long it takes data to progress from ingestion to completion)
Correctness (the right answer returned, the right data retrieved, the right analysis done)
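A minimal sketch of computing two common SLIs over a measurement window using percentiles, as described above (the sample data and the nearest-rank percentile are illustrative):

```python
def percentile(values, p):
    """Nearest-rank percentile, good enough for a sketch."""
    ordered = sorted(values)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

window = [  # (latency in ms, request succeeded?)
    (120, True), (95, True), (480, False), (210, True),
    (88, True), (1350, True), (110, True), (70, True),
]

availability_sli = sum(ok for _, ok in window) / len(window)
latency_p95_sli = percentile([ms for ms, _ in window], 95)

print(f"availability: {availability_sli:.2%}")  # share of requests that succeeded
print(f"p95 latency:  {latency_p95_sli} ms")    # 95th-percentile request duration
```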
Service Level Objective (SLO)
A binding target level of availability for a collection of SLIs at a given service level. Service Level Objectives (f.ex. uptime) are set on a quarterly or yearly basis and agreed with Product Owners. SLOs do not have to be beaten, but they do have to be rightly defined!
(lower bound <= SLI <= upper bound)
If the SLI is higher than the SLO, you slow down the release of features.
If the SLI is lower than the SLO, you break the expectations your customers are used to and they will stop using your services.
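A minimal sketch of the bound check described above (the numbers are illustrative; the action taken when the SLI falls outside the bounds follows the rules above):

```python
def within_slo(sli: float, lower: float, upper: float) -> bool:
    """True when the measured SLI sits inside the agreed SLO bounds."""
    return lower <= sli <= upper

# Example: availability SLO bounds agreed with the Product Owner
print(within_slo(sli=0.9992, lower=0.999, upper=0.9999))  # True: no action needed
print(within_slo(sli=0.9985, lower=0.999, upper=0.9999))  # False: act per the rules above
```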
SLO examples
Business/Functional target
Per component defined
Per service defined
HTTP QPS (queries per second)
Latency (duration of a query)
99.99% vs 100%
SLO Strategy: selecting SLOs
Don't pick a target based on current performance
Keep it simple
Avoid absolutes
do not try to scale "infinitely" or be "always" available
Have as few SLOs as possible
Perfection can wait
SLO expectations
Keep a safety margin
Don't overachieve
Service Level Agreement (SLA)
An explicit or implicit contract with users that includes consequences of meeting, or missing, the SLOs it contains
For which period and over the services?
Technical target
Emergency response
Mean Time To Repair (MTTR)
The most relevant metric for evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health
3x improvement in MTTR with the use of playbooks and automation
Mean Time To Failure (MTTF)
Lightweight health check on server
(gRPC)
Every server has an HTTP server on a specific port that provides diagnostics and statistics for a given task
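A minimal sketch of such a diagnostics endpoint, here over plain HTTP (the port and payload are illustrative; a gRPC health check would fill the same role in practice):

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the diagnostics path is served; everything else is a 404.
        if self.path != "/healthz":
            self.send_error(404)
            return
        stats = {"status": "ok", "uptime_s": round(time.time() - START_TIME, 1)}
        body = json.dumps(stats).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Served on its own port next to the main task.
    HTTPServer(("0.0.0.0", 8081), HealthHandler).serve_forever()
```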
Standby policy
Active load balancing
Multiple hosts are available and the load is distributed dynamically
Spot on Spin Up
Containers
Microservices
Kubernetes
Functions
VM's
Hot Standby
Automated switch
Criteria defined and implemented
alerting of switching
Cold standby
Functions
Serverless (0 to N instances)
Manual switch policy and strategy
Support instructions and technical documentation
Master Control Room (MCR)
MCR Confluence
Help Desk
On Call
Squad dedicated information
Confluence
Mattermost
Runbook
Probing
Dummy injection and looping: dummy data is injected every minute and must return the expected result; if the expected result is not received, an alert is created depending on the severity of the service
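A minimal probe sketch for the dummy-injection loop described above (the endpoint, expected payload, and alert hook are illustrative):

```python
import time
import urllib.request

PROBE_URL = "https://service.example.internal/probe"
EXPECTED = b"DUMMY-OK"

def send_alert(message: str) -> None:
    print(f"ALERT: {message}")  # in practice: page/notify via the alerting stack

def probe_once() -> None:
    try:
        with urllib.request.urlopen(PROBE_URL, timeout=10) as resp:
            body = resp.read()
        if body != EXPECTED:
            send_alert(f"unexpected probe result: {body[:80]!r}")
    except OSError as exc:
        send_alert(f"probe failed: {exc}")

if __name__ == "__main__":
    while True:
        probe_once()
        time.sleep(60)  # one dummy injection per minute
```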
Reliability
Defining and measuring service levels can help to determine if your system is reliable and healthy. They also help in understanding customer needs and setting expectations.
Disaster Recovery Plan (DRP) Compliance
DRaaS (Disaster Recovery as a Service)
Fully automated/scripted
Entering DRP mode
Returning to normal
Switching from DC to DC
Documented
Versioned
Failover
Load balancing
Load Balancing strategy
Active/Passive
Switching Strategy defined
Session tracking
Stateful
Stateless
Active/Active
Make systems idempotent to withstand retry storms
Self-Healing system
What happens when a machine dies, a rack fails, or a cluster goes offline?
What happens when the network fails between two datacenters?
For each type of server that talks to other servers (its backends)
How to detect when backends die, and what to do when they die
How to terminate or restart without affecting clients or users
Rate limiting
Error handling
internally
externally
Backup/restore
procedure described
procedure tested and verified regularly
Frequency of test determined?
restore procedure tested on cold standby
restore procedure tested and integrated in operational mode
Automated?
Strategy?
Incident Response
alerting
Signifies that an action needs to be taken immediately in response to something that is either happening (reactive > too late) or about to happen (proactive > good)
alert fatigue
When exposed to a large number of alerts, the people alerted become desensitized and no longer react within the expected response times, or even miss important alerts.
alerting strategy
pro-active monitoring
Artificial intelligence (LOOM)
response monitoring
records (anomalies without impact)
pages (immediate response)
audible alerting
texting
audible signal (ping or robot)
Notifications (awareness needed)
visual alerting (like Grafana)
Incident Mngt (SNOW)
Post-Mortem
Postmortems should be written for all significant incidents, regardless of whether or not they paged; postmortems for incidents that did not trigger a page are even more valuable, as they likely point to clear monitoring gaps. This investigation should establish what happened in detail, find the root causes of the event, and assign actions to correct the problem or improve how it is addressed next time. SRE teams should operate under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.
EMIR/MIM
A template containing the following items
Description of problem statement and actions taken in timeline format
Root cause analysis
Durations (incident occurred, major impact since, incident status acquired, end of impact, impact duration)
Incident resolution (quick mitigation, workaround, structural solution, the involved parties,...)
Findings & Recoomendations/Solutions
Timeframe
As soon as possible after the incident has been resolved
blameless
The postmortem should focus on contributing causes, not on blaming individuals
Follow-Up
Solution should be part of the backlog and solved within the code baseline and lead to adapted tests
Automation
Humans add latency. A system that can avoid emergencies that require human intervention will have higher availability.
cdportal.ing.net / theforge.ing.net
Continuous integration (CI)
Branch Strategy
Branch per feature, impediment, improvement related to the SNOW tickets
Release branches (master branch only used for deployment into the various environments)
Version branches (merge of feature branches with incidents branch, impediment branch, etc.)
Version Strategy
Frequency of building, testing, scanning
Build is automated
Mavenised
Major version
Minor version
Update version
Continuous Deployment
Deployment Pipelining
Makes deployments easier, lowers the delivery cycle time, and reduces toil
Deployment Pipelines
IPC
Azure DevOps/Portal
Fruitloops (Front-End)
One Pipeline (TFS)
Deployment strategy
Config as code
(config of the application has a separate release lifecycle, version strategy, repository and branching)
Deploy as code
(deployment of artifacts is separated from the config, security, environment and policy/security pipeline)
Policy as code
(the security of an application is separated from the infra as code, config as code, deploy as code)
Infrastructure as Code / Config of Environment (Ansible)
roles: the role that a machine takes
playbooks: construct the operational world of the app
Change Management
Strategy
Progressive rollouts
Automated validation
Automated rollback
See load balancing strategy
Massive rollout (Big Bang)
downtime required; service windows defined
see switch toggling
DCAB/PO Approvals
(from whom do we need approvals per environment?)
Release Management
Each time a developer submits a change, tests run on all software that may depend on that list of changes, either directly or indirectly. If the framework determines that the change likely broke other parts of the system, it notifies the owner of the submitted change. Some projects use a push-on-green system, where a new version is automatically pushed to production after passing the tests.
Release Policy
Monthly releases (2x/yr 100% technical releases)
Daily/Weekly releases (1xmnth 100% technical release)
Release Calendar
External events
Release Window?
(Ideal days and hours)
Canary releases
+ automated rollback:
Test a new release on a small subset of a typical workload
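A minimal sketch of a deterministic canary split plus a simple automated-rollback trigger (the percentage and threshold are illustrative):

```python
import zlib

CANARY_PERCENT = 5

def routes_to_canary(user_id: str) -> bool:
    """Stable hash so the same user always lands on the same version."""
    return zlib.crc32(user_id.encode()) % 100 < CANARY_PERCENT

def should_roll_back(canary_error_rate: float, baseline_error_rate: float,
                     tolerance: float = 0.01) -> bool:
    """Roll back when the canary clearly does worse than the baseline."""
    return canary_error_rate > baseline_error_rate + tolerance

print(routes_to_canary("customer-42"))
print(should_roll_back(canary_error_rate=0.035, baseline_error_rate=0.004))  # True
```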
Release Notes
Linked to features, incidents, problems, improvements
Linked to change management
(ServiceNow)
Automated validation checks (iValidate)
ServiceNow
Automated Evidence Delivery
Published and communicated to stakeholders (The Forge)
Problem Mngt
Reliability Toolkit (RTK)
On IPC only!
Prometheus
Alert Manager
Grafana
Model Builder
Ansible Controller
Toil removal
Manual toil
hands-on time a human spends on a task
Repetitive toil
Work that you need to do over and over again.
Not to be confused with overhead like OCD
Automatable toil
a task that could be designed away or handed over to a machine
Tactical toil
tasks that are interruptive and reactive
No endurance value toil
Tasks that do not change the state of your application
Service growth toil
tasks that grow linearly with traffic or service size
Scalability
Containerization
Load shifting
Provisioning
(riskier than load shifting)
Provisioning combines both change management and capacity planning. Provisioning must be conducted quickly and only when necessary.
comprehensive monitoring
Monitoring is one of the primary means by which service owners keep track of a system's health and availability. If you start monitoring, start from the user perspective; otherwise you might be fixing problems that are not impacting users.
If you can’t monitor a service, you don’t know what’s happening, and if you’re blind to what’s happening, you can’t be reliable
when do I know when my system is healthy?
when do I know when my system is not healthy?
each incident should be reflected in the monitoring
Monitor only the top layer and the services that directly impact the customer
White box monitoring
What is broken? Monitoring based on metrics exposed by the internals of your system (e.g. usage, http request totals). Typically covered by metric, log and distributed trace monitoring.
metric monitoring
Discover what went wrong by retrieving metrics about your application via the Four Golden Signals: traffic, saturation, latency, and errors (see the instrumentation sketch below).
Four Golden Signals
Traffic
Saturation
Latency
Errors
Type of errors
Severity of errors
Number of errors
Prometheus
Graphite
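A minimal sketch of instrumenting the Four Golden Signals with the Python prometheus_client library, as referenced above (the metric names and the simulated work are illustrative):

```python
import random
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Traffic: total requests received")
ERRORS = Counter("app_errors_total", "Errors: failed requests", ["type"])
LATENCY = Histogram("app_request_latency_seconds", "Latency: request duration")
IN_FLIGHT = Gauge("app_in_flight_requests", "Saturation: requests in progress")

def handle_request() -> None:
    REQUESTS.inc()
    IN_FLIGHT.inc()
    start = time.time()
    try:
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        if random.random() < 0.05:
            raise RuntimeError("backend timeout")
    except RuntimeError:
        ERRORS.labels(type="backend").inc()
    finally:
        LATENCY.observe(time.time() - start)
        IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes /metrics on this port
    while True:
        handle_request()
```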
Dashboarding (Grafana)
For each graph
Threshold (why)
When do we need to start acting before it degrades? Be aware that monitoring is always reactive and has a delay. Wrong thresholds will leave your customers out in the cold before you get a notification
What is expected normal behavior?
What is the impact of abnormal behavior?
What is the graph measuring?
log monitoring
Tries to find an answer on how to fix the issue; used for deep-dive diagnostics.
Deep-dive diagnostics
ELK
Logstash
Dashboard (Kibana)
ElasticSearch (search engine)
Trace of a happy flow for a specific request from entry till exit
Trace of an unhappy flow for a specific request from entry till exit
Distributed trace monitoring
Allows you to follow a message through a chain of applications and figure out where something broke
TraceING
eReporter
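A minimal sketch of the core idea behind distributed tracing: every hop forwards the same trace ID so the chain can be reconstructed afterwards (the header name and downstream services are illustrative; TraceING/eReporter provide this in practice):

```python
import uuid

TRACE_HEADER = "X-Trace-Id"

def incoming_request(headers: dict) -> dict:
    """Reuse the caller's trace ID, or start a new trace at the edge."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return {TRACE_HEADER: trace_id}

def call_downstream(service: str, trace_headers: dict) -> None:
    # The trace ID travels with every outgoing call and every log line,
    # so the full chain can be stitched together afterwards.
    print(f"[trace={trace_headers[TRACE_HEADER]}] calling {service}")

if __name__ == "__main__":
    ctx = incoming_request({})          # edge: no trace ID yet, create one
    call_downstream("payments-api", ctx)
    call_downstream("ledger-api", ctx)  # same trace ID on every hop
```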
Black box monitoring
Is my system working from a customer perspective? Monitoring based on extremely visible behavior (e.g. timeouts or delays, 404 pages). Typically achieved via synthetic or real user monitoring.
synthetic monitoring
Replicate customer requests to generate load.
(WTSS-BE / Rigor-NL)
DynaTrace
real-user monitoring
Listening in on real customer traffic
Organisation
Backlog Mngt
Impediments List
Features List
Agile
QBR defined
Epics
Features
Stories
Sprint (scrum)
Velocity
Kanban (Ops)
MVP overview
Releases
Strategy
80% functionality
20% technicality
(improvements, incidents, problems - everything that keeps your risk mngt under control)
Team
ACE/CDPortal registered
ace.ing.net / cdportal.ing.net
All members up-to-date
Dev & Ops Eng.
Product Owner / Team Owner
Delegates
SRE Engineer
SREs should be people who hate manual work and who have the skill set necessary to automate said work through engineering.
External stakeholders
Internal clients
(ACE teams, spocs, usage, contact info, etc.)
SBO
MCR
Training
Documentation available
Training Plan available
Aim and purpose of the team
In general, an SRE team is responsible for the availability, latency (~delays), performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
Eliminating toil is one of SRE’s most important tasks. We define toil as mundane, repetitive operational work providing no enduring value, which scales linearly with service growth.
Simplicity is a key principle of any effective software engineering, not only reliability-oriented engineering; it is a quality that, once lost, can be extraordinarily difficult to recapture. Nevertheless, a complex system always evolves from a simple system that worked.