"Designing Data-Intensive Applications" by Martin Kleppmann
Foundations of Data Systems
Properties
Reliability
System works correctly despite...
hardware faults
examples
data centre black-out
faulty SSD
faulty RAM
faulty HDD
MTTF = 10-50 years
network connectivity issue
machine becomes unavailable in AWS
machine requires a reboot following security update
measures
redundancy
severity: 10-25% of outages
software faults
examples
systematic error because of software bug
runaway process which uses up shared hardware resources (CPU, memory, disk, bandwidth)
a dependency slows down/becomes unresponsive/returns corrupted data
cascading failures
measures
robust design
testing
process isolation
"crash-only" processes
measuring
monitoring
alerting (on odd state, or invariant violation, etc.)
severity: 25-27% of outages
human error
severity: 30-35% of outages (source: "Why do Internet services fail, and what can be done about it? -- Oppenheimer et al.")
examples
misconfiguration
measures
design systems & tools in ways that minimise opportunities for error
better abstractions, APIs, admin interfaces
make it hard to do the "wrong thing" and easy to do the "right thing"
decouple environments (e.g. sandbox to "play" with the system vs. production)
test thoroughly at all levels (linting, unit tests, integration tests, manual tests)
allow quick & easy recovery (e.g. roll back; progressive delivery)
monitoring of performance metrics & error rates (telemetry, observability)
good processes & management
Scalability
Measure load & performance
Describe load
via "load parameters"
for example
ratio reads/writes
number of simultaneous users
number of daily active users
cache hit-rate
requests per second
distribution of users with some specific behaviours
Where is the bottleneck?
Describe performance
experiment
change load parameter
keep system resources unchanged
=> how is performance affected?
change system resources to get previous performance
=> how much do you need to increase system resources?
metrics
throughput
latency
distribution
duration that a request is waiting to be handled
response time
distribution
= service time + network delays + queueing delays
distribution
95th, 99th and 99.9th percentiles (p95, p99, p999)
median (p50)
service time
distribution
SLO
SLA
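The percentile metrics above (p50, p95, p99, p999) can be computed from a sample of measured response times; a minimal sketch using the nearest-rank method (sample values are made up for illustration):

```python
def percentile(sorted_times, p):
    """Return the value below which p% of the sorted samples fall
    (nearest-rank method; assumes sorted_times is non-empty and sorted)."""
    k = max(0, round(p / 100 * len(sorted_times)) - 1)
    return sorted_times[k]

# Hypothetical response times in milliseconds; note the long tail (350 ms)
response_times_ms = sorted([12, 15, 17, 20, 21, 25, 30, 45, 80, 350])

p50 = percentile(response_times_ms, 50)   # median
p95 = percentile(response_times_ms, 95)
p99 = percentile(response_times_ms, 99)
```

The mean is a poor summary here: one 350 ms outlier dominates it, while the median stays near the typical case, which is why percentiles are preferred for response-time SLOs.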
Problems
Queueing delays
"Head-of-line blocking", i.e. slow requests holding up resources
"Tail latency amplification", i.e. when one user request fans out into many parallel backend requests, \( t_{operation} = \max_i (t_{request}^{[i]}) \)
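The tail-latency-amplification formula can be illustrated numerically: even if each backend call is slow only 1% of the time, a request that fans out to 100 backends hits at least one slow call with probability \(1 - 0.99^{100} \approx 63\%\). A hypothetical simulation (latency values and fan-out are assumptions for illustration):

```python
import random

random.seed(42)

def backend_call_ms():
    # 99% of calls take ~10 ms; 1% fall in the slow tail (~1000 ms)
    return 1000 if random.random() < 0.01 else 10

def user_request_ms(fanout):
    # t_operation = max over the parallel backend requests
    return max(backend_call_ms() for _ in range(fanout))

# Fraction of user requests (fanout 100) that hit the slow tail
slow = sum(user_request_ms(100) >= 1000 for _ in range(1000)) / 1000
```

So the p99 of a single backend becomes close to the *typical* case for the fanned-out operation.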
Volumes of
reads
writes
data to store
Complexity of the data
Access Patterns
Solutions
Scale vertically
Scale horizontally
"Shared-nothing" architecture
"Elastic" architecture, i.e. automatically add nodes
Maintainability
Core Design Principles
Simplicity
Evolvability
Operability
Ops responsibilities
Monitoring
Debugging & Root cause analysis
Keeping software up-to-date
Security & Security patches
System's dependencies
Preventive measures
Capacity planning
Best practices
Configuration management
Maintenance & Migrations
Predictable processes
Preserving knowledge about systems
Dev responsibilities
Observability
Support for automation (e.g. APIs)
Integration with standard tools (e.g. JMX)
Avoiding SPOFs, "Cattle" but not "Pets"
Good documentation (e.g. "If X, then Y")
Good defaults + Ability to override defaults
Self-healing + Ability to override if needed
Predictable behaviour, Avoid bad surprises
Productivity
Data Models & Query Languages
Relational Model vs. Document Model
document model
"schema-on-read"
one-to-many relationships
heterogeneous data
many types (making splitting them into tables impractical)
structure of data determined by external systems
no control
may change over time
examples
log data
events
data & storage locality
relational model
many-to-many
"schema-on-write"
Query Languages for Data
declarative languages
are easier to work with
can be optimised under the hood
say "what" but not "how"
are generally better to manipulate data
imperative languages
tend to be more verbose
tend to be harder to work with
tend to be brittle, less evolvable, and unable to leverage newly introduced capabilities or optimisations
tend to be abstracted away by declarative languages
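The declarative/imperative contrast above can be shown side by side; a small sketch using SQLite (the table and data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
conn.executemany("INSERT INTO animals VALUES (?, ?)",
                 [("shark", "Sharks"), ("lion", "Felidae"), ("tiger", "Felidae")])

# Declarative: state *what* you want; the engine decides *how*
# (which index to use, in what order to scan, etc.)
declarative = conn.execute(
    "SELECT name FROM animals WHERE family = 'Sharks'").fetchall()

# Imperative equivalent: spell out *how* to iterate and filter,
# which pins down one execution strategy
imperative = []
for name, family in conn.execute("SELECT name, family FROM animals"):
    if family == "Sharks":
        imperative.append((name,))
```

Both produce the same rows, but only the declarative query leaves the optimiser free to change the access path later.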
Graph-Like Data Models
Storage & Retrieval
Underlying Data Structures
LSM (Log-Structured Merge) trees + commit log
\(B^+\) trees + write-ahead log (WAL)
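The log-structured idea underlying LSM trees can be sketched in miniature: writes only ever append to a log, an in-memory index maps each key to its latest offset, and a compaction step merges away superseded entries. This is a toy single-segment sketch, not a real LSM tree (no SSTables, no memtable flushes):

```python
class AppendOnlyStore:
    """Toy log-structured key-value store: writes append to a log,
    an in-memory index maps each key to its most recent log offset."""

    def __init__(self):
        self.log = []     # list stands in for an append-only file
        self.index = {}   # key -> offset of the latest value

    def put(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))     # never overwrite in place

    def get(self, key):
        offset = self.index.get(key)
        return None if offset is None else self.log[offset][1]

    def compact(self):
        """Merge step: rewrite the log keeping only the latest entry per key."""
        self.log = [(k, self.log[off][1]) for k, off in self.index.items()]
        self.index = {k: i for i, (k, _) in enumerate(self.log)}
```

Sequential appends are what make this write path fast; compaction reclaims the space taken by overwritten values.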
Transaction Processing or Analytics
Column-Oriented Storage
Encoding & Evolution
Formats for Encoding Data
Modes of Dataflow
Message-Passing Dataflows
a.k.a.
Message brokers
Message queues
Message-Oriented Middleware
pros
can act as a buffer if the consumer is overloaded (reliability)
can redeliver messages / prevent message loss (reliability)
can act as a load-balancer (discoverability / evolvability)
can broadcast or multicast messages
decouples producers and consumers (evolvability)
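The buffering and decoupling pros above can be sketched with an in-process queue standing in for a broker (a toy sketch, not a real broker: no persistence or redelivery):

```python
import queue
import threading

broker = queue.Queue()  # stands in for the broker's message queue

def producer():
    for i in range(5):
        broker.put(f"event-{i}")   # producer never waits for the consumer

consumed = []

def consumer():
    for _ in range(5):
        consumed.append(broker.get())  # blocks until a message is available
        broker.task_done()

threading.Thread(target=producer).start()
t = threading.Thread(target=consumer)
t.start()
t.join()
```

The producer and consumer share no direct reference to each other, only to the queue: either side can be replaced, restarted, or temporarily overloaded without the other noticing.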
Distributed Data
Replication
Leaders & Followers
Problems with Replication Lag
Multi-Leader Replication
Leaderless Replication
Partitioning
Partitioning of Key-Value Data
Partitioning and Secondary Indexes
Rebalancing Partitions
Request Routing
Transactions
The Slippery Concept of a Transaction
Weak Isolation Levels
Serializability
The Trouble with Distributed Systems
Faults & Partial Failures
Unreliable Networks
Unreliable Clocks
Knowledge, Truth, and Lies
Consistency & Consensus
Linearizability
Ordering Guarantees
Distributed Transactions & Consensus
Derived Data
Batch Processing
Batch Processing with Unix Tools
MapReduce and Distributed Filesystems
Beyond MapReduce
Stream Processing
Transmitting Event Streams
Databases & Streams
Processing Streams
The Future of Data Systems
Data Integration
Unbundling Databases
Aiming for Correctness
Doing the Right Thing
Challenges with data
Amount of data
Complexity of data
Speed at which data changes
Terminology
Fault
one component of the system deviating from its spec
impossible to reduce probability to zero
therefore best to design fault-tolerance mechanisms that prevent faults from causing failures
generating faults (e.g. Chaos Monkey) may be a way to test for fault-tolerance
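Deliberate fault injection can be sketched as wrapping a component so it fails some fraction of the time, then checking that the fault-tolerance mechanism (here, a retry loop; all names are hypothetical) still delivers the service:

```python
import random

random.seed(0)  # deterministic fault injection for the example

def flaky(fn, fault_rate=0.3):
    """Wrap fn so it sometimes raises, simulating a component fault."""
    def wrapped(*args):
        if random.random() < fault_rate:
            raise ConnectionError("injected fault")
        return fn(*args)
    return wrapped

def with_retries(fn, attempts=5):
    """Fault-tolerance mechanism under test: retry on failure."""
    for i in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if i == attempts - 1:
                raise  # faults exceeded tolerance -> failure

service = flaky(lambda: "ok")
result = with_retries(service)
```

If the retries are removed, injected faults surface as failures, which is exactly the signal this kind of testing is meant to produce.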
Failure
when the system as a whole stops providing the required service to the user