"Designing Data-Intensive Applications" by Martin Kleppmann
Foundations of Data Systems
Properties
Reliability
System works correctly despite...
hardware faults
examples
data centre black-out
faulty SSD
faulty RAM
faulty HDD
MTTF = 10-50 years
network connectivity issue
machine becomes unavailable in AWS
machine requires a reboot following security update
measures
redundancy
severity: 10-25% of outages
software faults
examples
systematic error because of software bug
runaway process which uses up shared hardware resources (CPU, memory, disk, bandwidth)
a dependency slows down/becomes unresponsive/returns corrupted data
cascading failures
measures
robust design
testing
process isolation
"crash-only" processes
measuring
monitoring
alerting (on odd state, or invariant violation, etc.)
severity: 25-27% of outages
human error
severity: 30-35% of outages (source: "Why do Internet services fail, and what can be done about it? -- Oppenheimer et al.")
examples
misconfiguration
measures
design systems & tools in ways that minimise opportunities for error
better abstractions, APIs, admin interfaces
make it hard to do the "wrong thing" and easy to do the "right thing"
decouple environments (e.g. sandbox to "play" with the system vs. production)
test thoroughly at all levels (linting, unit tests, integration tests, manual tests)
allow quick & easy recovery (e.g. roll back; progressive delivery)
monitoring of performance metrics & error rates (telemetry, observability)
good processes & management
Scalability
Measure load & performance
Describe load
via "load parameters"
for example
ratio reads/writes
number of simultaneous users
number of daily active users
cache hit-rate
requests per second
distribution of users with some specific behaviours
Where is the bottleneck?
Describe performance
experiment
change load parameter
keep system resources unchanged
=> how is performance affected?
change system resources to get previous performance
=> how much do you need to increase system resources?
metrics
throughput
latency
distribution
duration that a request is waiting to be handled
response time
distribution
= service time + network delays + queueing delays
distribution
95th, 99th and 99.9th percentiles (p95, p99, p999)
median (p50)
service time
distribution
SLO
SLA
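The percentile metrics above (p50, p95, p99, p999) can be computed from a sample of measured response times; a minimal sketch using the nearest-rank method (sample values are made up for illustration):

```python
def percentile(sorted_times, p):
    """Return the value below which p% of the sorted samples fall
    (nearest-rank method; assumes sorted_times is non-empty and sorted)."""
    k = max(0, round(p / 100 * len(sorted_times)) - 1)
    return sorted_times[k]

# Hypothetical response times in milliseconds; note the long tail (350 ms)
response_times_ms = sorted([12, 15, 17, 20, 21, 25, 30, 45, 80, 350])

p50 = percentile(response_times_ms, 50)   # median
p95 = percentile(response_times_ms, 95)
p99 = percentile(response_times_ms, 99)
```

The mean is a poor summary here: one 350 ms outlier dominates it, while the median stays near the typical case, which is why percentiles are preferred for response-time SLOs.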
Problems
Queueing delays
"Head-of-line blocking", i.e. slow requests holding up resources
"Tail latency amplification", i.e. when one user request fans out into many parallel backend requests, \( t_{operation} = \max_i (t_{request}^{[i]}) \)
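The tail-latency-amplification formula can be illustrated numerically: even if each backend call is slow only 1% of the time, a request that fans out to 100 backends hits at least one slow call with probability \(1 - 0.99^{100} \approx 63\%\). A hypothetical simulation (latency values and fan-out are assumptions for illustration):

```python
import random

random.seed(42)

def backend_call_ms():
    # 99% of calls take ~10 ms; 1% fall in the slow tail (~1000 ms)
    return 1000 if random.random() < 0.01 else 10

def user_request_ms(fanout):
    # t_operation = max over the parallel backend requests
    return max(backend_call_ms() for _ in range(fanout))

# Fraction of user requests (fanout 100) that hit the slow tail
slow = sum(user_request_ms(100) >= 1000 for _ in range(1000)) / 1000
```

So the p99 of a single backend becomes close to the *typical* case for the fanned-out operation.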
Volumes of
reads
writes
data to store
Complexity of the data
Access Patterns
Solutions
Scale vertically
Scale horizontally
"Shared-nothing" architecture
"Elastic" architecture, i.e. automatically add nodes
Maintainability
Core Design Principles
Simplicity
Evolvability
Operability
Ops responsibilities
Monitoring
Debugging & Root cause analysis
Keeping software up-to-date
Security & Security patches
System's dependencies
Preventive measures
Capacity planning
Best practices
Configuration management
Maintenance & Migrations
Predictable processes
Preserving knowledge about systems
Dev responsibilities
Observability
Support for automation (e.g. APIs)
Integration with standard tools (e.g. JMX)
Avoiding SPOFs, "Cattle" but not "Pets"
Good documentation (e.g. "If X, then Y")
Good defaults + Ability to override defaults
Self-healing + Ability to override if needed
Predictable behaviour, Avoid bad surprises
Productivity
Data Models & Query Languages
Relational Model vs. Document Model
document model
"schema-on-read"
one-to-many relationships
heterogeneous data
many types (making splitting them into tables impractical)
structure of data determined by external systems
no control
may change over time
examples
log data
events
data & storage locality
relational model
many-to-many
"schema-on-write"
Query Languages for Data
declarative languages
are easier to work with
can be optimised under the hood
say "what" but not "how"
are generally better to manipulate data
imperative languages
tend to be more verbose
tend to be harder to work with
tend to be brittle, less evolvable, and unable to leverage newly introduced capabilities or optimisations
tend to be abstracted away by declarative languages
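The declarative/imperative contrast above can be shown side by side; a small sketch using SQLite (the table and data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
conn.executemany("INSERT INTO animals VALUES (?, ?)",
                 [("shark", "Sharks"), ("lion", "Felidae"), ("tiger", "Felidae")])

# Declarative: state *what* you want; the engine decides *how*
# (which index to use, in what order to scan, etc.)
declarative = conn.execute(
    "SELECT name FROM animals WHERE family = 'Sharks'").fetchall()

# Imperative equivalent: spell out *how* to iterate and filter,
# which pins down one execution strategy
imperative = []
for name, family in conn.execute("SELECT name, family FROM animals"):
    if family == "Sharks":
        imperative.append((name,))
```

Both produce the same rows, but only the declarative query leaves the optimiser free to change the access path later.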
Graph-Like Data Models
Storage & Retrieval
Underlying Data Structures
LSM (Log-Structured Merge) trees + commit log
\(B^+\) trees + write-ahead log (WAL)
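The log-structured idea underlying LSM trees can be sketched in miniature: writes only ever append to a log, an in-memory index maps each key to its latest offset, and a compaction step merges away superseded entries. This is a toy single-segment sketch, not a real LSM tree (no SSTables, no memtable flushes):

```python
class AppendOnlyStore:
    """Toy log-structured key-value store: writes append to a log,
    an in-memory index maps each key to its most recent log offset."""

    def __init__(self):
        self.log = []     # list stands in for an append-only file
        self.index = {}   # key -> offset of the latest value

    def put(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))     # never overwrite in place

    def get(self, key):
        offset = self.index.get(key)
        return None if offset is None else self.log[offset][1]

    def compact(self):
        """Merge step: rewrite the log keeping only the latest entry per key."""
        self.log = [(k, self.log[off][1]) for k, off in self.index.items()]
        self.index = {k: i for i, (k, _) in enumerate(self.log)}
```

Sequential appends are what make this write path fast; compaction reclaims the space taken by overwritten values.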
Transaction Processing or Analytics
Column-Oriented Storage
Encoding & Evolution
Formats for Encoding Data
Modes of Dataflow
Message-Passing Dataflows
a.k.a.
Message brokers
Message queues
Message-Oriented Middleware
pros
can act as a buffer if the consumer is overloaded (reliability)
can redeliver messages / prevent message loss (reliability)
can act as a load-balancer (discoverability / evolvability)
can broadcast or multicast messages
decouples producers and consumers (evolvability)
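The buffering and decoupling pros above can be sketched with an in-process queue standing in for a broker (a toy sketch, not a real broker: no persistence or redelivery):

```python
import queue
import threading

broker = queue.Queue()  # stands in for the broker's message queue

def producer():
    for i in range(5):
        broker.put(f"event-{i}")   # producer never waits for the consumer

consumed = []

def consumer():
    for _ in range(5):
        consumed.append(broker.get())  # blocks until a message is available
        broker.task_done()

threading.Thread(target=producer).start()
t = threading.Thread(target=consumer)
t.start()
t.join()
```

The producer and consumer share no direct reference to each other, only to the queue: either side can be replaced, restarted, or temporarily overloaded without the other noticing.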
Distributed Data
Replication
Leaders & Followers
Problems with Replication Lag
Multi-Leader Replication
Leaderless Replication
Partitioning
Partitioning of Key-Value Data
Partitioning and Secondary Indexes
Rebalancing Partitions
Request Routing
Transactions
The Slippery Concept of a Transaction
Weak Isolation Levels
Serializability
The Trouble with Distributed Systems
Faults & Partial Failures
Unreliable Networks
Unreliable Clocks
Knowledge, Truth, and Lies
Consistency & Consensus
Linearizability
Ordering Guarantees
Distributed Transactions & Consensus
Derived Data
Batch Processing
Batch Processing with Unix Tools
MapReduce and Distributed Filesystems
Beyond MapReduce
Stream Processing
Transmitting Event Streams
Databases & Streams
Processing Streams
The Future of Data Systems
Data Integration
Unbundling Databases
Aiming for Correctness
Doing the Right Thing
Challenges with data
Amount of data
Complexity of data
Speed at which data changes
Terminology
Fault
one component of the system deviating from its spec
impossible to reduce probability to zero
therefore best to design fault-tolerance mechanisms that prevent faults from causing failures
generating faults (e.g. Chaos Monkey) may be a way to test for fault-tolerance
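Deliberate fault injection can be sketched as wrapping a component so it fails some fraction of the time, then checking that the fault-tolerance mechanism (here, a retry loop; all names are hypothetical) still delivers the service:

```python
import random

random.seed(0)  # deterministic fault injection for the example

def flaky(fn, fault_rate=0.3):
    """Wrap fn so it sometimes raises, simulating a component fault."""
    def wrapped(*args):
        if random.random() < fault_rate:
            raise ConnectionError("injected fault")
        return fn(*args)
    return wrapped

def with_retries(fn, attempts=5):
    """Fault-tolerance mechanism under test: retry on failure."""
    for i in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if i == attempts - 1:
                raise  # faults exceeded tolerance -> failure

service = flaky(lambda: "ok")
result = with_retries(service)
```

If the retries are removed, injected faults surface as failures, which is exactly the signal this kind of testing is meant to produce.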
Failure
when the system as a whole stops providing the required service to the user