Data-Intensive App Design
Database
SQL vs NoSQL
SQL
better support for joins
support many-to-many relationships
NoSQL
schema flexibility
better performance due to data locality (only applies when large part of the doc is needed at the same time)
closer to how data is used in app code
cannot refer directly to a nested item within a document
poor support for joins: need multiple (transactional) queries to emulate a many-to-many relation, which is slower than a join and adds complexity to the app
:!:generally recommended to keep doc fairly small and avoid writes that increase the doc size
Models
relational / SQL
data is organized into relations (SQL tables)
each relation is an unordered collection of tuples (SQL rows)
support Many-to-Many relations via join
query optimizer can improve query efficiency
schema-on-write
schema is explicit and the database ensures all written data conforms to it
similar to static (compile-time) type checking
hierarchical / Document
every record has exactly 1 parent
schema-on-read
structure of data is implicit and only interpreted when data is read
similar to dynamic (runtime) type checking
entire document needs to be rewritten when updating; only modifications that do NOT change the encoded size of the doc can easily be performed in place
targets use cases where data comes in self-contained documents and relationships between docs are rare
network / CODASYL
generalization of hierarchical model
a record can have multiple parents
links between records are NOT foreign keys but more like pointers; need to walk the links along an access path to reach a record
querying and updating is complicated and inflexible if the access path is not known beforehand
graph-like
Property Graph
vertex
unique ID
set of outgoing edges
set of incoming edges
collection of properties
edge
unique ID
tail vertex (from vertex)
head vertex (to vertex)
properties
features
any vertex can connect to any other vertex
given any vertex, you can efficiently find both its incoming and outgoing edges, thus traverse the graph
you can store different kinds of relations in a single graph by using edge labels
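A minimal Python sketch of the vertex/edge structure above (class and function names are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    id: int
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)  # edges where this vertex is the tail
    in_edges: list = field(default_factory=list)   # edges where this vertex is the head

@dataclass
class Edge:
    id: int
    tail: Vertex          # "from" vertex
    head: Vertex          # "to" vertex
    label: str            # kind of relationship, enabling mixed relation types in one graph
    properties: dict = field(default_factory=dict)

def connect(eid, tail, head, label):
    e = Edge(eid, tail, head, label)
    tail.out_edges.append(e)   # supports efficient forward traversal
    head.in_edges.append(e)    # supports efficient backward traversal
    return e

alice = Vertex(1, {"name": "Alice"})
nyc = Vertex(2, {"name": "New York"})
connect(10, alice, nyc, "lives_in")
print([e.head.properties["name"] for e in alice.out_edges])  # ['New York']
```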
query languages
Cypher
p74
each query can be executed in multiple ways; the query optimizer analyzes them and chooses the most efficient one
Triple-store
all information is stored as triples: (subject, predicate, object), e.g. (Jim, likes, bananas)
subject is equivalent to a vertex in a graph
object is either a) a value of a primitive datatype (string or number), or b) another vertex in the graph
in a) the predicate and object together form a property of the subject (vertex)
in b) the predicate is an edge in the graph, with the subject as the tail vertex and the object as the head vertex, like subject --[predicate]-> object
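A toy Python illustration of how both cases fit in one list of triples (the helper is hypothetical; it uses the note's own example):

```python
# each fact is one (subject, predicate, object) triple
triples = [
    ("jim", "age", 33),             # a) object is a primitive value:
                                    #    predicate + object act as a property of vertex "jim"
    ("jim", "likes", "bananas"),    # b) object is another vertex:
                                    #    predicate acts as an edge jim --[likes]-> bananas
]

def objects_of(subject, predicate):
    # follow all edges (or read all properties) with this predicate
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects_of("jim", "likes"))   # ['bananas']
```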
SPARQL
p80
semantic web
p78
websites publish data in a consistent format, allowing data to be shared and combined into a web of data, a database of everything
Resource Description Framework (RDF)
targets use cases where anything is potentially related to everything
Query Languages
Declarative / SQL
describe what is the pattern of data you want, not how to do it
more concise, easier to use
decouples by hiding implementation details, allowing more room for optimization
can utilize parallelization
Imperative / CODASYL
describe how to do the query in certain operations in certain order
hard to parallelize because it depends on a specific execution order
MapReduce
mechanism to perform read-only queries across many documents
map and reduce must be pure functions:
cannot perform additional queries in map or reduce
must NOT have any side effect
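A minimal sketch of the idea, loosely following the MongoDB-style map/reduce the book describes (function and field names are hypothetical):

```python
from collections import defaultdict

# map and reduce must be pure: no side effects, no extra DB queries
def map_fn(doc):
    # emit (key, value) pairs, e.g. group observations by month
    yield (doc["date"][:7], doc["num_animals"])

def reduce_fn(key, values):
    return sum(values)

def map_reduce(docs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

docs = [{"date": "2025-01-12", "num_animals": 3},
        {"date": "2025-01-30", "num_animals": 4}]
print(map_reduce(docs, map_fn, reduce_fn))  # {'2025-01': 7}
```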
Index
Hash Index
p94
Compaction
p94
Concerns
File format
prefer binary format
Deleting records
mark with a tombstone and discard during merging in compaction
Crash recovery
restore by reading segment files from disk and rebuilding the hash map with the latest value for each key;
store snapshots of each segment's hash map on disk to speed this up
Partially written records
use checksums to detect and discard corrupted data
Concurrency control
have only 1 writer thread
data file segments are append-only and immutable, so they can be read by multiple threads concurrently
hash table MUST fit in memory
range queries are not efficient
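A toy Bitcask-style sketch of the append-only log plus in-memory hash index (class name hypothetical; the comma-separated record format is a toy stand-in for the binary format real engines use):

```python
import os

class HashIndexLog:
    """Append-only log file with an in-memory hash index."""
    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}       # key -> byte offset of latest record; must fit in memory

    def set(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(f"{key},{value}\n".encode())   # append-only, never overwrite
        self.f.flush()
        self.index[key] = offset                    # point the index at the newest record

    def get(self, key):
        offset = self.index[key]                    # one disk seek per lookup
        self.f.seek(offset)
        line = self.f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

db = HashIndexLog("data.log")
db.set("user:1", "alice")
db.set("user:1", "bob")    # the old record stays in the log until compaction
print(db.get("user:1"))    # 'bob'
```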
SSTable & LSM-Tree
p98
:star: Faster for write
Sorted String Table - key-value pairs are sorted by key
each key appears only once within each merged segment file
compaction is like a mergesort - copy to the output file the lowest key among the first keys in each segment file, and repeat until all files are merged
:question:what if the same key appears in multiple input segment files?
all the values in one input file must be more recent than those in the other file
so just keep the value from the most recent segment and discard the older one
no longer need an index of all the keys; a sparse in-memory index of some keys is enough, then scan from the nearest indexed key
group records into a block and compress before writing to disk, and let the index point to the start of each compressed block,
saving disk space and reducing the I/O bandwidth used
append records received in a write request into an on-disk log (Write-Ahead-Log)
add the same records to MemTable, an in-memory balanced search tree (red-black or AVL)
when the MemTable grows larger than a threshold (a few MB), write it out to disk as an SSTable file, and delete the corresponding WAL
to serve a read request, first try to find the key in the MemTable, then in the most recent segment, then older ones, etc. (use a bloom filter to skip segments that cannot contain the key)
run periodic compaction to combine segment files and discard old/deleted values
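A toy sketch of this write/read path (names hypothetical; a plain dict stands in for the balanced tree, and compaction is simplified from the streaming sorted-file merge described above):

```python
import json

class LSMSketch:
    """Toy LSM write/read path: WAL + MemTable, flushed to sorted segments."""
    MEMTABLE_LIMIT = 4          # real engines flush at a few MB

    def __init__(self):
        self.wal = open("wal.log", "a")
        self.memtable = {}      # stand-in for a red-black/AVL tree
        self.segments = []      # sorted segment dicts, oldest first

    def set(self, key, value):
        self.wal.write(json.dumps([key, value]) + "\n")  # for crash recovery
        self.wal.flush()
        self.memtable[key] = value
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            # write keys out in sorted order -> an SSTable segment
            self.segments.append(dict(sorted(self.memtable.items())))
            self.memtable = {}   # and the corresponding WAL could be deleted here

    def get(self, key):
        if key in self.memtable:              # 1) MemTable
            return self.memtable[key]
        for seg in reversed(self.segments):   # 2) newest segment first
            if key in seg:
                return seg[key]
        return None

    def compact(self):
        # merge segments; newer segments win for duplicate keys
        merged = {}
        for seg in self.segments:             # oldest -> newest
            merged.update(seg)
        self.segments = [merged]
```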
Compaction Strategies
:star: SizeTiered
newer and smaller SSTables are successively merged into older and larger SSTables
Leveled
the key range is split up into smaller SSTables and older data is moved into separate "levels",
allowing compaction to proceed more incrementally and use less disk space
Pros & Cons
Pros
sequential read/write, much faster than random access
WAL to ensure no data loss due to app crash
Cons
write amplification - a single write resulting in multiple writes to the disk over time
SSD can only overwrite blocks limited times
the more written to disk, the fewer writes per second it can handle
write bandwidth is shared by initial write to WAL, flushing Memtable and compaction in background
B-Tree
p102
:star: Faster for read
key-value pairs sorted by key, allowing efficient lookup and range query
fixed-size blocks/pages, closer to the underlying hardware design
each page is identified by an address, allowing one page to refer to another like pointers, constructing a tree of pages
Branching Factor
number of references to child pages in one page
Update
search for leaf page containing the key
change the value in that page
write the page back to disk, any ref to that page remains valid
Insert
find the page whose range encompasses the new key and add it.
if not enough space, split the page into 2 half-full pages and update the parent page to account for the new key ranges
:question: what if DB crashes half-way of a page splitting?
use a Write-Ahead Log to append modifications before applying them
remains balanced all the time, because splitting a page doesn't attach new child subtrees; splits bubble up toward the root page, so a B-tree with n keys always has depth O(log n)
Need concurrency control (e.g. latches) to avoid threads seeing the tree in an inconsistent state
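A minimal sketch of a lookup walking the tree of pages (structure is simplified; real pages are fixed-size byte blocks on disk, and the branching factor is typically several hundred):

```python
import bisect

BRANCHING_FACTOR = 4   # max child references per page in this toy

class Page:
    """One page: sorted keys plus child references (or values at leaves)."""
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted
        self.children = children    # len(keys) + 1 child pages; None at leaves
        self.values = values        # values, only at leaves

def lookup(page, key):
    while page.children is not None:           # walk down the tree of pages
        i = bisect.bisect_right(page.keys, key)
        page = page.children[i]                # follow the page reference
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

leaf1 = Page(keys=[10, 20], values=["a", "b"])
leaf2 = Page(keys=[40, 50], values=["c", "d"])
root = Page(keys=[30], children=[leaf1, leaf2])
print(lookup(root, 40))   # 'c'
```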
Optimizations
copy-on-write scheme
abbreviate keys to save disk space
additional pointers to point sibling pages for faster referencing
fractal trees
secondary index
p108
heap-file index
value in the index is a reference to the row stored elsewhere (in a heap file)
avoid duplicate data
efficient when updating value without changing the key
incurs an extra hop on reads
clustered index
store the indexed row directly within an index
covering index
in between heap-file and clustered index, store part of a table's columns within the index
multi-column index
concatenated index
combines several fields into one key by appending them
R-Tree
for multi-dimension index
App Type
OnLine Transaction Processing
high availability
low latency
critical to business operations
OnLine Analytic Processing
Data warehousing
Schemas
p114
Star
fact table
each row represents an event that occurred at a particular time
max flexibility for arbitrary analysis
dimension table
metadata of event, who/what/where/when/why/how
Snowflake
dimensions are further broken down to sub-dimensions
Extract-Transform-Load
optimized for analytic access patterns
Bottlenecks
bandwidth for getting data from disk to memory
bandwidth from main memory to CPU cache
Technique
vectorized processing
p120
bitmap encoding
p119
column-oriented storage
p118
significantly faster for ad-hoc analytical queries
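A toy illustration of bitmap encoding over a single stored column (data made up): each distinct value gets one bitmap, and predicates become cheap bitwise operations.

```python
# one column of a fact table, stored separately from the other columns
country = ["de", "us", "de", "fr", "us", "de"]

# one bitmap per distinct value: bit i is set iff row i has that value
bitmaps = {v: [int(x == v) for x in country] for v in set(country)}

# WHERE country IN ('de', 'fr')  ->  bitwise OR of the two bitmaps
matches = [a | b for a, b in zip(bitmaps["de"], bitmaps["fr"])]
print(matches)   # [1, 0, 1, 1, 0, 1] -> row ids 0, 2, 3, 5
```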
materialized view
p122
data cube
p124
Dataflow Models
model evolution
assume all the servers will be updated first, and all the clients after;
then only need backward compatibility on requests,
and only forward compatibility on responses
message broker
p158
buffer msgs if recipient unavailable - improve reliability
redelivers to crashed/failed consumers - avoid message loss
decouples producers from consumers
allows broadcasting
asynchronous communication
usually one-way, use another channel for async responses
actor model
decouple logic code from threads to avoid race condition, locking, deadlock etc.
at-most-once - no guarantee on msg delivery
Data Distribution
shared-nothing architecture
p168
Replication
keep multiple copies of same data on different nodes
data redundancy
improve performance
Leader/Follower
1 leader node, M follower nodes
write request must be sent to leader first
after leader has written data to its local storage, it sends data change log to followers,
followers apply data change log to their own local storages in the same order as leader did
read query can be handled by either leader or followers
How to?
setup new followers
p176
take consistent snapshots of leader's DB periodically (better without locking the entire DB)
Copy the snapshot to the new follower nodes
the follower then connects to the leader and asks for data changes since the snapshot
Failover
p178
determine leader has failed by heartbeat etc.
choose a new leader, best candidate is the follower with most up-to-date data
reconfigure the system so that all nodes reach consensus on the new leader
Issues
:!: in async replication, the new leader may not have received all the writes from the old leader before it failed; those writes are usually discarded
:!: split brain - multiple leaders
:!: hard to define a proper timeout; unnecessary failovers make things worse for a system already under high load or suffering network issues
Sync vs Async
p174
Sync
followers guaranteed to have an up-to-date copy of data, consistent with the leader
:!: leader must block all writes until the sync follower is available again
Async
leader keeps processing writes even if a follower has fallen behind
:!: writes acknowledged to the client are not guaranteed durable - they may be lost if the leader fails before replicating
semi-sync
1 sync follower and others are async followers
Replication log
p180
Statement-based
log every write request (statement) and send to followers
corrupts data when
any statement calls a non-deterministic function (e.g. NOW(), RAND())
statements depend on existing data in the DB (e.g. autoincrementing columns)
statements have side effects (triggers, stored procedures)
Write-Ahead Log
log is append-only sequence of bytes containing all the writes to db
the log describes data at a very low level (bytes in disk blocks), coupling replication to the storage engine; version and schema changes are hard to handle
Logical log replication
decouple log format from storage engine format
easier for backward compatibility
easier for external applications to parse
Trigger-based
more flexibility to handle replication in application code
greater overheads, more prone to bugs/limitations
Replication Lag
p183
leader-based replication with async followers introduces replication lag: followers may serve stale reads until they catch up
read-your-writes consistency
p184
when reading data that the user may have modified, read from leader
:!: not applicable where the user may edit most of the data, as that would route most read requests to the leader node
:!: if replicas distributed across multiple datacenters, any request that needs to be served by leader MUST be routed to the datacenter that contains the leader node
track time of last update, read from leader only if last update was within 1 min or so
prevent read from followers that are more than 1 min behind leader
client can remember the timestamp of its most recent write; the system then ensures any replica serving reads for that user reflects updates at least up to that timestamp (could be a logical timestamp, e.g. a log sequence number, since clock sync is hard)
if the same user accesses the service from multiple devices, need cross-device read-your-writes consistency, e.g. route requests from all of a user's devices to the same datacenter containing the leader
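A toy sketch of the logical-timestamp approach (all names hypothetical): the client keeps the log sequence number of its last write, and reads fall back to the leader unless a follower has caught up that far.

```python
class Replica:
    def __init__(self):
        self.data, self.applied_seq = {}, 0

leader = Replica()
followers = [Replica(), Replica()]
log_seq = 0

def write(key, value):
    global log_seq
    log_seq += 1
    leader.data[key] = value
    leader.applied_seq = log_seq
    return log_seq                 # client remembers this as its write timestamp

def read_your_writes(key, last_write_seq):
    # serve from any follower that has replayed the log at least this far,
    # otherwise fall back to the leader
    for f in followers:
        if f.applied_seq >= last_write_seq:
            return f.data.get(key)
    return leader.data.get(key)

seq = write("profile", "new bio")          # async replication still in flight
print(read_your_writes("profile", seq))    # 'new bio', served by the leader
```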
Monotonic reads
p185
a user will not see things moving backward in time when making multiple requests to different follower nodes
make sure each user always makes their reads from the same replica. e.g. choose a replica based on hash of user ID
:!: if the replica fails, need to reroute user queries to another replica
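A minimal sketch of the replica-pinning idea (the hash function choice is arbitrary):

```python
import hashlib

def replica_for(user_id: str, replicas: list):
    # the same user deterministically maps to the same replica,
    # so successive reads never move backward in time
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return replicas[h % len(replicas)]

print(replica_for("user-42", ["replica-a", "replica-b", "replica-c"]))
```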
Consistent Prefix Reads
p186
avoid violation of causality due to replication lag
if a sequence of writes happens in a certain order, then anyone reading those writes must see them appear in the same order
requires tracking the happens-before relationship,
e.g. make sure causally related writes are written to the same partition
:question: Ask yourself: "how does the app behave if the replication lag increases to minutes or hours?"
Pretending replication is sync when it is actually async is a recipe for problems
Multi-Leader
p189
mostly used in multi-datacenter apps, 1 leader per dc
normal leader-follower update within dc, leader replicates its changes to leaders in other dcs
:!: conflicts need to be resolved between leaders
Pros
every write is processed within the local dc and replicated asynchronously to other dcs - better performance
if a dc is down, the others can still serve, and the failed dc catches up from leaders in other dcs when it comes back
tolerance to temporary network issue in public internet
Cons
same data may be concurrently modified in different dcs, causing conflicts
also causes unexpected behavior with autoincrementing keys, triggers, integrity constraints etc.
:!: Danger zone, avoid multi-leader if possible
:star: CouchDB is designed for multi-leader
:star: Automatic Conflict Resolution
p195
Conflict-free Replicated DataTypes (Riak 2.0 supports)
Mergeable persistent data structures
Operational Transformation
Handle Write Conflict
p192
Converge toward a consistent state across replicas
all replicas MUST arrive at the same final value when all changes have been replicated
Last Write Wins
give each write a unique ID, pick the write with highest ID
:!: prone to data loss
:!: only safe way to use it - ensure each key is written once and then treated as immutable
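A toy illustration of why LWW loses data (timestamps made up): the write with the lower clock reading is silently dropped even though its client saw a success.

```python
# each write is tagged with (timestamp, node_id) as its unique ID;
# the highest ID wins, concurrent siblings are discarded
writes = [
    ((1700000000.120, "node-b"), "v1"),
    ((1700000000.118, "node-a"), "v2"),   # concurrent write, slower clock -> lost
]
winner = max(writes, key=lambda w: w[0])
print(winner[1])   # 'v1'; 'v2' is gone even though its client saw it succeed
```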
Higher-Numbered Replica Wins
give each replica a unique number; writes originating at a higher-numbered replica take precedence over those from lower-numbered replicas
:!: prone to data loss
Retain All Info and Resolve Later
record the conflict in an explicit data structure to preserve all information, write app code to resolve later
resolve on write
resolve as soon as conflict detected by DB, may run in background
resolve on read
resolve on next read
Leaderless
any node can handle writes as a coordinator
coordinator doesn't enforce ordering of writes
Read Repair
read from multiple nodes, return value with latest version and update stale data
Anti-Entropy Process
background process that checks stale data and updates it
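A toy sketch of a quorum read with read repair (class name and the single version counter are simplified stand-ins for real version vectors):

```python
class Replica:
    def __init__(self):
        self.data, self.version = {}, {}

def quorum_read(key, replicas, r):
    # ask r replicas; the highest version number wins
    responses = [(n.version.get(key, 0), n) for n in replicas[:r]]
    latest_version, _ = max(responses, key=lambda t: t[0])
    latest_value = next(n.data.get(key) for v, n in responses if v == latest_version)
    # read repair: push the winning value back to any stale replica we contacted
    for v, n in responses:
        if v < latest_version:
            n.data[key], n.version[key] = latest_value, latest_version
    return latest_value

a, b, c = Replica(), Replica(), Replica()
a.data["k"], a.version["k"] = "new", 2
b.data["k"], b.version["k"] = "old", 1
print(quorum_read("k", [a, b, c], r=2))   # 'new'
print(b.data["k"])                        # 'new' - b has been repaired
```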
Quorum Consistency
p202
#Writes + #Reads > #Replicas, i.e. w + r > n (e.g. n = 3, w = r = 2): the set of nodes written to and the set read from must overlap in at least one node holding the latest value
:!:Limitations
p202
with sloppy quorum, the w writes may end up on different nodes than the r reads, so the overlap guarantee no longer holds
if 2 writes occur concurrently, it is unclear which happened first; if the winner is picked by timestamp (last write wins), writes can be lost due to clock skew
if a write happens concurrently with a read, the write may be reflected on only some of the replicas; unclear whether the read returns the old or new value
if a write succeeded on some replicas but failed on others, the successful ones are NOT rolled back
if a node carrying a new value fails and is restored from a replica carrying an old value, the number of replicas storing the new value may fall below w, breaking the quorum condition
Hinted Handoff
p204
once the network interruption is fixed, any writes that an alternative node accepted on behalf of an unavailable node are sent to the appropriate "home" nodes
Sloppy Quorums
p204
when a client can reach some nodes of the DB, but not enough to form a quorum for a specific value,
the DB still accepts writes, writing them to alternative nodes as a hinted handoff
w writes and r reads are still required, but they may include alternative (handoff) nodes outside a value's designated "home" nodes
:!: no guarantee to read latest value until hinted handoff has completed
Detecting Concurrent Writes
p206
conflicts can also happen during read repair or hinted handoff
events may arrive in different order at different nodes
:star: happens-before relationship
p208
A happens before B if
B knows about A
B depends on A
B builds upon A
A and B are concurrent if neither happens-before the other
Version Vectors
p212
each replica increments its own version number when processing a write
also keeps track of the version numbers it has seen from each of the other replicas
dotted version vector in Riak 2.0
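A minimal sketch of version-vector comparison (function names hypothetical), which also makes the happens-before / concurrent definitions above concrete:

```python
def merge(vv_a, vv_b):
    # elementwise max combines everything both replicas have seen
    return {r: max(vv_a.get(r, 0), vv_b.get(r, 0)) for r in vv_a.keys() | vv_b.keys()}

def happens_before(vv_a, vv_b):
    # A happens-before B if B has seen everything A has seen (and they differ)
    return all(vv_a.get(r, 0) <= vv_b.get(r, 0) for r in vv_a) and vv_a != vv_b

def concurrent(vv_a, vv_b):
    return not happens_before(vv_a, vv_b) and not happens_before(vv_b, vv_a)

a = {"r1": 2, "r2": 0}
b = {"r1": 1, "r2": 1}
print(concurrent(a, b))   # True -> siblings must be kept or merged by the app
print(merge(a, b))        # {'r1': 2, 'r2': 1}
```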
Partitioning
p220
split data into subsets (partitions), a.k.a. sharding
Transaction
chapter 7
concurrency control
dirty read
tx A reads uncommitted data written by tx B
prevented by the read-committed isolation level
dirty write
tx A overwrites uncommitted data written by tx B
prevented by almost all databases (e.g. via row-level locks)
read skew (non-repeatable read)
data is changed by another tx after tx A has read it
tx A sees different parts of the data at different points in time, not an immutable snapshot
prevented by snapshot isolation, implemented via Multi-Version Concurrency Control (MVCC)
Lost updates
2 tx concurrently perform conflicting read-modify-write cycles
solutions: 1) manually lock conflicting data, 2) … :question:
write skew
tx A reads some data, makes a decision based on it, and writes the result to the DB; by the time the write commits, the data read earlier has changed, so the premise of the decision no longer holds (e.g. two on-call doctors each see the other still on call and both go off call)
prevented by serializable isolation
phantom read
a write in tx B changes the result of a search query in tx A
snapshot isolation prevents straightforward phantom reads,
but phantoms in the context of write skew require special treatment, like index-range locks
Serializable Isolation
execute all tx in serial order
requires each tx to be small and fast, and total throughput to fit on a single CPU core
Two-phase locking
acquire locks on data as the tx reads/writes it, and release them all only at the end of the tx (the two phases: acquiring while executing, releasing at the end)
bad performance under high concurrency / contention
Serializable Snapshot Isolation
optimistic: proceed with the tx without blocking, check at commit time, abort if not serializable
bad performance at high contention because many tx get aborted and retried unnecessarily
Failures
p296
Detecting Faults