Data-Intensive App Design
Database
SQL vs NoSQL
SQL
better support for joins
support many-to-many relationships
NoSQL
schema flexibility
better performance due to data locality (only applies when large part of the doc is needed at the same time)
closer to how data is used in app code
cannot refer directly to a nested item within a document
poor support for joins: need multiple (transactional) queries to emulate a many-to-many relation, which is slower than a join and adds complexity to the app
:!:generally recommended to keep doc fairly small and avoid writes that increase the doc size
Models
relational / SQL
data is organized into relations (SQL tables)
each relation is an unordered collection of tuples (SQL rows)
support Many-to-Many relations via join
query optimizer can improve query efficiency
schema-on-write
schema is explicit and the database ensures all written data conforms to it
similar to static (compile-time) type checking
hierarchical / Document
every record has exactly 1 parent
schema-on-read
structure of data is implicit and only interpreted when data is read
similar to dynamic (runtime) type checking
entire document needs to be rewritten when updating; only modifications that do NOT change the encoded size of the doc can easily be performed in place
targets use cases where data comes in self-contained documents and relationships between docs are rare
network / CODASYL
generalization of hierarchical model
a record can have multiple parents
links between records are NOT foreign keys but more like pointers; need to walk the links along an access path to reach a record
querying and updating is complicated and inflexible if the access path is not known beforehand
graph-like
Property Graph
vertex
unique ID
set of outgoing edges
set of incoming edges
collection of properties
edge
unique ID
tail vertex (from vertex)
head vertex (to vertex)
properties
features
any vertex can connect to any other vertex
given any vertex, you can efficiently find both its incoming and outgoing edges, thus traverse the graph
you can store different kinds of relations in a single graph by using edge labels
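A minimal Python sketch of the vertex/edge structure above (class and function names are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    id: int
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)  # edges where this vertex is the tail
    in_edges: list = field(default_factory=list)   # edges where this vertex is the head

@dataclass
class Edge:
    id: int
    tail: Vertex          # "from" vertex
    head: Vertex          # "to" vertex
    label: str            # kind of relationship, enabling mixed relation types in one graph
    properties: dict = field(default_factory=dict)

def connect(eid, tail, head, label):
    e = Edge(eid, tail, head, label)
    tail.out_edges.append(e)   # supports efficient forward traversal
    head.in_edges.append(e)    # supports efficient backward traversal
    return e

alice = Vertex(1, {"name": "Alice"})
nyc = Vertex(2, {"name": "New York"})
connect(10, alice, nyc, "lives_in")
print([e.head.properties["name"] for e in alice.out_edges])  # ['New York']
```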
query languages
Cypher
p74
each query can be executed in multiple ways; the query optimizer analyzes them and chooses the most efficient one
Triple-store
all information is stored as triples: (subject, predicate, object), e.g. (Jim, likes, bananas)
subject is equivalent to a vertex in a graph
object is either a) a value of a primitive datatype (string or number), or b) another vertex in the graph
in a) the predicate and object together form a property of the subject (vertex)
in b) the predicate is an edge in the graph, with the subject as the tail vertex and the object as the head vertex, like subject --[predicate]-> object
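A toy Python illustration of how both cases fit in one list of triples (the helper is hypothetical; it uses the note's own example):

```python
# each fact is one (subject, predicate, object) triple
triples = [
    ("jim", "age", 33),             # a) object is a primitive value:
                                    #    predicate + object act as a property of vertex "jim"
    ("jim", "likes", "bananas"),    # b) object is another vertex:
                                    #    predicate acts as an edge jim --[likes]-> bananas
]

def objects_of(subject, predicate):
    # follow all edges (or read all properties) with this predicate
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects_of("jim", "likes"))   # ['bananas']
```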
SPARQL
p80
semantic web
p78
websites publish data in a consistent format, allowing data to be shared and combined into a web of data, a database of everything
Resource Description Framework (RDF)
targets use cases where anything is potentially related to everything
Query Languages
Declarative / SQL
describe what is the pattern of data you want, not how to do it
more concise, easier to use
decouples by hiding implementation details, allowing more room for optimization
can utilize parallelization
Imperative / CODASYL
describe how to do the query in certain operations in certain order
hard to parallelize because it depends on a specific execution order
MapReduce
mechanism to perform read-only queries across many documents
map and reduce must be pure functions:
cannot perform additional queries in map or reduce
must NOT have any side effect
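A minimal sketch of the idea, loosely following the MongoDB-style map/reduce the book describes (function and field names are hypothetical):

```python
from collections import defaultdict

# map and reduce must be pure: no side effects, no extra DB queries
def map_fn(doc):
    # emit (key, value) pairs, e.g. group observations by month
    yield (doc["date"][:7], doc["num_animals"])

def reduce_fn(key, values):
    return sum(values)

def map_reduce(docs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

docs = [{"date": "2025-01-12", "num_animals": 3},
        {"date": "2025-01-30", "num_animals": 4}]
print(map_reduce(docs, map_fn, reduce_fn))  # {'2025-01': 7}
```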
Index
Hash Index
p94
Compaction
p94
Concerns
File format
prefer binary format
Deleting records
mark with a tombstone and discard during merging in compaction
Crash recovery
restore by reading segment files from disk and rebuilding the hash map with the latest value for each key;
store snapshots of each segment's hash map on disk to speed this up
Partially written records
use checksums to detect and discard corrupted data
Concurrency control
have only 1 writer thread
data file segments are append-only and immutable, so they can be read by multiple threads concurrently
hash table MUST fit in memory
range queries are not efficient
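A toy Bitcask-style sketch of the append-only log plus in-memory hash index (class name hypothetical; the comma-separated record format is a toy stand-in for the binary format real engines use):

```python
import os

class HashIndexLog:
    """Append-only log file with an in-memory hash index."""
    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}       # key -> byte offset of latest record; must fit in memory

    def set(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(f"{key},{value}\n".encode())   # append-only, never overwrite
        self.f.flush()
        self.index[key] = offset                    # point the index at the newest record

    def get(self, key):
        offset = self.index[key]                    # one disk seek per lookup
        self.f.seek(offset)
        line = self.f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

db = HashIndexLog("data.log")
db.set("user:1", "alice")
db.set("user:1", "bob")    # the old record stays in the log until compaction
print(db.get("user:1"))    # 'bob'
```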
SSTable & LSM-Tree
p98
:star: Faster for write
Sorted String Table - key-value pairs are sorted by key
each key appears only once within each merged segment file
compaction is like a mergesort - copy to the output file the lowest key among the first keys in each segment file, and repeat until all files are merged
:question:what if the same key appears in multiple input segment files?
all the values in one input file must be more recent than those in the other file
so just keep the value from the most recent segment and discard the older one
no longer need an index of all the keys; a sparse in-memory index of some keys is enough, then scan from the nearest indexed key
group records into a block and compress before writing to disk, and let the index point to the start of each compressed block,
saving disk space and reducing the I/O bandwidth used
append records received in a write request into an on-disk log (Write-Ahead-Log)
add the same records to MemTable, an in-memory balanced search tree (red-black or AVL)
when the MemTable grows larger than a threshold (a few MB), write it out to disk as an SSTable file, and delete the corresponding WAL
to serve a read request, first try to find the key in the MemTable, then in the most recent segment, then older ones, etc. (use a bloom filter to skip segments that cannot contain the key)
run periodic compaction to combine segment files and discard old/deleted values
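A toy sketch of this write/read path (names hypothetical; a plain dict stands in for the balanced tree, and compaction is simplified from the streaming sorted-file merge described above):

```python
import json

class LSMSketch:
    """Toy LSM write/read path: WAL + MemTable, flushed to sorted segments."""
    MEMTABLE_LIMIT = 4          # real engines flush at a few MB

    def __init__(self):
        self.wal = open("wal.log", "a")
        self.memtable = {}      # stand-in for a red-black/AVL tree
        self.segments = []      # sorted segment dicts, oldest first

    def set(self, key, value):
        self.wal.write(json.dumps([key, value]) + "\n")  # for crash recovery
        self.wal.flush()
        self.memtable[key] = value
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            # write keys out in sorted order -> an SSTable segment
            self.segments.append(dict(sorted(self.memtable.items())))
            self.memtable = {}   # and the corresponding WAL could be deleted here

    def get(self, key):
        if key in self.memtable:              # 1) MemTable
            return self.memtable[key]
        for seg in reversed(self.segments):   # 2) newest segment first
            if key in seg:
                return seg[key]
        return None

    def compact(self):
        # merge segments; newer segments win for duplicate keys
        merged = {}
        for seg in self.segments:             # oldest -> newest
            merged.update(seg)
        self.segments = [merged]
```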
Compaction Strategies
:star: SizeTiered
newer and smaller SSTables are successively merged into older and larger SSTables
Leveled
the key range is split up into smaller SSTables and older data is moved into separate "levels",
allowing compaction to proceed more incrementally and use less disk space
Pros & Cons
Pros
sequential read/write, much faster than random access
WAL to ensure no data loss due to app crash
Cons
write amplification - a single write resulting in multiple writes to the disk over time
SSD can only overwrite blocks limited times
the more written to disk, the fewer writes per second it can handle
write bandwidth is shared by initial write to WAL, flushing Memtable and compaction in background
B-Tree
p102
:star: Faster for read
key-value pairs sorted by key, allowing efficient lookup and range query
fixed-size blocks/pages, closer to the underlying hardware design
each page is identified by an address, allowing one page to refer to another like pointers, constructing a tree of pages
Branching Factor
number of references to child pages in one page
Update
search for leaf page containing the key
change the value in that page
write the page back to disk, any ref to that page remains valid
Insert
find the page whose range encompasses the new key and add it.
if not enough space, split the page into 2 half-full pages and update the parent page to account for the new key ranges
:question: what if DB crashes half-way of a page splitting?
use a Write-Ahead Log to append modifications before applying them
remains balanced all the time, because splitting a page doesn't attach new child subtrees; splits bubble up toward the root page, so a B-tree with n keys always has depth O(log n)
Need concurrency control (e.g. latches) to avoid threads seeing the tree in an inconsistent state
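A minimal sketch of a lookup walking the tree of pages (structure is simplified; real pages are fixed-size byte blocks on disk, and the branching factor is typically several hundred):

```python
import bisect

BRANCHING_FACTOR = 4   # max child references per page in this toy

class Page:
    """One page: sorted keys plus child references (or values at leaves)."""
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted
        self.children = children    # len(keys) + 1 child pages; None at leaves
        self.values = values        # values, only at leaves

def lookup(page, key):
    while page.children is not None:           # walk down the tree of pages
        i = bisect.bisect_right(page.keys, key)
        page = page.children[i]                # follow the page reference
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

leaf1 = Page(keys=[10, 20], values=["a", "b"])
leaf2 = Page(keys=[40, 50], values=["c", "d"])
root = Page(keys=[30], children=[leaf1, leaf2])
print(lookup(root, 40))   # 'c'
```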
Optimizations
copy-on-write scheme
abbreviate keys to save disk space
additional pointers to point sibling pages for faster referencing
fractal trees
secondary index
p108
heap-file index
value in the index is a reference to the row stored elsewhere (in a heap file)
avoid duplicate data
efficient when updating value without changing the key
incurs an extra hop on reads
clustered index
store the indexed row directly within an index
covering index
in between heap-file and clustered index, store part of a table's columns within the index
multi-column index
concatenated index
combines several fields into one key by appending them
R-Tree
for multi-dimension index
App Type
OnLine Transaction Processing
high availability
low latency
critical to business operations
OnLine Analytic Processing
Data warehousing
Schemas
p114
Star
fact table
each row represents an event that occurred at a particular time
max flexibility for arbitrary analysis
dimension table
metadata of event, who/what/where/when/why/how
Snowflake
dimensions are further broken down to sub-dimensions
Extract-Transform-Load
optimized for analytic access patterns
Bottlenecks
bandwidth for getting data from disk to memory
bandwidth from main memory to CPU cache
Technique
vectorized processing
p120
bitmap encoding
p119
column-oriented storage
p118
significantly faster for ad-hoc analytical queries
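A toy illustration of bitmap encoding over a single stored column (data made up): each distinct value gets one bitmap, and predicates become cheap bitwise operations.

```python
# one column of a fact table, stored separately from the other columns
country = ["de", "us", "de", "fr", "us", "de"]

# one bitmap per distinct value: bit i is set iff row i has that value
bitmaps = {v: [int(x == v) for x in country] for v in set(country)}

# WHERE country IN ('de', 'fr')  ->  bitwise OR of the two bitmaps
matches = [a | b for a, b in zip(bitmaps["de"], bitmaps["fr"])]
print(matches)   # [1, 0, 1, 1, 0, 1] -> row ids 0, 2, 3, 5
```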
materialized view
p122
data cube
p124
Dataflow Models
model evolution
assume all the servers will be updated first, and all the clients after;
then only need backward compatibility on requests,
and only forward compatibility on responses
message broker
p158
buffer msgs if recipient unavailable - improve reliability
redelivers to crashed/failed consumers - avoid message loss
decouples producers from consumers
allows broadcasting
asynchronous communication
usually one-way, use another channel for async responses
actor model
decouple logic code from threads to avoid race condition, locking, deadlock etc.
at-most-once - no guarantee on msg delivery
Data Distribution
shared-nothing architecture
p168
Replication
keep multiple copies of same data on different nodes
data redundancy
improve performance
Leader/Follower
1 leader node, M follower nodes
write request must be sent to leader first
after leader has written data to its local storage, it sends data change log to followers,
followers apply data change log to their own local storages in the same order as leader did
read query can be handled by either leader or followers
How to?
setup new followers
p176
take consistent snapshots of leader's DB periodically (better without locking the entire DB)
Copy the snapshot to the new follower nodes
the follower then connects to the leader and asks for data changes since the snapshot
Failover
p178
determine leader has failed by heartbeat etc.
choose a new leader, best candidate is the follower with most up-to-date data
reconfigure the system so that all nodes reach consensus on the new leader
Issues
:!: in async replication, the new leader may not have received all the writes from the old leader before it failed; those writes are usually discarded
:!: split brain - multiple leaders
:!: hard to define a proper timeout; unnecessary failovers make things worse for a system already under high load or suffering network issues
Sync vs Async
p174
Sync
followers guaranteed to have an up-to-date copy of data, consistent with the leader
:!: leader must block all writes until the sync follower is available again
Async
leader keeps processing writes even if a follower has fallen behind
:!: writes acknowledged to the client are not guaranteed durable - they may be lost if the leader fails before replicating
semi-sync
1 sync follower and others are async followers
Replication log
p180
Statement-based
log every write request (statement) and send to followers
corrupts data when
any statement calls a non-deterministic function (e.g. NOW(), RAND())
statements depend on existing data in the DB (e.g. autoincrementing columns)
statements have side effects (triggers, stored procedures)
Write-Ahead Log
log is append-only sequence of bytes containing all the writes to db
the log describes data at a very low level (bytes in disk blocks), coupling replication to the storage engine; version and schema changes are hard to handle
Logical log replication
decouple log format from storage engine format
easier for backward compatibility
easier for external applications to parse
Trigger-based
more flexibility to handle replication in application code
greater overheads, more prone to bugs/limitations
Replication Lag
p183
leader-based replication with async followers introduces replication lag: followers may serve stale reads until they catch up
read-your-writes consistency
p184
when reading data that the user may have modified, read from leader
:!: not applicable where the user may edit most of the data, as that would route most read requests to the leader node
:!: if replicas distributed across multiple datacenters, any request that needs to be served by leader MUST be routed to the datacenter that contains the leader node
track time of last update, read from leader only if last update was within 1 min or so
prevent read from followers that are more than 1 min behind leader
client can remember the timestamp of its most recent write; the system then ensures any replica serving reads for that user reflects updates at least up to that timestamp (could be a logical timestamp, e.g. a log sequence number, since clock sync is hard)
if the same user accesses the service from multiple devices, need cross-device read-your-writes consistency, e.g. route requests from all of a user's devices to the same datacenter containing the leader
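A toy sketch of the logical-timestamp approach (all names hypothetical): the client keeps the log sequence number of its last write, and reads fall back to the leader unless a follower has caught up that far.

```python
class Replica:
    def __init__(self):
        self.data, self.applied_seq = {}, 0

leader = Replica()
followers = [Replica(), Replica()]
log_seq = 0

def write(key, value):
    global log_seq
    log_seq += 1
    leader.data[key] = value
    leader.applied_seq = log_seq
    return log_seq                 # client remembers this as its write timestamp

def read_your_writes(key, last_write_seq):
    # serve from any follower that has replayed the log at least this far,
    # otherwise fall back to the leader
    for f in followers:
        if f.applied_seq >= last_write_seq:
            return f.data.get(key)
    return leader.data.get(key)

seq = write("profile", "new bio")          # async replication still in flight
print(read_your_writes("profile", seq))    # 'new bio', served by the leader
```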
Monotonic reads
p185
a user will not see things moving backward in time when making multiple requests to different follower nodes
make sure each user always makes their reads from the same replica. e.g. choose a replica based on hash of user ID
:!: if the replica fails, need to reroute user queries to another replica
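A minimal sketch of the replica-pinning idea (the hash function choice is arbitrary):

```python
import hashlib

def replica_for(user_id: str, replicas: list):
    # the same user deterministically maps to the same replica,
    # so successive reads never move backward in time
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return replicas[h % len(replicas)]

print(replica_for("user-42", ["replica-a", "replica-b", "replica-c"]))
```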
Consistent Prefix Reads
p186
avoid violation of causality due to replication lag
if a sequence of writes happens in a certain order, then anyone reading those writes must see them appear in the same order
requires tracking the happens-before relationship,
e.g. make sure causally related writes are written to the same partition
:question: Ask yourself: "how does the app behave if the replication lag increases to minutes or hours?"
Pretending replication is sync when it is actually async is a recipe for problems
Multi-Leader
p189
mostly used in multi-datacenter apps, 1 leader per dc
normal leader-follower update within dc, leader replicates its changes to leaders in other dcs
:!: conflicts need to be resolved between leaders
Pros
every write is processed within the local dc and replicated asynchronously to other dcs - better performance
if a dc is down, the others can still serve, and the failed dc catches up from leaders in other dcs when it comes back
tolerance to temporary network issue in public internet
Cons
same data may be concurrently modified in different dcs, causing conflicts
also causes unexpected behavior with autoincrementing keys, triggers, integrity constraints etc.
:!: Danger zone, avoid multi-leader if possible
:star: CouchDB is designed for multi-leader
:star: Automatic Conflict Resolution
p195
Conflict-free Replicated DataTypes (Riak 2.0 supports)
Mergeable persistent data structures
Operational Transformation
Handle Write Conflict
p192
Converge toward a consistent state across replicas
all replicas MUST arrive at the same final value when all changes have been replicated
Last Write Wins
give each write a unique ID, pick the write with highest ID
:!: prone to data loss
:!: only safe way to use it - ensure each key is written once and then treated as immutable
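A toy illustration of why LWW loses data (timestamps made up): the write with the lower clock reading is silently dropped even though its client saw a success.

```python
# each write is tagged with (timestamp, node_id) as its unique ID;
# the highest ID wins, concurrent siblings are discarded
writes = [
    ((1700000000.120, "node-b"), "v1"),
    ((1700000000.118, "node-a"), "v2"),   # concurrent write, slower clock -> lost
]
winner = max(writes, key=lambda w: w[0])
print(winner[1])   # 'v1'; 'v2' is gone even though its client saw it succeed
```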
Higher-Numbered Replica Wins
give each replica a unique number; writes originating at a higher-numbered replica take precedence over those from lower-numbered replicas
:!: prone to data loss
Retain All Info and Resolve Later
record the conflict in an explicit data structure to preserve all information, write app code to resolve later
resolve on write
resolve as soon as conflict detected by DB, may run in background
resolve on read
resolve on next read
Leaderless
any node can handle writes as a coordinator
coordinator doesn't enforce ordering of writes
Read Repair
read from multiple nodes, return value with latest version and update stale data
Anti-Entropy Process
background process that checks stale data and updates it
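A toy sketch of a quorum read with read repair (class name and the single version counter are simplified stand-ins for real version vectors):

```python
class Replica:
    def __init__(self):
        self.data, self.version = {}, {}

def quorum_read(key, replicas, r):
    # ask r replicas; the highest version number wins
    responses = [(n.version.get(key, 0), n) for n in replicas[:r]]
    latest_version, _ = max(responses, key=lambda t: t[0])
    latest_value = next(n.data.get(key) for v, n in responses if v == latest_version)
    # read repair: push the winning value back to any stale replica we contacted
    for v, n in responses:
        if v < latest_version:
            n.data[key], n.version[key] = latest_value, latest_version
    return latest_value

a, b, c = Replica(), Replica(), Replica()
a.data["k"], a.version["k"] = "new", 2
b.data["k"], b.version["k"] = "old", 1
print(quorum_read("k", [a, b, c], r=2))   # 'new'
print(b.data["k"])                        # 'new' - b has been repaired
```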
Quorum Consistency
p202
#Writes + #Reads > #Replicas, i.e. w + r > n (e.g. n = 3, w = r = 2): the set of nodes written to and the set read from must overlap in at least one node holding the latest value
:!:Limitations
p202
with sloppy quorum, the w writes may end up on different nodes than the r reads, so the overlap guarantee no longer holds
if 2 writes occur concurrently, it is unclear which happened first; if the winner is picked by timestamp (last write wins), writes can be lost due to clock skew
if a write happens concurrently with a read, the write may be reflected on only some of the replicas; unclear whether the read returns the old or new value
if a write succeeded on some replicas but failed on others, the successful ones are NOT rolled back
if a node carrying a new value fails and is restored from a replica carrying an old value, the number of replicas storing the new value may fall below w, breaking the quorum condition
Hinted Handoff
p204
once the network interruption is fixed, any writes that an alternative node accepted on behalf of an unavailable node are sent to the appropriate "home" nodes
Sloppy Quorums
p204
when a client can reach some nodes of the DB, but not enough to form a quorum for a specific value,
the DB still accepts writes, writing them to alternative nodes as a hinted handoff
w writes and r reads are still required, but they may include alternative (handoff) nodes outside a value's designated "home" nodes
:!: no guarantee to read latest value until hinted handoff has completed
Detecting Concurrent Writes
p206
conflicts can also happen during read repair or hinted handoff
events may arrive in different order at different nodes
:star: happens-before relationship
p208
A happens before B if
B knows about A
B depends on A
B builds upon A
A and B are concurrent if neither happens-before the other
Version Vectors
p212
each replica increments its own version number when processing a write
also keeps track of the version numbers it has seen from each of the other replicas
dotted version vector in Riak 2.0
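A minimal sketch of version-vector comparison (function names hypothetical), which also makes the happens-before / concurrent definitions above concrete:

```python
def merge(vv_a, vv_b):
    # elementwise max combines everything both replicas have seen
    return {r: max(vv_a.get(r, 0), vv_b.get(r, 0)) for r in vv_a.keys() | vv_b.keys()}

def happens_before(vv_a, vv_b):
    # A happens-before B if B has seen everything A has seen (and they differ)
    return all(vv_a.get(r, 0) <= vv_b.get(r, 0) for r in vv_a) and vv_a != vv_b

def concurrent(vv_a, vv_b):
    return not happens_before(vv_a, vv_b) and not happens_before(vv_b, vv_a)

a = {"r1": 2, "r2": 0}
b = {"r1": 1, "r2": 1}
print(concurrent(a, b))   # True -> siblings must be kept or merged by the app
print(merge(a, b))        # {'r1': 2, 'r2': 1}
```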
Partitioning
p220
split data into subsets (partitions), a.k.a. sharding
Transaction
chapter 7
concurrency control
dirty read
tx A reads uncommitted data written by tx B
prevented by the read-committed isolation level
dirty write
tx A overwrites uncommitted data written by tx B
prevented by almost all databases (e.g. via row-level locks)
read skew (non-repeatable read)
data is changed by another tx after tx A has read it
tx A sees different parts of the data at different points in time, not an immutable snapshot
prevented by snapshot isolation, implemented via Multi-Version Concurrency Control (MVCC)
Lost updates
2 tx concurrently perform conflicting read-modify-write cycles
solutions: 1) manually lock conflicting data, 2) … :question:
write skew
tx A reads some data, makes a decision based on it, and writes the result to the DB; by the time the write commits, the data read earlier has changed, so the premise of the decision no longer holds (e.g. two on-call doctors each see the other still on call and both go off call)
prevented by serializable isolation
phantom read
a write in tx B changes the result of a search query in tx A
snapshot isolation prevents straightforward phantom reads,
but phantoms in the context of write skew require special treatment, like index-range locks
Serializable Isolation
execute all tx in serial order
requires each tx to be small and fast, and total throughput to fit on a single CPU core
Two-phase locking
acquire locks on data as the tx reads/writes it, and release them all only at the end of the tx (the two phases: acquiring while executing, releasing at the end)
bad performance under high concurrency / contention
Serializable Snapshot Isolation
optimistic: proceed with the tx without blocking, check at commit time, abort if not serializable
bad performance at high contention because many tx get aborted and retried unnecessarily
Failures
p296
Detecting Faults