Chapter 3: Data Storage Technology

Evolution of Data Storage

Punch cards (1837-1980)

Floppy Disk (1967-1990)

Flash Drives (1999)

Cloud Storage

Secure Digital (SD) card (1999)

On-Disk Storage

Utilizes low-cost hard-disk drives for long-term storage

Implemented via distributed file system or a database


Data wrangling is the process of organizing and refining data obtained from external sources by filtering, cleansing, and preparing it for analysis.

Distributed file-system

A distributed file system is a file system that stores large files spread across the nodes of a cluster

It enables files to be accessed from multiple locations

A file system is the method of storing and organizing data on a storage device

Supports schema-less data storage

A DFS storage device ensures redundancy and high availability by copying data to multiple locations through replication.

Simple, fast-access data storage; non-relational in nature, with fast read/write capability

Multiple smaller files are generally combined into a single file to enable optimal storage and processing
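That small-files workaround can be sketched in Python: many small files are packed into one container blob with an offset index so each file can still be retrieved individually. The function names and layout here are illustrative, not any particular DFS format.

```python
import io

def pack_files(files):
    """Pack several small files into one container blob plus an offset index.

    `files` maps filename -> bytes. Returns (blob, index) where index maps
    filename -> (offset, length), so any file can be sliced back out.
    """
    blob = io.BytesIO()
    index = {}
    for name, data in files.items():
        index[name] = (blob.tell(), len(data))  # remember where this file starts
        blob.write(data)
    return blob.getvalue(), index

def unpack_file(blob, index, name):
    """Recover one original file from the packed blob."""
    offset, length = index[name]
    return blob[offset:offset + length]

blob, idx = pack_files({"a.txt": b"hello", "b.txt": b"world"})
```

Storing one large blob instead of many tiny files keeps per-file metadata overhead low, which is why DFS deployments prefer it.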

Relational DBMS

ACID-compliant – restricted to a single node.

Do not provide out-of-the-box redundancy and fault tolerance.

Less ideal for long-term storage of data that accumulates over time.

Manually sharded – complicates data processing when data from multiple shards is required.

Schema-based; not suitable for semi-structured and unstructured data

Data must be checked against schema constraints – creates latency.


ACID is a transaction management approach that uses pessimistic concurrency control to maintain consistency by applying record locks.

Atomicity – ensures that all operations will always succeed or fail completely.

Consistency – ensures that the database always remains in a consistent state by allowing only data that conforms to the constraints of the database schema.

Isolation – ensures that the results of a transaction are not visible to other operations until it is complete.

Durability – ensures that the results of an operation are permanent.
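These properties can be observed with SQLite, whose transactions are ACID. In this sketch (table and account names are invented for illustration), atomicity means that when one statement in a transfer violates a constraint, the whole transaction rolls back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # consistency: schema constraint
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; succeeds or fails as a whole (atomicity)."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
    except sqlite3.IntegrityError:
        return False  # CHECK constraint violated -> entire transaction rolled back
    return True

transfer(conn, "alice", "bob", 30)    # succeeds: both updates applied
transfer(conn, "alice", "bob", 1000)  # fails: balance would go negative, nothing applied
```

After both calls, the balances reflect only the successful transfer (alice 70, bob 80): the failed transfer changed nothing, even though its first UPDATE alone would have been valid.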

NoSQL database

A non-relational database that is highly scalable, fault-tolerant, and specifically designed to house semi-structured and unstructured data.

Often provides an API-based query interface that can be called from within an application.

Also supports query languages other than Structured Query Language (SQL), because SQL was designed to query structured data stored within a relational database.
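As a minimal illustration of an API-based, schema-less query interface, here is a toy in-memory document store in Python. The class and method names are invented for this sketch and do not correspond to any real NoSQL product.

```python
class DocumentStore:
    """Toy document store: queries go through an API, not SQL text."""

    def __init__(self):
        self._docs = {}

    def put(self, key, doc):
        self._docs[key] = doc

    def get(self, key):
        return self._docs.get(key)

    def find(self, predicate):
        """Query by passing a callable instead of a SQL string."""
        return [d for d in self._docs.values() if predicate(d)]

db = DocumentStore()
# Schema-less: each document may carry different fields.
db.put("u1", {"name": "Ada", "langs": ["python", "sql"]})
db.put("u2", {"name": "Linus", "city": "Helsinki"})
python_users = db.find(lambda d: "python" in d.get("langs", []))
```

Note how the two documents have different fields; nothing like a table schema had to be declared before writing them.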

Sharding

is splitting a big dataset into smaller parts called shards.

Each shard is stored on a different server or machine.

Every server holds only the data assigned to it.

All shards have the same structure and together they make up the full dataset
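The steps above can be sketched as hash-based sharding in Python, assuming a hypothetical four-shard cluster: each key is deterministically mapped to one shard, so every server holds only its assigned records, and together the shards form the full dataset.

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size for this sketch

def shard_for(key, num_shards=NUM_SHARDS):
    """Deterministically map a record key to a shard id in [0, num_shards)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Distribute a dataset: each bucket below represents one server's shard.
shards = {i: [] for i in range(NUM_SHARDS)}
for user_id in ["u1", "u2", "u3", "u4", "u5", "u6"]:
    shards[shard_for(user_id)].append(user_id)
```

Because the mapping is deterministic, any client can compute which shard holds a key without consulting a central index; the trade-off, as noted above, is that queries touching several shards become more complicated.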

Replication

Stores multiple copies of a dataset, called replicas, on multiple nodes.

Provides scalability and availability because the same data is replicated on multiple nodes.

Fault tolerance is achieved since data redundancy ensures that data is not lost when an individual node fails.

Two different methods used

master-slave

peer-to-peer


In master-slave replication, data is written to a master node and then copied to multiple slave nodes.

Write requests (insert, update, and delete) go to the master, while read requests can be handled by any slave.

To address read inconsistency, a voting system is used.

In peer-to-peer replication, all nodes operate at the same level.

Each node, a peer, is equally capable of handling reads and writes.

Each write is copied to all peers.

Prone to write inconsistencies caused by simultaneous updates.
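The master-slave scheme above can be sketched in a few lines of Python. This toy version replicates synchronously, so it side-steps the read-inconsistency problem that real asynchronous replication must handle with mechanisms such as voting.

```python
import random

class MasterSlaveStore:
    """Toy master-slave replication: writes hit the master and are copied
    to every slave; reads can be served by any slave."""

    def __init__(self, num_slaves=3):
        self._master = {}
        self._slaves = [{} for _ in range(num_slaves)]

    def write(self, key, value):
        """Insert/update goes to the master, then to each slave replica."""
        self._master[key] = value
        for slave in self._slaves:  # synchronous replication, for simplicity
            slave[key] = value

    def read(self, key):
        """Read requests are spread across the slave replicas."""
        return random.choice(self._slaves).get(key)

store = MasterSlaveStore()
store.write("user:1", "Ada")
```

In a real system the copy to the slaves happens asynchronously, so a read that lands on a not-yet-updated slave can return stale data, which is exactly the inconsistency described above.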

NoSQL VS SQL

SQL                                      NoSQL
Row-oriented                             Column-oriented
Fixed schema                             Flexible schema; columns can be added later
Not optimized for sparse tables          Good with sparse tables
Optimized for join operations            Joins via MapReduce (MR); not optimized
Not integrated with key-value systems    Tight integration with key-value systems
Hard to shard and scale                  Horizontal scalability
Only for structured data                 Good for structured, semi-structured, and unstructured data

Hadoop Distributed File System (HDFS)

HDFS is a clustered system for handling files in big data environments.

It's not the final stop for files but provides crucial capabilities for managing high volumes and speeds of data.

Because data is written once and read many times, it's ideal for big data analysis.

Motivations for developing HDFS

Hardware failure

The need for streaming access

Large datasets

Data coherency issue

Moving computation is cheaper than moving data

Portability across heterogeneous platforms

Hadoop Distributed File System

HDFS works by breaking large files into smaller pieces called blocks.

The NameNode acts as a “traffic cop,” managing all access to the files.

The blocks are stored on DataNodes.

Files are broken into large blocks.

Typically, 128 MB block size

Blocks are replicated for reliability

One replica on the local node, a second replica on a node in a remote rack, a third replica on a different node in that same remote rack; additional replicas are randomly placed

Understands rack locality

Data placement exposed so that computation can be migrated to data

Client talks to both NameNode and DataNodes

Data is not sent through the NameNode; the client accesses data directly from the DataNodes.

Throughput of file system scales nearly linearly with the number of nodes.
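The block-splitting step can be sketched as follows. The 128 MB constant matches the default block size mentioned above, though the demo call uses a scaled-down size so it runs instantly.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Break a file's bytes into fixed-size blocks, as HDFS does on write.

    Every block is full size except possibly the last one.
    """
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Scaled-down demo: a 300-byte "file" with a 128-byte block size splits the
# same way a 300 MB file would with 128 MB blocks: 128 + 128 + 44.
blocks = split_into_blocks(b"x" * 300, block_size=128)
```

Each resulting block would then be handed to DataNodes for storage and replication, while the NameNode records which blocks make up the file.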

HDFS Architecture

HDFS uses a master-slave architecture. Each cluster comprises a single master node and multiple slave nodes.

A file is divided into one or more blocks, and each block is stored on different slave machines depending on the replication factor.

Master node – stores and manages the file system namespace, that is, information about files and their blocks, such as block locations and permissions.

Slave nodes – store the data blocks of files.

The NameNode is the core of Hadoop's file system.

It manages the file system structure and grants access permissions.

HDFS splits large files into blocks stored on data nodes.

The NameNode keeps track of which blocks on which nodes make up the complete file, and it coordinates all file operations, such as reads, writes, and replication, across the DataNodes.


DataNodes (slave nodes) are the storage units in Hadoop HDFS.

They hold blocks of files and, through replication, are resilient to failure.

The NameNode manages access and replication of blocks across multiple DataNodes.

This replication system works efficiently when all nodes are grouped into racks.

The NameNode uses a "rack ID" to organize and track DataNodes in the cluster.

HDFS Key Features

Rack awareness

High availability

Data block

Replication management

Data read and write operations

The NameNode in Hadoop uses rack information to decide where to place data blocks on DataNodes.

This ensures fault-tolerance and minimizes latency for reading and writing data.

The replication policy typically creates three replicas:

the first on the same node,

the second on a different rack, and

the third on a different node in the same rack.

By choosing the closest rack, latency is reduced. This setup increases data availability and reliability, and improves network bandwidth utilization.
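That placement policy can be sketched in Python, reading "the same rack" as the same remote rack that holds the second replica; the cluster layout below is a made-up example.

```python
import random

def place_replicas(writer_node, cluster):
    """Pick three DataNodes for a block following the described policy:
    first replica on the writer's own node, second on a node in a
    different rack, third on another node in that same remote rack.

    `cluster` maps rack id -> list of node names.
    """
    local_rack = next(r for r, nodes in cluster.items() if writer_node in nodes)
    remote_rack = random.choice([r for r in cluster if r != local_rack])
    second, third = random.sample(cluster[remote_rack], 2)  # two distinct nodes
    return [writer_node, second, third]

# Hypothetical two-rack cluster for the demo.
cluster = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
}
replicas = place_replicas("n1", cluster)
```

Keeping two of the three replicas on one remote rack limits cross-rack traffic on write, while still surviving the loss of an entire rack, which is the availability/bandwidth trade-off described above.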