Chapter 3: Data Storage Technology
Evolution of Data Storage
Punch cards (1837-1980)
Floppy disk (1967-1990)
Flash drives (1999)
Cloud storage
Secure Digital (SD) card (1999)
On-Disk Storage
Utilizes low-cost hard-disk drives for long-term storage
Implemented via a distributed file system or a database
Data wrangling is the process of organizing and refining data obtained from external sources by filtering, cleansing, and preparing it for analysis.
Distributed file-system
A distributed file system is a file system that can store large files spread across the nodes of a cluster
It enables the files to be accessed from multiple locations
Supports schema-less data storage
A DFS storage device ensures redundancy and high availability by copying data to multiple locations through replication.
Simple, fast-access data storage; non-relational in nature, with fast read/write capability
Multiple smaller files are generally combined into a single file to enable optimal storage and processing
A file system is the method of storing and organizing data on a storage device
Relational DBMS
ACID-compliant – restricted to a single node.
Atomicity
ensures that a transaction's operations either all succeed or all fail completely
Consistency
ensures that the database always remains in a consistent state, by allowing only data that conforms to the constraints of the database schema to be written.
Isolation
ensures that the results of a transaction are not visible to other operations until it is complete.
Durability
ensures that the results of an operation are permanent
Do not provide out-of-the-box redundancy and fault tolerance.
Less ideal for long-term storage of data that accumulates over time.
Manually sharded – complicates data processing when data from multiple shards is required.
Schema-based; not suitable for semi-structured & unstructured data
Data needs to be checked against schema constraints, which creates latency.
ACID is a transaction management approach that uses pessimistic concurrency control to maintain consistency by applying record locks.
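The record locking described above can be sketched in a few lines. This is a minimal illustration, not how any particular RDBMS implements it; the account records and per-record lock granularity are hypothetical:

```python
import threading

# Hypothetical record store with one lock per record
# (pessimistic concurrency control).
records = {"acct_1": 100, "acct_2": 50}
locks = {key: threading.Lock() for key in records}

def transfer(src, dst, amount):
    # Acquire locks in a fixed (sorted) order to avoid deadlock.
    first, second = sorted([src, dst])
    with locks[first], locks[second]:
        if records[src] < amount:
            # The whole transaction fails; no partial update (atomicity).
            raise ValueError("insufficient funds")
        records[src] -= amount
        # Concurrent transactions see either the old state or the new
        # state, never a half-applied transfer (isolation).
        records[dst] += amount

transfer("acct_1", "acct_2", 30)
print(records)  # {'acct_1': 70, 'acct_2': 80}
```

Holding both locks for the duration of the transfer is exactly the pessimistic trade-off the text mentions: consistency is guaranteed, but concurrent writers to the same records must wait.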
NoSQL database
A non-relational database that is highly scalable, fault-tolerant, and specifically designed to house semi-structured and unstructured data
Often provides an API-based query interface that can be called from within an application
Also supports query languages other than Structured Query Language (SQL), because SQL was designed to query structured data stored within a relational database.
Sharding
Sharding is the process of splitting a large dataset into smaller parts called shards.
Each shard is stored on a different server or machine.
Every server holds only the data assigned to it.
All shards have the same structure and together they make up the full dataset
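The routing of records to shards can be sketched with a simple hash function; the shard count and keys here are made up for illustration:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count

def shard_for(key: str) -> int:
    """Map a record key to a shard using a stable hash."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Each shard holds only the records routed to it;
# together the shards make up the full dataset.
shards = {i: {} for i in range(NUM_SHARDS)}
for user_id in ["u1", "u2", "u3", "u4", "u5"]:
    shards[shard_for(user_id)][user_id] = {"id": user_id}

# Every record lands on exactly one shard.
print(sum(len(s) for s in shards.values()))  # 5
```

Because the hash is deterministic, any server can compute which shard owns a key without consulting the others, which is what makes sharding scale horizontally.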
Replication
Replication stores multiple copies of a dataset, called replicas, on multiple nodes
Provides scalability and availability because the same data is replicated on various nodes.
Fault tolerance is achieved since data redundancy ensures that data is not lost when an individual node fails.
Two different methods used
master-slave
In master-slave replication, data is written to a master node and then copied to multiple slave nodes.
Write requests (insert, update, and delete) go to the master, while read requests can be handled by any slave.
To address read inconsistency, a voting system is used.
peer-to-peer
All nodes operate at the same level
Each node, a peer, is equally capable of handling reads and writes.
Each write is copied to all peers.
Prone to write inconsistencies – simultaneous updates
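The master-slave scheme above can be sketched as a toy in-memory store. This is a deliberately simplified model (synchronous replication, no failover), with made-up class and method names:

```python
class MasterSlaveStore:
    """Toy master-slave replication: writes go to the master, which
    copies them to every slave; reads can be served by any slave."""

    def __init__(self, num_slaves=2):
        self.master = {}
        self.slaves = [{} for _ in range(num_slaves)]

    def write(self, key, value):
        # Insert/update/delete requests always go to the master...
        self.master[key] = value
        # ...which propagates the change to each slave
        # (synchronously here, for simplicity).
        for slave in self.slaves:
            slave[key] = value

    def read(self, key, slave_index=0):
        # Read requests can be handled by any slave.
        return self.slaves[slave_index].get(key)

store = MasterSlaveStore()
store.write("x", 1)
print(store.read("x", slave_index=1))  # 1
```

In a real system the copy to the slaves is usually asynchronous, which is where the read inconsistency mentioned above comes from: a slave may briefly serve a value the master has already overwritten.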
NoSQL VS SQL
SQL
Row-oriented
Fixed schema
Not optimized for sparse tables
Optimized for join operations
No tight integration with key-value systems
Hard to shard and scale
Only for structured data
NoSQL
Column-oriented
Flexible schema; columns can be added later
Good with sparse tables
Joins via MapReduce – not optimized
Tight integration with key-value systems
Horizontal scalability
Good for semi-structured, unstructured, and structured data
Hadoop Distributed File System (HDFS)
HDFS is a clustered system for handling files in a big data environment.
It's not the final stop for files but provides crucial capabilities for managing high volumes and speeds of data.
Because data is written once and read many times, it's ideal for big data analysis.
Motivations for developing HDFS
Hardware failure
The need for streaming access
Large datasets
Simple data coherency model
Moving computation is cheaper than moving data
Portability across heterogeneous platforms
Hadoop Distributed File System
HDFS works by breaking large files into smaller pieces called blocks.
The NameNode also acts as a “traffic cop,” managing all access to the files.
The blocks are stored on data nodes
Files are broken into large blocks.
Typically, 128 MB block size
Blocks are replicated for reliability
One replica on the local node, a second replica on a remote rack, a third replica on the local rack; additional replicas are placed randomly
Understands rack locality
Data placement exposed so that computation can be migrated to data
Client talks to both NameNode and DataNodes
Data is not sent through the NameNode; the client accesses data directly from the DataNodes
Throughput of file system scales nearly linearly with the number of nodes.
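The block-splitting step above is simple enough to sketch directly; the function name is made up, but the 128 MB default block size is HDFS's:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (offset, length) for each block a file would be split into."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))                        # 3
print(blocks[-1][1] // (1024 * 1024))     # 44
```

Because each block can live on a different DataNode, reads of a large file fan out across the cluster, which is why throughput scales nearly linearly with the number of nodes.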
The NameNode is the core of Hadoop's file system.
It manages the file system structure and grants access permissions.
HDFS splits large files into blocks stored on data nodes.
The NameNode keeps track of which blocks on which nodes form the complete file, and it handles all file operations like reads, writes, and replication on data nodes.
DataNodes (slave nodes) are the storage units of Hadoop HDFS.
They hold blocks of files; resilience comes from replicating those blocks.
The NameNode manages access and replication of blocks across multiple DataNodes.
This replication system works efficiently when all nodes are grouped into racks.
The NameNode uses a "rack ID" to organize and track DataNodes in the cluster.
HDFS Architecture
Master-slave architecture: each cluster comprises a single master node and multiple slave nodes
Master node
Stores and manages the file system namespace, that is, information about blocks of files such as block locations and permissions
slave node
The slave nodes store the data blocks of files.
A file is divided into one or more blocks, and each block is stored on different slave machines depending on the replication factor
HDFS Key Features
Rack awareness
The NameNode in Hadoop uses rack information to decide where to place data blocks on DataNodes.
This ensures fault-tolerance and minimizes latency for reading and writing data.
The replication policy typically creates three replicas:
the first on the same node,
the second on a different rack, and
the third on a different node in the same rack.
By choosing the closest rack, latency is reduced. This setup increases data availability, reliability, and improves network bandwidth
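The three-replica policy above can be sketched against a toy cluster topology. The rack and node names are hypothetical; following the policy described, the third replica goes on a different node in the same remote rack as the second:

```python
import random

# Hypothetical cluster topology: rack ID -> DataNodes in that rack.
cluster = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
}

def place_replicas(writer_node, writer_rack):
    """First replica on the writer's node, second on a node in a
    different rack, third on another node in that same remote rack."""
    first = writer_node
    remote_rack = random.choice([r for r in cluster if r != writer_rack])
    second = random.choice(cluster[remote_rack])
    third = random.choice([n for n in cluster[remote_rack] if n != second])
    return [first, second, third]

replicas = place_replicas("dn1", "rack1")
print(replicas)  # e.g. ['dn1', 'dn5', 'dn4']
```

Spanning two racks means a whole-rack failure cannot take out all copies, while keeping two replicas in one rack limits the cross-rack traffic needed to write them.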
High availability
Data block
Replication management
Data read and write operations