Chapter 3: Data Storage Technology

Evolution of Data Storage

Punch cards (1837-1980)

Floppy Disk (1967-1990)

Flash Drives (1999)

Cloud Storage

Secure Digital (SD) card (1999)

On-Disk Storage

Utilizes low-cost hard-disk drives for long-term storage

Implemented via distributed file system or a database


Data wrangling is the process of organizing and refining data obtained from external sources by filtering, cleansing, and preparing it for analysis.

Distributed file-system

A distributed file system is a file system that stores large files spread across the nodes of a cluster

It enables files to be accessed from multiple locations

A file system is the method of storing and organizing data on a storage device

Supports schema-less data storage

A DFS storage device ensures redundancy and high availability by copying data to multiple locations through replication.

Simple, fast-access data storage; non-relational in nature, with fast read/write capability

Multiple smaller files are generally combined into a single file to enable optimal storage and processing
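That small-files workaround can be sketched in Python: many small files are packed into one container blob with an offset index so each file can still be retrieved individually. The function names and layout here are illustrative, not any particular DFS format.

```python
import io

def pack_files(files):
    """Pack several small files into one container blob plus an offset index.

    `files` maps filename -> bytes. Returns (blob, index) where index maps
    filename -> (offset, length), so any file can be sliced back out.
    """
    blob = io.BytesIO()
    index = {}
    for name, data in files.items():
        index[name] = (blob.tell(), len(data))  # remember where this file starts
        blob.write(data)
    return blob.getvalue(), index

def unpack_file(blob, index, name):
    """Recover one original file from the packed blob."""
    offset, length = index[name]
    return blob[offset:offset + length]

blob, idx = pack_files({"a.txt": b"hello", "b.txt": b"world"})
```

Storing one large blob instead of many tiny files keeps per-file metadata overhead low, which is why DFS deployments prefer it.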

Relational DBMS

ACID-compliant – restricted to a single node.

Do not provide out-of-the-box redundancy and fault tolerance.

Less ideal for long-term storage of data that accumulates over time.

Manually sharded – complicates data processing when data from multiple shards is required.

Schema-based; not suitable for semi-structured and unstructured data

Data must be checked against schema constraints – creates latency.


ACID is a transaction management approach that uses pessimistic concurrency control to maintain consistency by applying record locks.

Atomicity – ensures that all operations will always succeed or fail completely.

Consistency – ensures that the database always remains in a consistent state by allowing only data that conforms to the constraints of the database schema.

Isolation – ensures that the results of a transaction are not visible to other operations until it is complete.

Durability – ensures that the results of an operation are permanent.
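These properties can be observed with SQLite, whose transactions are ACID. In this sketch (table and account names are invented for illustration), atomicity means that when one statement in a transfer violates a constraint, the whole transaction rolls back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # consistency: schema constraint
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; succeeds or fails as a whole (atomicity)."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
    except sqlite3.IntegrityError:
        return False  # CHECK constraint violated -> entire transaction rolled back
    return True

transfer(conn, "alice", "bob", 30)    # succeeds: both updates applied
transfer(conn, "alice", "bob", 1000)  # fails: balance would go negative, nothing applied
```

After both calls, the balances reflect only the successful transfer (alice 70, bob 80): the failed transfer changed nothing, even though its first UPDATE alone would have been valid.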

NoSQL database

A non-relational database that is highly scalable, fault-tolerant, and specifically designed to house semi-structured and unstructured data.

Often provides an API-based query interface that can be called from within an application.

Also supports query languages other than Structured Query Language (SQL), because SQL was designed to query structured data stored within a relational database.
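As a minimal illustration of an API-based, schema-less query interface, here is a toy in-memory document store in Python. The class and method names are invented for this sketch and do not correspond to any real NoSQL product.

```python
class DocumentStore:
    """Toy document store: queries go through an API, not SQL text."""

    def __init__(self):
        self._docs = {}

    def put(self, key, doc):
        self._docs[key] = doc

    def get(self, key):
        return self._docs.get(key)

    def find(self, predicate):
        """Query by passing a callable instead of a SQL string."""
        return [d for d in self._docs.values() if predicate(d)]

db = DocumentStore()
# Schema-less: each document may carry different fields.
db.put("u1", {"name": "Ada", "langs": ["python", "sql"]})
db.put("u2", {"name": "Linus", "city": "Helsinki"})
python_users = db.find(lambda d: "python" in d.get("langs", []))
```

Note how the two documents have different fields; nothing like a table schema had to be declared before writing them.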

Sharding

is splitting a big dataset into smaller parts called shards.

Each shard is stored on a different server or machine.

Every server holds only the data assigned to it.

All shards have the same structure and together they make up the full dataset
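The steps above can be sketched as hash-based sharding in Python, assuming a hypothetical four-shard cluster: each key is deterministically mapped to one shard, so every server holds only its assigned records, and together the shards form the full dataset.

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size for this sketch

def shard_for(key, num_shards=NUM_SHARDS):
    """Deterministically map a record key to a shard id in [0, num_shards)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Distribute a dataset: each bucket below represents one server's shard.
shards = {i: [] for i in range(NUM_SHARDS)}
for user_id in ["u1", "u2", "u3", "u4", "u5", "u6"]:
    shards[shard_for(user_id)].append(user_id)
```

Because the mapping is deterministic, any client can compute which shard holds a key without consulting a central index; the trade-off, as noted above, is that queries touching several shards become more complicated.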

Replication

Stores multiple copies of a dataset, called replicas, on multiple nodes.

Provides scalability and availability because the same data is replicated on multiple nodes.

Fault tolerance is achieved since data redundancy ensures that data is not lost when an individual node fails.

Two different methods used

master-slave

peer-to-peer


In master-slave replication, data is written to a master node and then copied to multiple slave nodes.

Write requests (insert, update, and delete) go to the master, while read requests can be handled by any slave.

To address read inconsistency, a voting system is used.

In peer-to-peer replication, all nodes operate at the same level.

Each node, a peer, is equally capable of handling reads and writes.

Each write is copied to all peers.

Prone to write inconsistencies caused by simultaneous updates.
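The master-slave scheme above can be sketched in a few lines of Python. This toy version replicates synchronously, so it side-steps the read-inconsistency problem that real asynchronous replication must handle with mechanisms such as voting.

```python
import random

class MasterSlaveStore:
    """Toy master-slave replication: writes hit the master and are copied
    to every slave; reads can be served by any slave."""

    def __init__(self, num_slaves=3):
        self._master = {}
        self._slaves = [{} for _ in range(num_slaves)]

    def write(self, key, value):
        """Insert/update goes to the master, then to each slave replica."""
        self._master[key] = value
        for slave in self._slaves:  # synchronous replication, for simplicity
            slave[key] = value

    def read(self, key):
        """Read requests are spread across the slave replicas."""
        return random.choice(self._slaves).get(key)

store = MasterSlaveStore()
store.write("user:1", "Ada")
```

In a real system the copy to the slaves happens asynchronously, so a read that lands on a not-yet-updated slave can return stale data, which is exactly the inconsistency described above.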

NoSQL VS SQL

SQL                                      NoSQL
Row-oriented                             Column-oriented
Fixed schema                             Flexible schema; columns can be added later
Not optimized for sparse tables          Good with sparse tables
Optimized for join operations            Joins via MapReduce (MR); not optimized
Not integrated with key-value systems    Tight integration with key-value systems
Hard to shard and scale                  Horizontal scalability
Only for structured data                 Good for structured, semi-structured, and unstructured data

Hadoop Distributed File System (HDFS)

HDFS is a clustered system for handling files in big data environments.

It's not the final stop for files but provides crucial capabilities for managing high volumes and speeds of data.

Because data is written once and read many times, it's ideal for big data analysis.

Motivations for developing HDFS

Hardware failure

The need for streaming access

Large datasets

Data coherency issue

Moving computation is cheaper than moving data

Portability across heterogeneous platforms

Hadoop Distributed File System

HDFS works by breaking large files into smaller pieces called blocks.

The NameNode acts as a “traffic cop,” managing all access to the files.

The blocks are stored on DataNodes.

Files are broken into large blocks.

Typically, 128 MB block size

Blocks are replicated for reliability

One replica on the local node, a second replica on a node in a remote rack, a third replica on a different node in that same remote rack; additional replicas are randomly placed

Understands rack locality

Data placement exposed so that computation can be migrated to data

Client talks to both NameNode and DataNodes

Data is not sent through the NameNode; the client accesses data directly from the DataNodes.

Throughput of file system scales nearly linearly with the number of nodes.
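The block-splitting step can be sketched as follows. The 128 MB constant matches the default block size mentioned above, though the demo call uses a scaled-down size so it runs instantly.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Break a file's bytes into fixed-size blocks, as HDFS does on write.

    Every block is full size except possibly the last one.
    """
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Scaled-down demo: a 300-byte "file" with a 128-byte block size splits the
# same way a 300 MB file would with 128 MB blocks: 128 + 128 + 44.
blocks = split_into_blocks(b"x" * 300, block_size=128)
```

Each resulting block would then be handed to DataNodes for storage and replication, while the NameNode records which blocks make up the file.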

HDFS Architecture

HDFS uses a master-slave architecture. Each cluster comprises a single master node and multiple slave nodes.

A file is divided into one or more blocks, and each block is stored on different slave machines depending on the replication factor.

Master node – stores and manages the file system namespace, that is, information about files and their blocks, such as block locations and permissions.

Slave nodes – store the data blocks of files.

The NameNode is the core of Hadoop's file system.

It manages the file system structure and grants access permissions.

HDFS splits large files into blocks stored on data nodes.

The NameNode keeps track of which blocks on which nodes make up the complete file, and it coordinates all file operations, such as reads, writes, and replication, across the DataNodes.


DataNodes (slave nodes) are the storage units in Hadoop HDFS.

They hold blocks of files and, through replication, are resilient to failure.

The NameNode manages access and replication of blocks across multiple DataNodes.

This replication system works efficiently when all nodes are grouped into racks.

The NameNode uses a "rack ID" to organize and track DataNodes in the cluster.

HDFS Key Features

Rack awareness

High availability

Data block

Replication management

Data read and write operations

The NameNode in Hadoop uses rack information to decide where to place data blocks on DataNodes.

This ensures fault-tolerance and minimizes latency for reading and writing data.

The replication policy typically creates three replicas:

the first on the same node,

the second on a different rack, and

the third on a different node in the same rack.

By choosing the closest rack, latency is reduced. This setup increases data availability and reliability, and improves network bandwidth utilization.
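That placement policy can be sketched in Python, reading "the same rack" as the same remote rack that holds the second replica; the cluster layout below is a made-up example.

```python
import random

def place_replicas(writer_node, cluster):
    """Pick three DataNodes for a block following the described policy:
    first replica on the writer's own node, second on a node in a
    different rack, third on another node in that same remote rack.

    `cluster` maps rack id -> list of node names.
    """
    local_rack = next(r for r, nodes in cluster.items() if writer_node in nodes)
    remote_rack = random.choice([r for r in cluster if r != local_rack])
    second, third = random.sample(cluster[remote_rack], 2)  # two distinct nodes
    return [writer_node, second, third]

# Hypothetical two-rack cluster for the demo.
cluster = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
}
replicas = place_replicas("n1", cluster)
```

Keeping two of the three replicas on one remote rack limits cross-rack traffic on write, while still surviving the loss of an entire rack, which is the availability/bandwidth trade-off described above.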