Relationship between MapReduce & HDFS
HDFS stores data in blocks across the cluster.
Each MapReduce map task processes one HDFS data block.
Efficient job execution relies on minimizing data locality misses.
If data isn’t local (data locality miss), the task fetches it from another node, causing delays.
Tasks run faster when they run on the same node as their data (data locality).
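The scheduling preference described above can be sketched in a few lines. This is a minimal illustration, not Hadoop's actual scheduler; `schedule_task` and the node names are hypothetical:

```python
def schedule_task(block_replicas, free_nodes):
    """Prefer a free node that already holds a replica of the task's
    block (data-local); otherwise fall back to any free node, which
    forces a network fetch (a data locality miss)."""
    for node in free_nodes:
        if node in block_replicas:
            return node, True            # data-local: no transfer needed
    if free_nodes:
        return free_nodes[0], False      # locality miss: fetch over network
    return None, False                   # no free capacity right now
```

For example, if a block is replicated on n1, n2 and n3, and nodes n2 and n4 have free slots, the task lands on n2 and runs data-local.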
2 - Clients, DataNodes and HDFS storage
Data Upload: Clients either upload large files or stream data to HDFS.
Storage: Data is stored in large blocks (64 MB–128 MB or more) for parallel processing.
Replication: Each block is copied to multiple nodes (default: 3) for redundancy.
=> ensures the data remains safe if a node fails
Replication Process: First node replicates to a second, which replicates to a third.
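That pipeline can be sketched with each DataNode modelled as a dict. The function name and data layout are illustrative only, not HDFS's write path:

```python
def replicate_pipeline(block_id, datanodes, replication=3):
    """Chain replication sketch: the first node stores the block and
    forwards it to the second, which forwards it to the third."""
    pipeline = []
    for node in datanodes[:replication]:  # first `replication` targets
        node["blocks"].add(block_id)      # store the block locally
        pipeline.append(node["name"])     # then forward downstream
    return pipeline
```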
3 - MapReduce workloads: split into two phases (map and reduce)
1 Input Setup:
Data is loaded into HDFS (bulk or streaming).
HDFS splits the data into large blocks (e.g., 128MB), replicates them across nodes, and the NameNode tracks block locations.
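The block-splitting step can be illustrated with a short sketch; the 128 MB default matches the text, and `split_into_blocks` is a hypothetical helper:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs describing each HDFS block;
    only the final block may be smaller than block_size."""
    blocks, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks
```

A 300 MB file, for instance, yields two full 128 MB blocks plus one 44 MB tail block.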
2 Job Setup:
A MapReduce job is submitted to the JobTracker, containing:
Input file path in HDFS.
Output file path in HDFS.
Classes for map and reduce functions.
Driver code for the job.
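The job description above might be modelled as a plain record. All paths and class names here are made up for illustration; real Hadoop jobs configure this through a Java driver and the Job/JobConf API:

```python
# Hypothetical job submission payload mirroring the fields listed above.
job = {
    "input_path": "/user/demo/input",    # where the input lives in HDFS
    "output_path": "/user/demo/output",  # where reducers will write
    "mapper": "WordCountMapper",         # class holding the map function
    "reducer": "WordCountReducer",       # class holding the reduce function
}
```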
3 Job Initialization:
JobTracker schedules map and reduce tasks on nodes in the cluster.
It interacts with the NameNode to locate the input data blocks.
4 - Map phase
Mappers process chunks of data (HDFS blocks), generating intermediate key-value pairs.
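As a concrete (if classic) example, a word-count mapper emits a (word, 1) pair for every word it sees. This is a plain-Python sketch, not the Hadoop Mapper API:

```python
def word_count_map(offset, line):
    """Map: for each word in one line of a block, emit (word, 1).
    The byte offset plays the role of the input key and is ignored."""
    for word in line.split():
        yield word.lower(), 1
```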
5 - Sort phase
Mappers sort intermediate data by key and partition it (e.g., using hash functions) for reducers.
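Hash partitioning can be sketched like this; a deterministic byte sum stands in for Java's `key.hashCode()`:

```python
def partition(key, num_reducers):
    """Assign a key to a reducer. The same key always maps to the
    same reducer, so all of its values meet in one place."""
    # Deterministic stand-in for Java's key.hashCode() % numReducers.
    return sum(key.encode("utf-8")) % num_reducers
```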
6 - Shuffle phase
System ensures that sorted intermediate data reaches the appropriate reducer.
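One way to picture the shuffle, assuming mapper output arrives as a flat list of (key, value) pairs and some partition function is supplied; all names here are hypothetical:

```python
from collections import defaultdict

def shuffle(mapper_output, num_reducers, partition):
    """Route every (key, value) pair to its reducer and group the
    values per key, sorted by key, ready for the reduce phase."""
    per_reducer = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapper_output:
        per_reducer[partition(key, num_reducers)][key].append(value)
    return [dict(sorted(groups.items())) for groups in per_reducer]
```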
7 - Reduce phase
Reducers merge and process sorted key-value pairs to generate the final result.
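Continuing the word-count sketch, the matching reducer just sums all counts gathered for one key:

```python
def word_count_reduce(key, values):
    """Reduce: combine every count emitted for one word into a total."""
    return key, sum(values)
```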
8 - Result storage
Reducers write results to HDFS, triggering replication for redundancy.
9 - Result extraction
Clients read the results from the output files in HDFS.
4 - Fault tolerance
If a task fails:
TaskTracker detects and reports to JobTracker.
JobTracker reschedules the task.
If a DataNode fails:
NameNode and JobTracker detect it.
Tasks on the failed node are rescheduled, and the NameNode replicates data to another node.
If a NameNode or JobTracker fails:
The entire cluster becomes unavailable (a single point of failure in this classic Hadoop 1 architecture).
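The task-level recovery path can be sketched as a retry loop. The limit and exception type are illustrative; Hadoop 1 retried a failed task a small fixed number of times (four by default) before failing the job:

```python
def run_with_retries(task, max_attempts=4):
    """JobTracker-style rescheduling sketch: rerun a failed task
    attempt until it succeeds or the attempt limit is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except RuntimeError:
            continue  # failure reported; schedule another attempt
    raise RuntimeError(f"task failed after {max_attempts} attempts")
```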
5 - Reading / writing files
Hadoop uses Input Formats to read data.
Input Formats parse data (e.g., text or sequence files) into records for the mappers to process.
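A TextInputFormat-style reader can be sketched as turning a block's raw bytes into (byte offset, line) records, the (key, value) shape a text mapper receives. This is a simplified stand-in for Hadoop's record reader, not its API:

```python
def read_text_records(block_bytes):
    """Parse raw block bytes into (offset, line) records.
    Assumes ASCII text, so character offsets equal byte offsets."""
    records, offset = [], 0
    for line in block_bytes.decode("utf-8").splitlines(keepends=True):
        records.append((offset, line.rstrip("\n")))
        offset += len(line)
    return records
```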
Part 1