MapReduce
Prepared by Ayesha Khan
Combiners
Optional optimization step after map phase.
Performs local aggregation before shuffling.
Reduces data transferred across the network.
Suitable for associative and commutative operations.
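As a minimal sketch of the combiner idea above, the following plain-Python example (no Hadoop involved; `map_words` and `combine` are hypothetical names chosen for illustration) counts how many pairs one map task would send to the shuffle with and without local aggregation.

```python
from collections import Counter

def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: sum the counts per key locally, before anything is shuffled."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Assumed sample input: the lines handled by a single map task.
lines = ["to be or not to be", "to do or not to do"]
mapped = [pair for line in lines for pair in map_words(line)]
combined = combine(mapped)

print(len(mapped), "pairs without a combiner")   # 12 pairs
print(len(combined), "pairs with a combiner")    # 5 pairs
print(combined)
```

Summation works here because it is associative and commutative. An average cannot be combined directly; a common workaround is to have the combiner emit (sum, count) pairs and let the reducer do the final division.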
Features
Scalability: Scales to petabytes of data by adding nodes to the cluster.
Fault Tolerance: Retries failed tasks, replicates data.
Data Locality: Moves code to data, not vice versa.
Simplicity: Abstracts complex parallelism.
Flexibility: Supports unstructured & structured data.
Cost-Effective: Runs on commodity hardware.
Applications of MapReduce
Log file analysis (e.g., web logs)
Indexing and crawling in search engines
Data transformation (ETL)
Large-scale machine learning (e.g., classification)
Text mining and NLP
Graph processing (e.g., PageRank)
Social network analytics
Execution Model
JobClient submits job.
ResourceManager (YARN) allocates containers for the job's tasks.
Map Tasks run on data-local nodes where possible.
Intermediate Data stored temporarily on the map nodes' local disks.
Reduce Tasks write output to HDFS.
Core Phases
Map Phase
Processes each input split record by record.
Emits intermediate (key, value) pairs.
Runs in parallel across splits.
Shuffle & Sort (Grouping by Key)
Groups intermediate values by key.
Ensures all values with same key go to the same reducer.
System handles sorting automatically.
Reduce Phase
Processes grouped keys.
Applies aggregation logic (e.g., sum, average).
Outputs final result (key, value).
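The three core phases can be sketched in memory with a word count. This is only an illustration in plain Python, not Hadoop code; the function names and the sample records are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map: turn each input record into intermediate (key, value) pairs."""
    for record in records:
        for word in record.lower().split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    """Shuffle & sort: order pairs by key, then group the values of each key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    """Reduce: apply the aggregation (here a sum) to each key's values."""
    for key, values in grouped:
        yield (key, sum(values))

records = ["the quick fox", "the lazy dog", "the fox"]
result = list(reduce_phase(shuffle_and_sort(map_phase(records))))
print(result)
# [('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```

Only `map_phase` and `reduce_phase` correspond to user-written logic; the grouping in the middle is the part the framework performs automatically.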
Algorithms Using MapReduce
Matrix-Vector Multiplication
Mappers compute partial products.
Reducers sum them to form the result vector.
Selection (σ condition)
Mappers filter records by condition.
Projection (π columns)
Mappers output selected columns.
Union
Mappers emit all tuples; reducer removes duplicates.
Intersection
Reducer keeps tuples that appear in both datasets.
Difference
Reducer keeps tuples that are in R but not in S.
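The algorithms listed above can be sketched with a tiny in-memory driver. Everything here is an assumption made for illustration: `run_mapreduce` is a hypothetical helper, the matrix is stored as (row, col, value) entries, the vector is assumed small enough for every mapper to hold, and the relations R and S are made-up sample data. Selection would normally be a map-only job; it goes through a pass-through reducer here only to reuse the same driver, and the projection reducer also removes duplicate values.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory driver: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

def emit_key_once(key, values):
    """Reducer that outputs each distinct key exactly once (removes duplicates)."""
    return [key]

# Matrix-vector multiplication: M is a list of (row, col, value) entries.
v = [1, 2]
M = [(0, 0, 2), (0, 1, 3), (1, 0, 4), (1, 1, 5)]

def partial_product(entry):
    row, col, value = entry
    return [(row, value * v[col])]      # mapper: m_ij * v_j, keyed by row i

def sum_row(row, parts):
    return [(row, sum(parts))]          # reducer: x_i = sum over j of m_ij * v_j

product = run_mapreduce(M, partial_product, sum_row)

# Relational operators on two small relations with schema (name, dept).
R = [("ann", "eng"), ("bob", "ops"), ("cy", "eng")]
S = [("bob", "ops"), ("dee", "eng")]

def select_eng(t):
    return [(t, None)] if t[1] == "eng" else []   # keep tuples passing the condition

def project_dept(t):
    return [(t[1], None)]                          # keep only the dept column

selection = run_mapreduce(R, select_eng, emit_key_once)
projection = run_mapreduce(R, project_dept, emit_key_once)

# Set operators: tag each tuple with its source relation and key on the tuple.
tagged = [("R", t) for t in R] + [("S", t) for t in S]

def tag_mapper(record):
    source, t = record
    return [(t, source)]

def in_both(t, sources):
    return [t] if {"R", "S"} <= set(sources) else []

def only_in_r(t, sources):
    return [t] if "S" not in sources else []

union = run_mapreduce(tagged, tag_mapper, emit_key_once)
intersection = run_mapreduce(tagged, tag_mapper, in_both)
difference = run_mapreduce(tagged, tag_mapper, only_in_r)

print(product)        # [(0, 8), (1, 14)]
print(intersection)   # [('bob', 'ops')]
print(difference)     # [('ann', 'eng'), ('cy', 'eng')]
```

Keying on the whole tuple is what lets the shuffle bring identical tuples from R and S to the same reducer, so each set operator reduces to a check on the list of source tags.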
Workflow
Input Splitting
Dataset split into chunks.
Map Phase
Each split is processed to emit key-value pairs.
Shuffle & Sort
Grouping and redistributing intermediate data.
Reduce Phase
Aggregates values per key.
Output
Final data written to HDFS.
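The workflow steps can be traced end to end in a short simulation. This is a sketch under stated assumptions: the split size, the number of reducers, and the CRC32-based `partition` function are illustrative choices, the log lines are invented, and a dictionary of named parts stands in for files written to HDFS.

```python
import zlib
from collections import defaultdict

def split_input(records, split_size):
    """Input splitting: cut the dataset into fixed-size chunks."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]

def partition(key, num_reducers):
    """Pick the reducer for a key; a stable hash keeps runs repeatable."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_job(records, split_size=2, num_reducers=2):
    # Map phase: each split independently emits (token, 1) pairs.
    map_outputs = [
        [(token, 1) for line in split for token in line.split()]
        for split in split_input(records, split_size)
    ]
    # Shuffle & sort: route every pair to its reducer and group values by key.
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            reducer_inputs[partition(key, num_reducers)][key].append(value)
    # Reduce phase and output: one sorted part per reducer (stand-in for HDFS files).
    return {
        f"part-r-{i:05d}": [(k, sum(vs)) for k, vs in sorted(groups.items())]
        for i, groups in enumerate(reducer_inputs)
    }

logs = ["GET /home", "GET /cart", "POST /cart", "GET /home", "GET /home"]
output = run_job(logs)
for name, rows in output.items():
    print(name, rows)
```

Because the partition depends only on the key, every occurrence of a key lands in the same part no matter which split produced it. Which part a given key ends up in depends on the hash, so the exact layout of the printed parts is not shown here.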
Limitations
High Latency: Not suitable for real-time or interactive processing.
Complex Functions: Custom logic can be hard to write.
Inefficient for Iterative Tasks (e.g., deep learning): each iteration runs as a separate job that rereads its input from disk.
Disk I/O Overhead: Intermediate results written to disk.