Hadoop HDFS and MapReduce
By Muhammad Ruzain Rar 221226
Hadoop Overview
Open-source Apache framework for Big Data
Handles both storage and processing of huge datasets
Key components: HDFS (storage), MapReduce (processing), YARN (resource management)
HDFS (Hadoop Distributed File System)
Splits large files into blocks stored across multiple machines
NameNode: manages the filesystem namespace & block metadata
DataNodes: store the actual data blocks
Block size: 128 MB by default (large chunks keep seek overhead low)
Replication: 3 copies by default, for fault tolerance
Write Once, Read Many access model (see the API sketch below)
Benefits: fault-tolerant, scalable, high throughput
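A minimal sketch of the write-once/read-many flow using Hadoop's Java FileSystem API. The hdfs://localhost:9000 address and the /user/demo/hello.txt path are assumptions for a single-node setup, not part of the original notes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceReadMany {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS normally comes from core-site.xml; this single-node
    // address is an assumption for illustration
    conf.set("fs.defaultFS", "hdfs://localhost:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt"); // hypothetical path

    // Write once: the NameNode records the metadata, DataNodes store
    // the blocks (replicated 3x by default)
    try (FSDataOutputStream out = fs.create(file)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read many: stream the blocks back from the DataNodes
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      in.lines().forEach(System.out::println);
    }
    fs.close();
  }
}
```

The same two operations are available from the shell as hdfs dfs -put and hdfs dfs -cat.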
MapReduce
Batch processing model for large-scale data
Map: transforms each input record into (key, value) pairs
Shuffle & Sort: groups the mapped output by key across the cluster
Reduce: aggregates the grouped values for each key
Example: Word Count → count occurrences of each word (see the sketch after this list)
Benefits: scalable, parallel processing
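Word Count is the canonical MapReduce job; below is a sketch against the current org.apache.hadoop.mapreduce API, following the shape of the standard Apache tutorial example. Input and output HDFS paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in the input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: after shuffle & sort, sum the counts for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Registering the reducer as a combiner pre-aggregates counts on each mapper's output, which shrinks the data moved during shuffle & sort.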
Real-life Applications
Facebook: Log analysis
Amazon: Product recommendation
LinkedIn: Skill matching engine
Banking: Fraud detection
Telecom: Network traffic analysis
Use Cases
Word count
Log parsing (see the mapper sketch after this list)
Clickstream analysis
Data transformation & ETL jobs
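For the log-parsing use case, a hypothetical mapper that emits (statusCode, 1) per access-log line; paired with the IntSumReducer from the Word Count sketch above, it counts requests per HTTP status. The space-separated Combined Log Format layout is an assumption about the input.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical log-parsing mapper: extract the HTTP status code field
public class StatusCodeMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private final Text status = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assumed line shape:
    // 127.0.0.1 - - [10/Oct/2025:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
    String[] fields = value.toString().split(" ");
    if (fields.length > 8) {
      status.set(fields[8]);        // HTTP status code field
      context.write(status, one);   // shuffle groups all lines by status code
    }
  }
}
```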
Advantages
Handles petabytes of data
Cost-effective (commodity hardware)
High availability & fault tolerance
Horizontally scalable
Limitations
High latency; batch-oriented, not real-time
Inefficient for many small files and short jobs (startup overhead)
Debugging distributed jobs is complex
Slower than newer in-memory engines (e.g. Spark)
Future Trends
Shift to Apache Spark (faster in-memory processing)
Cloud-based Hadoop (AWS EMR, Azure HDInsight)
Integration with AI/ML pipelines
Real-time hybrid models (Lambda Architecture)
Secure & privacy-aware Big Data platforms (GDPR)