Hadoop & Hadoop Infrastructure
definition
Hadoop is a widely used technology for Big Data.
It’s an open-source framework for processing large datasets across many computers.
It uses a simple programming model called MapReduce.
It has won recognition, such as the Terabyte Sort Benchmark.
Hadoop is efficient; Yahoo sorted 1 terabyte of data in 209 seconds using it.
why
Extracts value from all types of data (structured/unstructured).
Scales with your business to petabyte levels.
Affordable storage using commodity hardware (low cost).
Open-source, no vendor lock-in (not tied to a specific provider), rich ecosystem.
Drives revenue and controls costs effectively.
8 common Hadoop-able problems (problems that can be solved with Hadoop)
Modeling true risk
: Aggregates and analyzes diverse data to assess risk (e.g., sentiment analysis, graphs).
Customer churn analysis
: Builds behavioral models to predict customer loss.
Recommendation engine
: Predicts user preferences using collaborative filtering.
POS transaction analysis
: Optimizes promotions and operations from Point of Sale data.
Network failure prediction
: Analyzes sensor data to detect and prevent failures.
Fraud detection
: Identifies anomalies and threats with pattern recognition.
Search quality analysis
: Improves real-time search results using pattern recognition.
Data sandbox
: Stores and explores unstructured data for insights.
infrastructure
distributed system
data model
distributed databases:
deal with tables and relations
must have a schema for the data
data fragmentation & partitioning
Hadoop:
deals with flat files in any format
no schema required for the data
files are divided automatically into blocks
computing model
distributed databases:
database transaction: a series of operations ensuring data consistency
ACID:
atomicity: all operations complete or none do
consistency: data remains accurate and valid
isolation: transactions run independently
durability: changes persist even after failures
steps: Initial State → Operations (INSERT, UPDATE, DELETE) → Commit (finalize) or Rollback (revert).
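A minimal sketch of these steps in Python using the standard sqlite3 module (the accounts table and the transfer are illustrative, not from the source):

```python
import sqlite3

# Illustrative in-memory database with a toy accounts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 100.0), (2, 50.0)")
conn.commit()  # initial state finalized

try:
    # Operations: a transfer that must succeed or fail as a whole (atomicity).
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    conn.commit()    # commit: both updates persist together (durability)
except sqlite3.Error:
    conn.rollback()  # rollback: neither update is applied
```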
Hadoop
MapReduce Computing Model: used for processing large data sets with a distributed algorithm on a cluster.
Division of Job into Tasks
Each task is either a map or reduce operation.
framework
2 main layers
HDFS
(Hadoop Distributed File System): a virtual file system that stores large datasets by splitting files into many small pieces (blocks).
Each block is replicated and stored on 3 servers (datanodes) for fault tolerance.
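A toy Python sketch of the split-and-replicate idea (the block size, datanode names, and round-robin placement policy are invented for illustration; real HDFS uses 64/128 MB blocks and rack-aware placement):

```python
import itertools

BLOCK_SIZE = 64                                  # bytes, for illustration only
DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]  # hypothetical datanode names
REPLICATION = 3                                  # HDFS default replication factor

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a byte string into fixed-size blocks, as HDFS splits files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes=DATANODES, replication=REPLICATION):
    """Assign each block to `replication` distinct datanodes (toy policy)."""
    ring = itertools.cycle(nodes)
    return {i: [next(ring) for _ in range(replication)] for i in range(len(blocks))}

blocks = split_into_blocks(b"x" * 200)  # a 200-byte "file" -> 4 blocks
print(place_replicas(blocks))           # e.g. {0: ['dn1', 'dn2', 'dn3'], ...}
```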
execution engine
(MapReduce): a programming model and processing engine used to process large datasets in parallel across a distributed cluster.
Map: divides the input data into smaller chunks and processes them in parallel, generating key-value pairs.
Reduce: Collects and combines the intermediate results from the Map phase based on keys, producing the final output.
Allows scalable use of CPU power.
The diagram shows a master-slave architecture with a JobTracker and TaskTrackers.
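A minimal in-memory sketch of the map → shuffle/sort → reduce flow (plain Python, not Hadoop code; the sales records and both lambdas are illustrative):

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_and_sort(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups, reducer):
    """Apply the reducer to each (key, [values]) group to produce the output."""
    return [reducer(key, values) for key, values in groups]

# Illustrative job: total quantity per product from (product, qty) records.
records = [("apples", 3), ("pears", 2), ("apples", 5)]
mapper = lambda rec: [rec]                        # emit the record as (key, value)
reducer = lambda key, values: (key, sum(values))  # sum all values for a key
print(reduce_phase(shuffle_and_sort(map_phase(records, mapper)), reducer))
# -> [('apples', 8), ('pears', 2)]
```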
ex: join two datasets (a toy sketch follows the steps below):
Storage: the datasets are stored on HDFS, guaranteeing distributed, reliable storage.
Map phase: the datasets are split into chunks; the mappers process these chunks in parallel and emit key-value pairs (e.g., keyed on an ID common to both datasets).
Shuffle and sort: the key-value pairs are sorted and grouped by key (the keys shared by the two datasets).
Reduce phase: the reducers take the sorted pairs, join the records that share the same key (for example, matching a user to their department), and produce the final result.
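A toy in-memory sketch of this reduce-side join (the user/department records and the source tags are invented for illustration):

```python
from collections import defaultdict

users = [(1, "Alice"), (2, "Bob")]              # (user_id, name)
depts = [(1, "Engineering"), (2, "Marketing")]  # (user_id, department)

# Map phase: tag each record with its source so the reducer can tell the
# two datasets apart, and key everything by the shared user_id.
pairs = [(uid, ("user", name)) for uid, name in users]
pairs += [(uid, ("dept", dept)) for uid, dept in depts]

# Shuffle and sort: group the tagged values by join key.
groups = defaultdict(list)
for key, tagged in pairs:
    groups[key].append(tagged)

# Reduce phase: within each key, combine every user record with every
# department record that shares it.
for key in sorted(groups):
    names = [v for tag, v in groups[key] if tag == "user"]
    departments = [v for tag, v in groups[key] if tag == "dept"]
    for name in names:
        for dept in departments:
            print(key, name, dept)  # -> 1 Alice Engineering / 2 Bob Marketing
```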
ex: word count (a sketch follows below):
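A minimal sketch of word count as two Hadoop Streaming scripts in Python (the file names mapper.py and reducer.py are illustrative):

```python
#!/usr/bin/env python3
# mapper.py -- emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sum the counts per word; Hadoop delivers input sorted by key,
# so all lines for a given word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The pipeline can be tested locally with `cat input.txt | python3 mapper.py | sort | python3 reducer.py`, where `sort` plays the role of Hadoop's shuffle-and-sort step.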