Chapter 7: Hadoop and Databases
1 - relational databases
best used for:
interactive OLAP analytics
multistep ACID transactions
100% SQL compliance
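As a hedged illustration of what a multistep ACID transaction looks like in practice, the JDBC sketch below groups two updates into a single atomic unit; the connection URL, table, and column names are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferFunds {
    public static void main(String[] args) throws SQLException {
        // Hypothetical JDBC URL and schema; any relational database would do.
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/bank", "user", "pass")) {
            con.setAutoCommit(false);                     // start a multistep transaction
            try (PreparedStatement debit = con.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = con.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setInt(1, 100);  debit.setInt(2, 1);  debit.executeUpdate();
                credit.setInt(1, 100); credit.setInt(2, 2); credit.executeUpdate();
                con.commit();                             // both steps succeed or neither does
            } catch (SQLException e) {
                con.rollback();                           // ACID: a failure undoes every step
                throw e;
            }
        }
    }
}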
3 - typical datacenter architecture
Traditional:
Enterprise Website: Collects data.
Interactive Database: Stores data from the website.
Data Export: Sends data to OLAP systems.
OLAP System: Processes data for analysis.
Business Intelligence Apps: Analyze and report insights.
External Systems: Integrate with Oracle, SAP, etc.
Hadoop-Integrated Data Architecture:
Enterprise Website: Collects data, stored in an interactive database.
New Data Flow: Redirects data to Hadoop for advanced processing.
Hadoop Framework: Handles large-scale data with dynamic OLAP queries.
Business Intelligence Apps: Use processed Hadoop data for insights.
Integration: Combines modern Hadoop with traditional systems (Oracle, SAP).
2 - Hadoop
best used for:
structured or unstructured data (flexible)
scalability of storage and compute
complex data processing
4 - the key benefits
Schema-on-Write (RDBMS):
Predefined Schema: Schema must be created before data is loaded.
Explicit Load Operation: Data must be transformed into the database's internal structure before storing.
Adding Columns: New columns must be added explicitly before new data can be loaded into the database.
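A minimal sketch of schema-on-write over JDBC, with a hypothetical connection URL and table: the schema must exist before any data is loaded, and a new column has to be added explicitly before rows containing it can be inserted.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class SchemaOnWrite {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/dw", "user", "pass");
             Statement st = con.createStatement()) {
            // Predefined schema: the table must be created before data is loaded.
            st.executeUpdate("CREATE TABLE clicks (user_id INT, url VARCHAR(2048))");
            // Explicit load operation: data is transformed into the table's structure at insert time.
            st.executeUpdate("INSERT INTO clicks VALUES (1, 'http://example.com')");
            // Adding columns: the column must exist before data with that field can be loaded.
            st.executeUpdate("ALTER TABLE clicks ADD COLUMN referrer VARCHAR(2048)");
        }
    }
}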
Schema-on-Read (Hadoop):
Data Copying: Data is copied into the file store without needing transformation.
Late Binding: A Serializer/Deserializer (SerDe) is applied at read time to extract relevant data.
Flexibility: New data flows into the system immediately and will appear retroactively once the SerDe is updated.
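Schema-on-read can be sketched in plain Java: raw lines are copied into the file store unchanged, and a parse step (playing the role of a SerDe) is applied only when the data is read. The file path and tab-delimited layout below are assumptions.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SchemaOnRead {
    // Late binding: interpret a raw line only at read time, as a SerDe would.
    static String[] deserialize(String rawLine) {
        return rawLine.split("\t");                     // assumed tab-delimited layout
    }

    public static void main(String[] args) throws IOException {
        // The data was copied into the store as-is; no transformation happened at load time.
        for (String line : Files.readAllLines(Paths.get("/data/clicks/part-00000"))) {
            String[] fields = deserialize(line);
            System.out.println("user=" + fields[0] + " url=" + fields[1]);
        }
    }
}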
5 - flexibility in data processing
Java MapReduce: High flexibility and performance, but requires complex and time-consuming development (the low-level "assembly language" of Hadoop); see the word-count sketch after this list.
Streaming MapReduce (Pipes): Allows development in any language, but offers slightly lower performance and flexibility than native Java MapReduce.
Crunch: Java library for multi-stage MapReduce pipelines, inspired by Google's FlumeJava.
Pig Latin: High-level language from Yahoo, designed for batch data-flow workloads.
Hive: SQL-like interface from Facebook, including a metastore and SerDes for mapping files to schemas.
Oozie: XML-based workflow engine for managing sequences of jobs built with any of the above technologies.
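As a concrete example of the low-level Java MapReduce option, here is a sketch of the classic word-count job; the class names and the input/output paths passed on the command line are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());           // reuse the same Writable instance
                context.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            total.set(sum);
            context.write(word, total);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}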
Chapter 8: The Hadoop Implementation
1 - Job execution
Job Submission: A user submits a job (query) via the Hadoop Client (see the submission sketch after the key terms below).
Master Node (JobTracker): The JobTracker in the Master Node receives the job and divides it into tasks.
It schedules the tasks and assigns them to Slave Nodes.
Task Execution on Slave Nodes: Slave Nodes execute the assigned tasks.
Each Slave Node runs a TaskTracker that handles individual tasks (Child Tasks).
Data Processing: The DataNodes on the Slave Nodes store and retrieve the data for processing.
Job Completion: Once the tasks are completed, the results are sent back to the Hadoop Client.
After each Map-Shuffle-Reduce cycle, the results are written back to HDFS.
Key terms:
Slave nodes execute tasks and store data.
Master nodes manage the cluster and coordinate tasks.
A Hadoop cluster is a system of master and slave nodes working together to process large datasets.
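From the client's side, the whole flow above is triggered by a single driver program; a minimal sketch, assuming the input already sits in HDFS and using the framework's default (identity) mapper and reducer so only the execution flow is visible:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up the cluster's site configuration
        Job job = Job.getInstance(conf, "pass-through");
        job.setJarByClass(SubmitJob.class);
        // No mapper or reducer is set: the defaults copy records through unchanged,
        // which is enough to watch submission, scheduling and completion end to end.
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input already stored in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // results written back to HDFS
        // waitForCompletion submits the job to the master node, which splits it into tasks
        // for the slave nodes; 'true' prints task progress while they run.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}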
2 - Hadoop data types
Java Objects: In Java, some objects can’t be changed (immutable), so every time you need a new value, a new object is created. This can slow down performance.
Hadoop Writable Types: To make things faster, Hadoop uses special types called Writable that can be changed without creating new objects. This reduces the overhead.
Retaining Values: If you need to keep values across tasks, you have to copy or clone them in Hadoop.
Writable Interface: All of Hadoop's mutable box types implement the Writable interface, which defines how values are serialized and deserialized; because the objects are mutable, instances can be reused instead of reallocated.
Examples of Data Types:
int/Integer → IntWritable
long/Long → LongWritable
String → Text
boolean → BooleanWritable
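A small sketch of the reuse-and-copy behaviour described above, using only the Writable box classes themselves:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        // Mutable: the same object is reused, so no new allocation per value.
        IntWritable count = new IntWritable();
        count.set(1);
        count.set(2);                                    // value changed in place

        // Retaining values: the framework reuses its Writable instances between calls,
        // so make a copy if a value must outlive the current map()/reduce() invocation.
        Text word = new Text("hadoop");
        Text kept = new Text(word);                      // defensive copy
        word.set("overwritten");
        System.out.println(kept + " / " + word);         // prints: hadoop / overwritten
    }
}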
Job configuration:
The Job and Configuration APIs are used to define and manage a job's settings, for example:
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
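Continuing the fragment above, the Configuration object can also carry custom settings that tasks read back at run time; the property name below is made up for illustration.

Configuration conf = new Configuration();
conf.setInt("myapp.min.count", 5);                 // hypothetical custom property
Job job = Job.getInstance(conf, "configured job");
// Inside a Mapper or Reducer, the same value is read back from the task's context:
//   int min = context.getConfiguration().getInt("myapp.min.count", 1);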
Input and Output Formats:
Input and Output: Hadoop needs a way to read and write data, but data is stored as raw byte arrays rather than as structured objects.
InputFormat: Decides which part of the data each map task will work on.
OutputFormat: Decides where the result of each reduce task will go.
RecordReader & RecordWriter: Help in converting data between raw byte format and Hadoop’s objects.
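Continuing the driver fragment from the job configuration example, a sketch of how the formats are wired up; TextInputFormat/TextOutputFormat are the usual choices for line-oriented text, and the paths are placeholders.

// TextInputFormat: its RecordReader turns each line of raw bytes into a
// (LongWritable offset, Text line) pair handed to a map task.
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path("/data/input"));      // placeholder input path

// TextOutputFormat: its RecordWriter turns each reduce output pair back into
// a tab-separated line of bytes in the output directory.
job.setOutputFormatClass(TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/data/output"));   // placeholder output path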