LIL - Learning Hadoop
- Understanding Hadoop Core Components
-
File systems
default: HDFS
-
Modes
-
Pseudo-distributed
Uses the Hadoop file system (HDFS) but is intended for testing. Runs on a single node, i.e. a single machine.
-
-
Files & JVMs
-
-
Fully distributed
-
-
Apache Hadoop Ecosystem
Ambari
Provisioning, managing and maintaining Hadoop clusters
-
Cloudera Hadoop
-
Languages, compilers: Pig/Hive
-
What?
SQL-like query, generating MapReduce code
-
When?
for ETL-like jobs
-
-
Process data
For high volumes, to aggregate
-
Concepts
-
Functions
rich function library
General: AVG, MAX, TOKENIZE
Relational: FILTER, MAPREDUCE
String: UPPER, ...
Math: ROUND
...
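To make the categories concrete, here is a rough plain-Python analogue of what a few of these functions do on a small in-memory data set. It assumes the list above refers to Pig functions and operators; the code is not Pig Latin, and the `records` data and variable names are invented for illustration.

```python
# Hypothetical sample data: one dict per input record.
records = [
    {"name": "ann", "comment": "fast cluster", "latency": 12.4},
    {"name": "bob", "comment": "slow reduce phase", "latency": 48.7},
    {"name": "cy", "comment": "ok", "latency": 30.2},
]

latencies = [r["latency"] for r in records]

average = sum(latencies) / len(latencies)          # roughly what AVG computes
maximum = max(latencies)                           # roughly what MAX computes
tokens = records[1]["comment"].split()             # TOKENIZE: split text into words
slow = [r for r in records if r["latency"] > 25]   # FILTER: keep matching records
names = [r["name"].upper() for r in records]       # string function: upper-casing
rounded = [round(x) for x in latencies]            # ROUND: nearest integer

print(average, maximum, tokens, [r["name"] for r in slow], names, rounded)
```

In Pig these operations are compiled into MapReduce jobs and run over the whole cluster, whereas the sketch only shows the per-record semantics on one machine.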
-
- Why move away from relational databases
-
CAP - Consistency, Availability, Partition tolerance
-
When can we optimize
-
reduce phase
-
-
-
set threshold
kill or suspend long-running jobs, etc.
-
-
- Understanding workflows and connectors
-
-
Flume
-
Use case
A lot of behavioral data comes from log files, so it makes sense to have a library dedicated to handling them.
-
-
-
Zookeeper
-
-
Remark:
It does not make sense to use it in pseudo-distributed mode, only in fully distributed mode.
- Understanding MapReduce 1.0
What is it?
Programming paradigm designed by Google to index all the information on the internet.
-
Map
-
-
output <key, value> pairs on each node
Reduce
-
-
aggregate sets of <key, value> pairs on some nodes
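As a minimal sketch of the two phases, the plain-Python simulation below (no Hadoop involved) counts words across input splits that stand in for nodes. The function names `map_phase`, `shuffle`, and `reduce_phase` and the sample data are made up for illustration.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (key, value) pair for every word in one node's input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]

def shuffle(mapped_outputs):
    """Group values by key so that each key is handled by exactly one reducer."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the list of values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input: each inner list is the data held by one node.
splits = [
    ["hadoop stores data", "hadoop processes data"],
    ["data lives in HDFS"],
]

mapped = [map_phase(split) for split in splits]   # would run on each node
word_counts = reduce_phase(shuffle(mapped))       # would run on some nodes
print(word_counts)
```

The grouping step between the two phases is what lets a reducer see every value for a given key, even though the pairs were produced on different nodes.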
versions
MapReduce 1.0
distributed, scalable, cheap
-
-
How to code it?
steps
1. Create a class
2. Create a static Map class
3. Create a static Reduce class
4. Create a main function
   4.1 Create a job
   4.2 Job calls the Map and Reduce classes
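The layout described by these steps can be sketched as follows. In Hadoop itself the classes are written in Java against the MapReduce API; this is only a plain-Python analogue of the structure, and the class name `MaxTemperature`, the `job` dictionary, and the sample records are hypothetical.

```python
class MaxTemperature:
    """Step 1: the outer class that holds the whole job."""

    class Map:
        """Step 2: turns one input record into (key, value) pairs."""
        @staticmethod
        def map(record):
            year, temperature = record.split(",")
            yield year, int(temperature)

    class Reduce:
        """Step 3: aggregates all values seen for one key."""
        @staticmethod
        def reduce(key, values):
            return key, max(values)

    @staticmethod
    def main(records):
        """Step 4: create a job (4.1) that calls Map and Reduce (4.2)."""
        # Stand-in for a job configuration; not a Hadoop object.
        job = {"mapper": MaxTemperature.Map, "reducer": MaxTemperature.Reduce}
        grouped = {}
        for record in records:
            for key, value in job["mapper"].map(record):
                grouped.setdefault(key, []).append(value)
        return dict(job["reducer"].reduce(k, v) for k, v in grouped.items())


result = MaxTemperature.main(["1999,21", "1999,35", "2000,18"])
print(result)
```

Keeping Map and Reduce as separate nested classes mirrors the idea that the job only wires them together, while the framework decides where and when each one runs.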
- Setting up the Hadoop development environment
-
Libraries
Hive
SQL-like queries, creates batch (MapReduce) jobs
Impala
SQL-like queries, interactive processing
-
Mahout
Machine learning algorithms (clustering, data mining, decision trees, etc.)
Spark
Resilient distributed datasets (RDDs)
-
-
- Understanding MapReduce 2.0 YARN
- Visualizing Hadoop Output with Tools