Hadoop Ecosystem
A Coggle diagram about managers / task planners
CDH (Cloudera Distribution Including Apache Hadoop)
- Cloudera Manager with all the popular tools
- installs and monitors all needed packages
- tries to get new features first
- sells its own solutions, not only consulting
- a lot of optimizations
- partner program with Amazon
- the M3 edition has cut-down functionality
- manual setup of the environment
- manual installation of packages
- fixing configuration files by hand
HDP (Hortonworks Data Platform): one general solution
- invests in existing Apache products instead of developing its own tools
- HDP looks more stable than CDH
- everything that can be processed in parallel is processed in parallel
- saves intermediate results to disk
Uses the idea of data locality, but does most calculations in memory instead of on disk.
Resilient Distributed Dataset (RDD)
Spark has interfaces for Scala, Java, and Python.
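The in-memory idea above can be sketched with a toy RDD-like class. This is an illustration only, not Spark's actual API (the class name `RddLike` is invented): transformations chain in memory, and nothing is written to disk between steps.

```python
# Toy sketch of Spark's in-memory pipeline idea (hypothetical RddLike
# class, not the real Spark API): transformations build new in-memory
# datasets instead of saving intermediate results to disk.

class RddLike:
    def __init__(self, data):
        self._data = list(data)

    def map(self, fn):
        # Transformation: a new in-memory dataset, no disk write.
        return RddLike(fn(x) for x in self._data)

    def filter(self, pred):
        return RddLike(x for x in self._data if pred(x))

    def collect(self):
        # Action: materialize the final result.
        return self._data

result = (RddLike(range(10))
          .map(lambda x: x * x)
          .filter(lambda x: x % 2 == 0)
          .collect())
# result == [0, 4, 16, 36, 64]
```

In real Spark the transformations are also lazy and distributed across a cluster; this sketch only shows the "keep intermediate results in memory" part.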
An alternative engine from Hortonworks (Apache Tez)
Main principle: directed acyclic graph (DAG) of tasks
Used mainly in Hive so far.
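The DAG principle can be illustrated with a tiny topological-order executor (a plain-Python sketch, not Tez's API; the stage names are invented): each stage runs only after everything it depends on has finished.

```python
# Toy DAG scheduling sketch (illustrative, not Apache Tez): tasks are
# executed in topological order, so a stage starts only after its inputs
# are ready.
from graphlib import TopologicalSorter

# task -> set of tasks it depends on (hypothetical stage names)
dag = {
    "scan_a": set(),
    "scan_b": set(),
    "join": {"scan_a", "scan_b"},
    "aggregate": {"join"},
}

order = list(TopologicalSorter(dag).static_order())
# dependencies always come before dependents, e.g. both scans before "join"
```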
SQL tools for analysis of historical records.
Data types
Optimized format for Hive (ORC).
Columnar format optimized for storing complex structures and efficient compression (Parquet). Used by Spark and Impala.
Can send the schema along with the data, or can work with dynamically typed objects (Avro).
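Why a columnar layout compresses well can be shown with a tiny run-length example (illustrative only, not the actual Parquet/ORC on-disk format): grouping one column's values together exposes long runs of repeated values.

```python
# Sketch of the columnar-compression idea (not the real Parquet/ORC
# encoding): storing a column's values contiguously exposes runs that
# run-length encoding shrinks.
from itertools import groupby

rows = [("US", 1), ("US", 2), ("US", 3), ("DE", 4), ("DE", 5)]

# Column-wise layout: one list per column.
country_col = [r[0] for r in rows]   # ['US', 'US', 'US', 'DE', 'DE']

def rle(values):
    # Run-length encode: (value, run_length) pairs.
    return [(v, len(list(g))) for v, g in groupby(values)]

encoded = rle(country_col)
# encoded == [('US', 3), ('DE', 2)] -- 2 entries instead of 5
```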
HDFS (Hadoop Distributed File System)
A distributed file system designed for very large files and streaming access.
Advanced analytics
- Collaborative filtering
- Clustering algorithms
- Random forest
So far it uses the MapReduce engine, but this is going to be changed to the Spark engine.
- Basic statistics
- Linear and logistic regression
Has a Python interface (uses NumPy).
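For the regression item above, a minimal plain-Python sketch of simple linear regression (closed-form least squares) shows the underlying formula; the library versions are distributed implementations of the same idea.

```python
# Minimal simple linear regression sketch (closed-form least squares,
# plain Python): slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).

def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # data lies on y = 2x + 1
# slope == 2.0, intercept == 1.0
```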
NoSQL: HBase
Allows working with individual records in real time.
New records are added to a sorted in-memory structure, and only when it reaches a size limit is it flushed to disk.
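That write path can be sketched as a toy memtable (an illustration of the log-structured idea, not HBase's actual API or classes): writes accumulate in memory and are flushed to disk as an immutable sorted run when a size limit is reached.

```python
# Toy sketch of the sorted-in-memory-then-flush write path (illustrative,
# not HBase's implementation): when the in-memory table reaches a size
# limit, its sorted contents are written out as one immutable run.

class TinyMemtable:
    def __init__(self, limit=3):
        self.limit = limit
        self.mem = {}          # in-memory writes
        self.disk_runs = []    # flushed, immutable sorted runs ("disk")

    def put(self, key, value):
        self.mem[key] = value
        if len(self.mem) >= self.limit:
            # Flush: persist the sorted contents, then start fresh.
            self.disk_runs.append(sorted(self.mem.items()))
            self.mem = {}

t = TinyMemtable(limit=3)
for k in ["c", "a", "b", "e", "d"]:
    t.put(k, k.upper())
# one flush happened: [('a','A'), ('b','B'), ('c','C')]; 'e','d' still in memory
```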
Import: Apache Kafka
Writes messages to disk immediately and keeps the data for a configured number of days. Easily scalable.
- Kafka does not lie about reliability
- consumer groups do not work as expected (all messages will be given to all consumers)
- the server does not save offsets for consumers
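The last point (consumers track their own offsets) can be sketched with a toy append-only log (illustrative, not Kafka's protocol or client API): the log is stateless about who has read what, so each consumer stores and advances its own offset.

```python
# Toy append-only log sketch (illustrative, not Kafka): the server only
# appends and serves ranges; the read position (offset) lives on the
# consumer side.

class TinyLog:
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def read_from(self, offset):
        # The log does not remember who has read what; consumers
        # pass in their own offset.
        return self.messages[offset:]

log = TinyLog()
for m in ["m0", "m1", "m2"]:
    log.append(m)

consumer_offset = 1                      # stored by the consumer itself
batch = log.read_from(consumer_offset)   # ['m1', 'm2']
consumer_offset += len(batch)            # consumer advances its own offset
```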
and Spark Streaming
Can take data from Kafka, ZeroMQ, sockets, Twitter, etc.
DStream interface: a sequence of small RDDs, each produced for a fixed time interval.
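The micro-batching idea behind DStreams can be sketched by grouping timestamped events into fixed-width windows (a plain-Python illustration, not the Spark Streaming API): each window becomes one small batch, like one small RDD in the stream.

```python
# Micro-batching sketch (illustrative, not the DStream API): events are
# bucketed by fixed time window; each bucket is one small batch.

def micro_batches(events, window):
    # events: (timestamp, value) pairs; window: batch interval in seconds
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // window, []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (5, "e")]
result = micro_batches(events, window=2)
# result == [['a', 'b'], ['c', 'd'], ['e']]
```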