LIL - Learning Hadoop (Understanding Hadoop Core Components (Apache Hadoop…

- - - - ... and all Hadoop processes run in different JVMs
  - - - Characteristics
        
        Large chunks of data
        
        Triple replication
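
As a rough illustration of the two characteristics above, the sketch below works out how many blocks a file occupies and how much raw disk it uses once every block is stored three times. The 128 MB block size is an assumption (a common HDFS default, but configurable and not stated in these notes), and the function name `hdfs_footprint` is hypothetical.

```python
import math

BLOCK_SIZE_MB = 128      # assumption: a common HDFS default, configurable per cluster
REPLICATION_FACTOR = 3   # from the notes: triple replication


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (number of blocks, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block only occupies the bytes it actually holds,
    # so raw storage is the file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, raw_mb


if __name__ == "__main__":
    blocks, raw_mb = hdfs_footprint(1000)
    print(f"1000 MB file: {blocks} blocks, {raw_mb} MB of raw storage")
```

Under these assumptions a 1000 MB file is split into 8 blocks and takes 3000 MB of raw storage across the cluster.
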
      - Modes
        
        Distributed
        
        3 copies
        
        Pseudo-distributed
        
        Uses the file system but is designed for testing. Implemented on a single node, on a single machine.
    - - Called 'standalone'
    - - Examples
        
        AWS: S3
        
        Azure: BLOB
  - - - Local file system
      - Single JVM
    - - HDFS
      - JVM daemons run all processes on a single machine
    - - HDFS
        
        Triple replication
      - Daemons running in various locations
  - - - Provisioning, managing and maintaining Hadoop clusters
    - - Log collector
    - - Coordination of workflow groups
    - - Workflow
    - - Scripting
    - - Machine learning
    - - Distributed Processing Framework
  - - - Low level of abstraction
        
        Job Designer
        
        Allows you to work with MapReduce
      - High level of abstraction
        
        Pig
        
        Impala
        
        Hive
        
        DB Query
    - - HBase
      - Metastore tables
      - Sqoop
      - Zookeeper
  - - - AWS - Elastic MapReduce
      - Microsoft - HDInsight
  - - - EC2
        
        Run your own servers and install Hadoop on Linux servers that you manage yourself.
      - EMR
        
        Uses EC2 servers also, but management is supported by AWS.
        Data comes from the S3 file system. MapReduce can be run against S3 directly, or you load the data into HDFS first. Then create clusters (n EC2 servers running JVMs with the Hadoop daemons installed) and execute jobs.
        
        WARNING: EMR IS NOT FREE, SO TURN IT OFF AFTER THE FREE TOUR
    - - HDInsight
        
        Similar to EMR in AWS
        
        Note: by default, HDInsight runs on Windows rather than on Linux
        
        Steps: load sensor data into Windows Azure Storage Blob, create Hive tables over the sensor data, run Hive queries, load the results into Excel, and visualize.
- - - - Apache Hadoop
    - - Cloudera
      - Hortonworks
      - MapR
    - - AWS
      - Windows Azure HDInsight
  - - - Scales to petabytes or more
    - - Parallel data processing
    - - Suited for 'Big Data'
  - - - Abnormal transaction on credit card
    - - Netflix
- - - - What is HBase?
        
        Wide column NoSQL database
        
        Uses CREATE TABLE over HDFS data
  - - - example: WordCount is very hard to create
- - - - Transform data
      - Clean data
      - Process data
        
        For high volumes, to aggregate
  - - - Field < Tuple < Bag < Relation (database)
      - FILTER
        
        similar to a WHERE clause
    - - rich function library
        
        General: AVG, MAX, TOKENIZE
        Relational: FILTER, MAPREDUCE
        String: UPPER, ...
        Math: ROUND
        ...
      - UDF
        
        User Defined Functions
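
To make the data model and the FILTER operator more concrete, here is a plain-Python analogy, not Pig Latin itself. A relation is modelled as a list (bag) of tuples, FILTER as a function that keeps matching tuples, and a UDF as an ordinary function applied to one field. The sample records and the names `shout` and `pig_filter` are invented for illustration.

```python
# A relation is a bag of tuples; each tuple is made of fields.
# Sample data is hypothetical.
relation = [
    ("alice", 34),
    ("bob", 17),
    ("carol", 52),
]


def shout(name):
    """Stand-in for a user defined function (UDF) applied to one field."""
    return name.upper()


def pig_filter(bag, predicate):
    """Keep only the tuples matching the predicate, like a WHERE clause."""
    return [row for row in bag if predicate(row)]


adults = pig_filter(relation, lambda row: row[1] >= 18)
result = [(shout(name), age) for name, age in adults]

if __name__ == "__main__":
    print(result)
```
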
- - - - Transactions
    - - Ability to duplicate data
    - - Ability to store the data in separate parts
  - - - Transactional > not a good fit
    - - Batch processed > good fit
  - - - HDFS - Hadoop Distributed File System
    - - key/value
      - wide columnstore
        
        HBase
      - ...
    - - MySQL
      - SQL Server
      - Oracle
- - - - Input file compression
  - - - data pre-processing
    - - change compression ratio
      - break data into chunks
    - - subdivide tasks
      - custom partitioner
      - skip bad records & reduce the amount of data
      - logging & counters
      - spill ratio tuning
        
        when your RAM is full and data spills to the hard disk
      - local reducers
        
        called combiners
      - map-only jobs
        
        e.g. photo processing
    - - subdividing tasks (chaining jobs)
      - logging / debugging
      - secondary sort
        
        tricky: you have a big job doing more than one task, which is usually not recommended
      - set threshold
        
        kill long running jobs, suspend, etc.
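
The "local reducers" (combiners) item in the list above can be illustrated with a small simulation. This is a sketch in plain Python rather than Hadoop code: each mapper node pre-aggregates its own output before anything is sent over the network, which works here because summing counts is associative and commutative. The input lines and all function names are made up for the example.

```python
from collections import Counter


def map_words(line):
    """Map step: emit one (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts per key on the node that produced them."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


# Hypothetical input: one list of lines per mapper node.
node_inputs = [
    ["to be or not to be"],
    ["to do is to be"],
]

without_combiner = [
    [pair for line in lines for pair in map_words(line)]
    for lines in node_inputs
]
with_combiner = [combine(pairs) for pairs in without_combiner]

# Number of pairs that would have to travel to the reducers.
shuffled_without = sum(len(pairs) for pairs in without_combiner)
shuffled_with = sum(len(pairs) for pairs in with_combiner)

if __name__ == "__main__":
    print("pairs shuffled without combiner:", shuffled_without)
    print("pairs shuffled with combiner:", shuffled_with)
```

With this toy input the combiner cuts the shuffled pairs from 11 to 8; the saving grows with the amount of key repetition on each node.
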
- - - - Schedule recurring jobs
      - Workflow scheduler library for Hadoop jobs
  - - - Command-line utility for transferring data between RDBMS systems and Hadoop
      - Can be used to load directly into Hive or HBase tables
  - - - Library for working with log data
    - - much behavioral data comes from log files, so it makes sense to have a library dedicated to it
    - - streaming data
    - - agents
      - channels
      - data sinks
    - - COMPLEX
  - - - Centralized service for Hadoop config info
    - - Distributed in-memory computation
    - - it does not make sense to use it in pseudo-distributed mode, only in distributed mode
- - - - Map
        
        execute Map() function on the data
        
        execute it on each node
        
        output <key, value> pairs on each node
      - Reduce
        
        execute Reduce() function on the data
        
        execute it on some nodes
        
        aggregate sets of <key, value> pairs on some nodes
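
The Map and Reduce phases described above can be sketched in a few lines of plain Python. This is a single-process simulation, not the Hadoop API: the "nodes" are just lists, and the grouping step between the two phases (usually called the shuffle) is not listed in these notes but is added here as an assumption about what the framework does. All names and the sample input are hypothetical.

```python
from collections import defaultdict


def map_fn(line):
    """Map(): emit a <key, value> pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce(): aggregate all values seen for one key."""
    return key, sum(values)


def run_job(nodes):
    # Map runs independently on the data held by each node.
    mapped = [pair for lines in nodes for line in lines for pair in map_fn(line)]
    grouped = shuffle(mapped)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input, split across two "nodes".
    nodes = [["Hadoop stores data"], ["Hadoop processes data in parallel"]]
    print(run_job(nodes))
```
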
    - - MapReduce 1.0
        
        distributed, scalable, cheap
        
        storage
        
        HDFS triple replicated
        
        Commodity hardware
        
        Processing
        
        Parallel via Map (local) and Reduce (aggregated)
  - - - Create a class
        
        Create a static Map class
        
        Create a static Reduce class
        
        Create a main function
        4.1 Create a job
        4.2 Job calls the Map and Reduce classes
    - - Object Oriented Programming used in a functional way.
        
        What is functional?!
        
        Type of code where the state is NOT shared.
    - - Word count
    - - Standard - usually written in Java
      - Hadoop streaming - Java base
      - Hadoop pipes - C++
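
As a sketch of what a Hadoop Streaming job can look like, the example below writes the mapper and reducer as Python functions over lines of text. Streaming programs exchange records as lines, with key and value separated by a tab, and the reducer receives its input sorted by key. The `simulate` helper is a hypothetical local stand-in for `mapper | sort | reducer` so the logic can be tried without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word, one record per output line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input must already be sorted by key."""
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(records, key=lambda record: record[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def simulate(lines):
    """Local stand-in for: mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["a b a", "b c"]  # hypothetical input lines
    for record in simulate(sample):
        print(record)
```

On a real cluster the mapper and the reducer would be two separate scripts, each reading its records from standard input and printing to standard output.
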
    - - ls
        
        list folder contents
      - cat
        
        reads a file
      - mkdir
        
        makes a directory
      - cd
        
        changes directory
      - sudo command
        
        run command as admin
      - chmod file
        
        change permissions of a file
    - - hadoop fs -cat
      - hadoop fs -copyFromLocal
      - hadoop fs -put
      - sudo hadoop jar
      - hadoop fs -get
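
The `hadoop fs` commands above are typically combined into a short sequence: copy a local file in, inspect it, and copy results back. The sketch below only builds and prints the command lines, so it runs without Hadoop installed. The helper `hadoop_fs` and every path are hypothetical.

```python
import shlex


def hadoop_fs(action, *paths):
    """Build the argument list for `hadoop fs -<action> <paths...>`."""
    allowed = {"cat", "copyFromLocal", "put", "get"}  # the options listed above
    if action not in allowed:
        raise ValueError(f"unsupported action: {action}")
    return ["hadoop", "fs", f"-{action}", *paths]


# Hypothetical paths, for illustration only.
commands = [
    hadoop_fs("put", "local/input.txt", "/user/demo/input.txt"),
    hadoop_fs("cat", "/user/demo/input.txt"),
    hadoop_fs("get", "/user/demo/output.txt", "local/result.txt"),
]

if __name__ == "__main__":
    for command in commands:
        # Printed rather than executed; pass the list to subprocess.run() to run it.
        print(shlex.join(command))
```
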
  - - - Each daemon is a separate JVM. No state is shared.
- - - - Option A - Plain vanilla open source Hadoop
      - Option B - Vendor distribution
    - - Important to identify the latest stable release
    - - Option A - Local install (free but takes time)
      - Option B - Local VM (must install virtualization software)
      - Option C - Cloud (costs money to test!)
    - - Local
        
        File system (single)
        
        HDFS (pseudo or Distributed)
      - Cloud
        
        Cloud files (S3, BLOB)
        
        HDFS
    - - MapReduce
        
        Version 1.0 OR 2.0?
      - Commonly min
        
        Hive
        
        Pig
  - - - SQL-like queries, creates batch (MapReduce) jobs
    - - SQL-like query, interactive process
    - - ETL-like scripting language
    - - ML algo (clustering, data mining, decision trees, etc.)
    - - Resilient, distributed data sets
    - - Complex event processing
- - - - security etc.
  - - - Yet Another Resource Negotiator
      - layer between HDFS and MapReduce
      - Allows new type of processing, not limited to the heavy batch processing only.
      - Supports many frameworks. Hence MapReduce programming not required (?). Fits more business scenarios
    - - supports multiple MapReduce APIs in a cluster
      - Distributed job life-cycle management
      - Scalability
      - Splits Job Tracker role into:
        
        resource management
        
        job life-cycle management
      - batch or real-time processing
      - enterprise features
        
        security
        
        high availability
- - - - Need for speed
    - - Interactive Hive: 10x to 100x faster
      - in-memory and columnstore