Week 7: Big Data Concepts and Tools
Characteristics of Big Data
Volume (Scale)
Data Volume
44x increase from 2009 to 2020
From 0.8 zettabytes to 35 zettabytes
Data volume is increasing exponentially
Variety (Complexity)
Structured data, e.g. relational data (tables/transaction/legacy data), spreadsheet data
Semi-structured data, e.g. email, logs, documents
Unstructured data, e.g. videos, images, audio files, streaming data, graphs, GPS location data, simulation data, etc.
A single application can be generating/collecting many types of data
Big Public Data (online, weather, finance, etc)
To extract knowledge -> all these types of data need to be linked together (see the sketch below)
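A minimal sketch of such linking, using the pandas library with synthetic inline records (all field names and values here are invented for illustration):

# Sketch: linking structured (tabular) and semi-structured (JSON) data.
# All records, field names, and the join key are hypothetical.
import json
import pandas as pd

# Structured data: relational-style transaction records
transactions = pd.DataFrame([
    {"customer_id": 1, "amount": 25.00},
    {"customer_id": 2, "amount": 99.50},
])

# Semi-structured data: JSON log events as an application might emit them
log_lines = [
    '{"customer_id": 1, "event": "view", "page": "/shoes"}',
    '{"customer_id": 2, "event": "click", "page": "/sale"}',
]
events = pd.DataFrame([json.loads(line) for line in log_lines])

# Linking the two on a shared key yields knowledge neither holds alone:
# which pages the highest-spending customers interact with.
linked = events.merge(transactions, on="customer_id")
print(linked.sort_values("amount", ascending=False))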
Velocity (Speed)
Data is being generated fast and needs to be processed fast
Online Data Analytics
Late decisions -> missing opportunities
Examples:
E-Promotions: based on your current location, your purchase history, and what you like -> send promotions right now for the store next to you
Healthcare monitoring: sensors monitoring your activities and body -> any abnormal measurement requires an immediate reaction
Other Vs:
Veracity: accuracy, quality, truthfulness, trustworthiness
Variability: data flows can be inconsistent, with periodic peaks
Value: provides business value
Fundamentals of Big Data Analytics
Big Data by itself, regardless of size, type or speed = worthless
Big Data + "Big" Analytics = VALUE
With the value proposition, Big Data also brought about big challenges
Effectively and efficiently capturing, storing and analyzing Big Data
New breed of technologies needed (developed, purchased, hired, or outsourced)
Limitations of Data Warehouse/Relational Database
Schema (Fixed)
Scalability
Unable to handle huge volumes (terabytes, petabytes) of new/contemporary data sources
Speed
Unable to handle speed at which big data is arriving
Others
Unable to handle sophisticated processing such as machine learning
Unable to perform queries on big data efficiently
Challenges of Big Data Analytics
Data volume
The ability to capture, store, and process the huge volume of data in a timely manner
Data integration
The ability to combine data quickly and at reasonable cost
Processing capabilities
The ability to process the data quickly, as it is captured (i.e. stream analytics)
Data governance
Security, privacy, ownership, quality issues
Skill availability: shortage of data scientists
Solution cost: Return on Investment
Critical Success Factors for Big Data Analytics
A clear business need
Strong committed sponsorship
Alignment between the business & IT strategy
A fact-based decision making culture
A strong data infrastructure
The right analytics tools
Personnel with advanced analytical skills
Business Problems Addressed by Big Data Analytics
Process efficiency and cost reduction
Brand management
Revenue maximization, cross-selling/up-selling
Enhanced customer experience
Churn identification, customer recruiting
Improved customer service
Identifying new products and market opportunities
Risk management
Regulatory compliance
Enhanced security capabilities
…these are data collected in relation to behaviour
High-Performance Computing for Big Data
In-memory analytics
Storing and processing the complete data set in RAM (contrasted with a one-pass alternative in the sketch after this list)
In-database analytics
Placing analytic procedures close to where data is stored
Grid computing & MPP
Use of many machines and processors in parallel (MPP: massively parallel processing)
Appliances
Combining hardware, software, and storage in a single unit for performance and scalability
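A minimal sketch of the trade-off behind in-memory analytics: fast when the complete data set fits in RAM, while an incremental one-pass alternative keeps memory constant; this one-pass style is what grid/MPP setups parallelize across machines (the data generator is synthetic):

# Sketch: in-memory vs. incremental aggregation over the same data.
def readings():
    # Synthetic stand-in for a large data set.
    for i in range(1_000_000):
        yield (i % 97) * 0.5

# In-memory analytics: hold the complete data set in RAM, then analyze.
values = list(readings())
print(sum(values) / len(values))

# Incremental alternative: one pass, constant memory; works even when
# the data cannot fit in RAM on a single machine.
total, count = 0.0, 0
for v in readings():
    total += v
    count += 1
print(total / count)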
Popular Big Data Technologies
Hadoop
MapReduce
NoSQL (Not only SQL)
HIVE
PIG
HADOOP
Hadoop is an open-source framework for storing and analyzing massive amounts of distributed, semi-structured and unstructured data
Open source: hundreds of contributors continuously improve the core technology
Hadoop clusters run on inexpensive commodity hardware, so projects can scale out inexpensively
MapReduce + Hadoop = Big Data core technology
Hadoop Cluster
2 types of nodes
Master: Name Node and Job Tracker
Slave: Data Node and Task Tracker
Data Nodes are referred to as storage nodes, where the data is stored
The Name Node keeps track of files and directories, provides information on where in the cluster data is stored, and knows whether any nodes have failed
The Job Tracker and Task Trackers process data and are known as compute nodes
The Job Tracker initiates and coordinates jobs (the processing of data) and dispatches compute tasks to the Task Trackers (see the toy model below)
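A toy model of that master/slave bookkeeping (an illustration in Python, not Hadoop's actual code): the Name Node holds only metadata mapping files to blocks and blocks to Data Nodes, with each block replicated three times:

# Toy model of HDFS-style bookkeeping; not real Hadoop code.
import itertools

data_nodes = ["node1", "node2", "node3", "node4"]
REPLICATION = 3  # each block is stored on 3 different data nodes

# The name node keeps only metadata: file -> blocks -> locations.
name_node = {}

def store_file(filename, blocks):
    placements = {}
    cycle = itertools.cycle(data_nodes)
    for block in blocks:
        placements[block] = [next(cycle) for _ in range(REPLICATION)]
    name_node[filename] = placements

store_file("weblog.txt", ["block-0", "block-1"])

# A client asks the name node where a block lives, then reads it
# directly from one of the (replicated) data nodes.
print(name_node["weblog.txt"]["block-0"])  # e.g. ['node1', 'node2', 'node3']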
Demystifying Facts
Hadoop consists of multiple products
Hadoop is open source but available from vendors too
Hadoop is an ecosystem, not a single product
HDFS is a file system, not a DBMS
Hadoop and MapReduce are related but not the same
MapReduce provides control for analytics
Hadoop is about data diversity, not just data volume
Hadoop complements a DW; it’s rarely a replacement
Hadoop enables many types of analytics, not just Web analytics
Hadoop Technical Components
Hadoop Distributed File System (HDFS)
Name Node (primary facilitator)
Secondary Node (backup to Name Node)
Job Tracker
Slave Nodes (the grunts of any Hadoop cluster)
Additionally, the Hadoop ecosystem is made up of a number of complementary sub-projects: NoSQL (Cassandra, HBase), DW (Hive), …
NoSQL (Not Only SQL)
A new style of database that processes large volumes of multi-structured data
Often works in conjunction with Hadoop
Serves discrete data stored among large volumes of multi-structured data to end-users and Big Data applications
Examples: Cassandra, MongoDB, CouchDB, HBase, etc. (see the MongoDB sketch below)
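A hedged example with MongoDB, one of the stores listed above, via the pymongo driver; it assumes a MongoDB server running on localhost, and the database/collection/field names are invented:

# Assumes a local MongoDB instance; database/collection names are made up.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["bigdata_demo"]

# Multi-structured documents: records need not share a fixed schema.
db.events.insert_one({"user": "u1", "event": "click", "page": "/sale"})
db.events.insert_one({"user": "u2", "event": "gps", "lat": 1.35, "lon": 103.8})

# Serve discrete records back to an application by simple lookup.
for doc in db.events.find({"user": "u1"}):
    print(doc)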
How Does Hadoop Work?
Consists of 2 components: the Hadoop Distributed File System (HDFS) and MapReduce
Distributed with some centralization: Data is broken up into “parts,” which are then loaded into a file system (cluster) made up of multiple nodes (machines)
Each “part” is replicated multiple times and loaded into the file system for replication and failsafe processing
Jobs are distributed to the clients, and once completed, the results are collected and aggregated using MapReduce
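A minimal pure-Python simulation of that flow (word count, the classic illustration; not real Hadoop code): each "part" is mapped independently, the emitted pairs are shuffled by key, and a reduce step aggregates the results:

# Simulated MapReduce word count; illustrates the data flow only.
from collections import defaultdict

parts = [  # data broken up into "parts" across the cluster
    "big data needs big analytics",
    "big analytics creates value",
]

# Map phase: each part independently emits (key, value) pairs.
mapped = [(word, 1) for part in parts for word in part.split()]

# Shuffle phase: group all values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate each key's values into the final result.
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)  # {'big': 3, 'data': 1, 'needs': 1, 'analytics': 2, ...}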
Big Data and Data Warehousing
What is the impact of Big Data on DW?
Big Data and RDBMS do not go nicely together
Will Hadoop replace data warehousing/RDBMS?
Use Cases for Hadoop
Hadoop as the repository and refinery
Hadoop as the active archive
Use Cases for Data Warehousing
Data warehouse performance
Integrating data that provides business value
Interactive BI tools
Use Hadoop for storing and archiving multi-structured data
Use Hadoop for filtering, transforming, and/or consolidating multi-structured data
Use Hadoop to analyze large volumes of multi-structured data and publish the analytical results
Use a relational DBMS that provides MapReduce capabilities as an investigative computing platform
Use a front-end query tool to access and analyze data
MapReduce
Goal: achieving high performance with “simple” computers
Developed and popularized by Google
Good at processing and analyzing large volumes of multi-structured data in a timely manner
Distributes the processing of very large multi-structured data files across a large cluster of ordinary machines/processors
Used in indexing the Web for search, graph analysis, text analysis, machine learning, etc. (see the parallel-map sketch below)
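A sketch of the "many ordinary processors" idea using Python's multiprocessing module, with local worker processes standing in for cluster machines; a real MapReduce framework distributes this same map-then-merge pattern across nodes:

# Parallel map over partitions using local processes as stand-in "machines".
from multiprocessing import Pool
from collections import Counter

def map_task(part):
    # Each worker counts words in its own partition independently.
    return Counter(part.split())

if __name__ == "__main__":
    parts = ["big data needs big analytics", "big analytics creates value"]
    with Pool(processes=2) as pool:
        partials = pool.map(map_task, parts)
    # Reduce: merge the partial counts from every worker.
    total = sum(partials, Counter())
    print(total.most_common(3))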
HIVE
Hadoop-based data warehousing-like framework developed by Facebook
Allows users to write queries in an SQL-like language called HiveQL, which are then converted into MapReduce jobs
PIG
Hadoop-based query language developed by Yahoo! It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
Skills That Define a Data Scientist
Domain Expertise, Problem Definition and Decision Modeling
Data Access and Management (both traditional and new data systems)
Programming, Scripting and Hacking
Internet and Social Media/Social Networking Technologies
Curiosity and Creativity
Communication and Interpersonal Skills
Big Data and Stream Analytics
Stream Analytics Applications
e-Commerce: use of click-stream data to make product recommendations and bundles
Law Enforcement and Cyber Security: use of video surveillance and face recognition for real-time situational awareness to improve crime prevention and law enforcement
Financial Services: use of transactional data to detect fraud and illegal activities
Health Services: use of medical data to detect anomalies so as to improve patient conditions and save lives
Government: use of data from traffic sensors to change traffic light sequences and traffic lanes to ease traffic congestion
Data-in-motion analytics and real-time data analytics
One of the Vs in Big Data = Velocity
Analytic process of extracting actionable information from continuously flowing data
Why Stream Analytics?
Store-everything approach infeasible when the number of data sources increases
Need for critical event processing: complex pattern variations that need to be detected and acted on as soon as they happen (see the sketch below)
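A minimal sketch of such critical event processing (all readings and thresholds are invented): a generator stands in for data-in-motion, and a small sliding window raises an alert the moment a sustained abnormal pattern appears, without storing the stream:

# Sketch of stream analytics: act on events as they arrive, store nothing.
from collections import deque

def sensor_stream():
    # Stand-in for a continuous feed (heart rate, traffic counts, ...).
    for value in [72, 75, 74, 120, 125, 130, 76, 74]:
        yield value

WINDOW, THRESHOLD = 3, 110
window = deque(maxlen=WINDOW)

for reading in sensor_stream():
    window.append(reading)
    # Critical event: every reading in the current window is abnormal.
    if len(window) == WINDOW and all(v > THRESHOLD for v in window):
        print(f"ALERT: sustained abnormal readings {list(window)}")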