Big Data Concepts and Tools/WEBM
BIG DATA
Exponential growth in the availability and use of information, both structured and unstructured
BIG DATA = Transactions + Interactions + Observations
Data volume has increased exponentially
V's of Big Data
Velocity (Speed)
Rapidly generated data calls for faster processing
Late decisions -> missed opportunities
Online data analytics
Veracity
(Accuracy, Quality, Truthfulness/Trustworthiness)
Variability
Data flows can be inconsistent, with periodic peaks
Value
Provides Business Value
Variety (Complexity of Data)
Structured Data - Tables, Transactions, Spreadsheets
Semi-Structured Data - Emails, Logs, Documents
Unstructured Data - GPS, multimedia, etc.
Huge volumes of public-domain data (to extract information, these data sets have to be linked)
Challenges
Effective and efficient recording, storage, and analysis of big data require new technologies to be developed
Limitations of DW/RDB
Schema (Fixed)
Scalability
Unable to handle immense data volumes (TB/PB scale)
Speed
Unable to handle the velocity of data
Others
Unable to handle sophisticated processing (e.g. ML) or to perform queries on big data efficiently
Challenges regarding Big Data Analytics
Processing Capabilities
Ability to swiftly process raw incoming data
Data Governance
Security, Privacy, Ownership, Quality Issues
Data Integration
Lack of cost-effective methods for combining data swiftly
Skill availability
Shortage of data scientists
Data Volume
Capability of capturing, storing, and processing the massive volume of data promptly
Solution Cost
ROI
Success factors of BDA
Clear Business Needs
Strong Committed Sponsorship
Alignment Between Business and IT Strategy
Fact-Based Decision-Making Culture
Strong Data Infrastructure
Right Analytics Tools
Personnel With Advanced Analytical Skills
Issues addressed by BDA
Process efficiency and cost reduction
Brand Management
Revenue Maximization, Cross/Up selling
Enhanced Customer Experience
Churn Identification, Customer Recruiting, etc.
High Performance computing for Big Data
In-database analytics
Storing analytic procedures close to where the data is kept
Grid computing and MPP (Massively Parallel Processing)
Using multiple machines and processors in parallel
In-memory analytics
Storing and processing the complete data set in RAM (contrasted with in-database analytics in the sketch after this list)
Appliances
Combining hardware, software, and storage in a single unit for performance and scalability
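To make the first two approaches concrete, here is a minimal Python sketch (not from the source) contrasting in-database and in-memory analytics, using SQLite as a stand-in for a warehouse; the table and column names are illustrative.

```python
import sqlite3

# Toy sales table in SQLite, standing in for a real warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 75.0)])

# In-database analytics: push the computation to where the data lives;
# only the small aggregated result crosses the wire.
total_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()

# In-memory analytics: pull the data set into RAM first, then compute
# locally (fast, but bounded by available memory).
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
in_memory_totals = {}
for region, amount in rows:
    in_memory_totals[region] = in_memory_totals.get(region, 0.0) + amount

print(total_by_region, in_memory_totals)
```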
Big data technologies
Hadoop
MapReduce
NoSQL
HIVE
PIG
Hadoop
Engineered to store and analyse large amounts of distributed, semi-structured, and unstructured data
Together with MapReduce = Big Data Core Technology
How does it work
Two Components
Hadoop Distributed File System (HDFS)
MapReduce
MapReduce
Aims to achieve high performance with simple computation
Remarkable at processing and analysing massive volumes of multi-structured data in a timely manner
How does it work
Distributes the processing of massive multi-structured data files across a large cluster of ordinary machines/processors
Used in web indexing for search, ML, text analysis, etc.
Semi-centralised: data is fragmented into blocks and loaded into the file system (cluster), spread across multiple nodes (systems)
Each block of data is duplicated several times as it is loaded into the file system, so replication allows fail-safe processing (a toy simulation of the resulting map-shuffle-reduce flow follows below)
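A toy pure-Python simulation of the map-shuffle-reduce flow described above, using the classic word-count example; the chunk contents and function names are illustrative, not from the source.

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # Each mapper independently turns its chunk into (key, value) pairs.
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped_outputs):
    # Group values by key across every mapper's output.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_outputs):
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each reducer aggregates all values for one key.
    return {word: sum(counts) for word, counts in groups.items()}

# Three chunks, as if the file were split into blocks on three data nodes.
chunks = ["big data big", "data moves fast", "big clusters"]
mapped = [map_phase(c) for c in chunks]   # runs in parallel on a real cluster
print(reduce_phase(shuffle(mapped)))      # {'big': 3, 'data': 2, ...}
```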
Hadoop Cluster
Two node types
Slave: DataNode and TaskTracker
Master: NameNode and JobTracker
Jobs are distributed to the worker (slave) nodes; once a job is completed, the results are collated and aggregated using MapReduce (see the toy sketch below)
The cluster runs on inexpensive commodity hardware and scales out by adding machines
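A hedged sketch of this master/worker job distribution using Python's standard concurrent.futures, standing in for the JobTracker farming tasks out to TaskTrackers; the blocks and the task itself are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor

def process_block(block):
    # Stand-in for a task a TaskTracker runs on one data node:
    # count the records in one block of a file (illustrative only).
    return len(block.split("\n"))

blocks = ["a\nb\nc", "d\ne", "f\ng\nh\ni"]  # a file split into three blocks

if __name__ == "__main__":
    # JobTracker role: farm tasks out to workers, then collate the results.
    with ProcessPoolExecutor(max_workers=3) as pool:
        per_block = list(pool.map(process_block, blocks))
    print(sum(per_block))  # aggregated result: 9 records
```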
Technical Components
HDFS
NameNode (Primary Facilitator)
Secondary NameNode (Backup to the NameNode)
JobTracker
Slave Nodes
Co-existence of Hadoop and DW
Use Hadoop for storing and archiving multi-structured data
Use Hadoop for filtering, transforming and/or consolidating multi-structured data
Use Hadoop to analyse large volumes of multi-structured data and publish the analytical results
Use an RDBMS that provides MapReduce capabilities as an investigative computing platform
Use a front-end query tool to access and analyse data
Raw data streams (e.g. sensor data, blogs, images) -> file copy -> Hadoop (extract, transform) -> dev environments <=> integrated data warehouse (operational systems such as POS, CRM, SCM feed the DW via ETL) -> BI tools
NoSQL
Process large volumes of multi-structured data
Often works in conjunction with Hadoop
Serves discrete data, stored among large volumes of multi-structured data, to end users and big data applications (a minimal example follows below)
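As one concrete illustration (not from the source), a minimal sketch against a document-oriented NoSQL store using MongoDB's official Python driver, pymongo; a local MongoDB server is assumed, and the database, collection, and field names are hypothetical.

```python
from pymongo import MongoClient

# Assumes a MongoDB instance on localhost; all names are hypothetical.
client = MongoClient("mongodb://localhost:27017/")
events = client["webshop"]["click_events"]

# Schema-less inserts: each document can carry different fields,
# which suits multi-structured data.
events.insert_one({"user_id": 42, "page": "/cart", "device": "mobile"})
events.insert_one({"user_id": 42, "page": "/checkout"})

# Discrete lookup served back to an end user or application.
for doc in events.find({"user_id": 42}):
    print(doc)
```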
Due to data velocity and the need to extract information from continuous flows of data for analytical purposes
Stream Analytics
-> Store-everything approach becomes infeasible as the number of data sources increases
-> Critical event processing (Complex pattern variations that need to be detected and acted on as soon as they occur)
Example stream analytics applications. E-commerce: click-stream data is used to make recommendations and bundles (see the sketch after these examples)
Law enforcement and cyber security: CCTV and facial recognition provide real-time awareness for crime prevention and enforcement
Financial services: transactional data is used to detect fraud
etc.
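A minimal pure-Python sketch of windowed stream processing in the click-stream spirit of the e-commerce example above: events are acted on as they arrive instead of being stored wholesale. The event fields, window size, and threshold are illustrative assumptions.

```python
import time
from collections import deque, Counter

WINDOW_SECONDS = 60  # sliding window length (illustrative)

window = deque()     # (timestamp, product_id) events inside the window

def on_click(timestamp, product_id):
    """Process one click event as it arrives, without storing everything."""
    window.append((timestamp, product_id))
    # Evict events that have slid out of the window.
    while window and window[0][0] < timestamp - WINDOW_SECONDS:
        window.popleft()
    # Act on the pattern immediately, e.g. surface a trending product.
    counts = Counter(pid for _, pid in window)
    product, hits = counts.most_common(1)[0]
    if hits >= 3:  # illustrative threshold
        print(f"trending now: {product} ({hits} clicks in the last minute)")

# Simulated incoming stream: (seconds offset, product id)
now = time.time()
for offset, pid in [(0, "A"), (5, "B"), (10, "A"), (20, "A"), (70, "B")]:
    on_click(now + offset, pid)
```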