Business Intelligence (week 7-8) (Text and Web Analytics (Web Mining (KDD…

- - - - 2. Variety (Complexity)
        
        Structured data e.g. relational / spreadsheet data
        
        Semi-structured data (e.g. email / document)
        
        Unstructured data ( e.g. video, image)
        -> All these data types are linked together = To extract knowledge
      - 3. Velocity (Speed)
        
        Data needs to be processed fast
        
        Late decisions -> Missing opportunities
        
        e.g. Healthcare monitoring = Monitor activities in body -> Abnormal reaction -> Require immediate help
      - 1. Volume (Scale)
        
        Large amount of data
        
        Generated data has been increasing exponentially
    - - 5. Variability
        
        Data flows can be inconsistent with periodic peaks
      - 6. Value
        
        Provide business value
      - 4. Veracity
        
        Quality, Accuracy
  - - - Goal: Achieve high performance with "simple" computers.
        
        Good at processing large volumes of multi-structured data in a timely manner.
        
        Used in indexing the web for search, graph, text analysis and machine learning.
    - - A new style of database that processes large volumes of multi-structured data.
        
        Often works in conjunction with Hadoop.
        
        E.g. Cassandra
    - - Definition: An open-source framework for storing and analyzing massive amounts of distributed, semi-structured and unstructured data.
        
        Open source = Hundreds of contributors continuously improve the core technology.
        
        Runs on inexpensive commodity hardware, so projects can scale out inexpensively.
        
        MapReduce + Hadoop = Big data core technology.
      - How does it work
        
        Distributed with some centralization:
        Data is broken up into "parts" which are then loaded into a file system (clusters), made up of multiple nodes (machines).
        
        Each part is replicated multiple times and loaded into the file system for redundancy and failsafe processing
        
        Consists of 2 components
        
        Hadoop distributed file system (HDFS)
        
        MapReduce
        
        Jobs are distributed to the nodes and, once completed, the results are collected and aggregated using MapReduce.
      - Co-existence with DW
        
        Uses Hadoop for filtering, transforming and consolidating multi-structured data
        
        Uses Hadoop to analyze the data and publish the analytical results.
        
        Uses Hadoop for storing and archiving multi-structured data
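The flow described under "How does it work" (break data into parts, replicate each part across nodes, process the parts, then aggregate) can be illustrated with a small single-machine sketch. This is not Hadoop code: the function names, the node names, the round-robin placement and the replication factor of 3 are all assumptions chosen for illustration, and a word count stands in for the job.

```python
from collections import defaultdict


def split_into_parts(records, part_size):
    """Break the input into fixed-size "parts" (stand-ins for file system blocks)."""
    return [records[i:i + part_size] for i in range(0, len(records), part_size)]


def place_replicas(num_parts, nodes, replication=3):
    """Assign every part to `replication` distinct nodes (simple round-robin).

    Round-robin placement is an assumption of this sketch, not how a real
    cluster decides where replicas go.
    """
    copies = min(replication, len(nodes))
    return {
        part_id: [nodes[(part_id + r) % len(nodes)] for r in range(copies)]
        for part_id in range(num_parts)
    }


def map_phase(part):
    """Map step: emit a (word, 1) pair for every word in the part."""
    return [(word.lower(), 1) for line in part for word in line.split()]


def reduce_phase(pairs):
    """Reduce step: sum the values collected for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical input data and cluster.
lines = ["big data big value", "data needs speed", "big volume"]
nodes = ["node1", "node2", "node3", "node4"]

parts = split_into_parts(lines, part_size=1)
placement = place_replicas(len(parts), nodes, replication=3)

# Each part is mapped independently (in a cluster, on a node holding a replica),
# then all intermediate pairs are collected and reduced.
mapped = [pair for part in parts for pair in map_phase(part)]
word_counts = reduce_phase(mapped)

print(placement)
print(word_counts)
```

Because every part lives on several nodes, the map step for a part can be rerun on another replica if one node fails, which is the "failsafe processing" the notes refer to.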
    - - e.g. Cloudera, MapR, Hortonworks
  - - - E-commerce
        
        Use of click stream data to make product recommendations and bundles
      - Smart lamp posts
        
        Sensors for humidity, rainfall and temperature.
        
        Track criminals
        
        Detect suspicious activities
- - - - Web Structure Mining
        
        The development of useful information from the links included in the web documents.
        
        Web pages include hyperlinks, e.g. authoritative pages, hubs.
      - Web Usage Mining (Web analytics)
        
        Extraction of information from click-stream analysis of web server logs generated through web page visits and transactions.
        
        Gathers click-stream data and analyzes the click-stream in web server logs.
      - Web Content Mining
        
        The extraction of useful information from web pages
        
        Data collection via web crawlers.
        
        Used for competitive intelligence, sentiment analysis and information collection.
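For web structure mining, the idea of hubs and authoritative pages can be made concrete with a short sketch. The notes do not name an algorithm; the code below assumes the classic hubs-and-authorities iteration (HITS), and the page names and link graph are hypothetical.

```python
import math


def hits(links, iterations=20):
    """Score pages as hubs and authorities from a link graph.

    links: dict mapping each page to the list of pages it links to.
    Returns two dicts: (hub scores, authority scores).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # A page is a good authority if good hubs link to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for target in targets:
                auth[target] += hub[src]

        # A page is a good hub if it links to good authorities.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}

        # Normalise both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm

    return hub, auth


# Hypothetical link graph: two pages that link out, two pages that are linked to.
links = {
    "portal": ["a", "b"],
    "blog": ["a"],
    "a": [],
    "b": [],
}

hub, auth = hits(links)
best_hub = max(hub, key=hub.get)
best_authority = max(auth, key=auth.get)
print(best_hub, best_authority)
```

In this toy graph the page with the most outgoing links to well-cited pages scores highest as a hub, and the page cited by both linking pages scores highest as an authority.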
    - - Step 4: Modelling
        
        Association Rule Mining
        
        Unsupervised Clustering e.g. usage profile
      - Step 5: Evaluation
        
        Interpret results of a web page mining session using association rule mining and clustering
        
        E.g. if p4 & p5, then p10
        
        Using confidence and support
      - Step 2 & 3: Data Understanding and Preparation
        
        3 main Issues with data preparation
        
        Identify unwanted log file entries
        
        A single page request usually generates multiple log file entries from several types of server.
        
        A technique should be used to identify unwanted log file entries.
        
        Data transformation
        
        New attributes can be added to user session records to improve prediction of outcomes
        
        e.g. average purchase amount of a repeat customer.
        
        Differentiate individual user sessions
        
        Host addresses are not useful, as multiple users may access a site from the same host
        
        Best solution to differentiate user sessions: place a cookie that contains session information
      - Step 6: Deployment
        
        Targeted marketing communication by sending emails and setting up online advertisements to promote products.
        
        Personalisation of web pages based on user profiles, either manually or through data mining
      - Step 1: Business Understanding
        Goals for web-based mining:
        
        Improve usability of web site by decreasing the number of pages visited by a customer before a purchase transaction.
        
        Personalize web page for customers.
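The preparation and evaluation steps above can be sketched end to end: drop unwanted log entries, group page requests by session cookie, then score the rule "if p4 & p5, then p10" with support and confidence. The log entries, the cookie values and the list of unwanted file suffixes are all hypothetical, and this is a minimal illustration rather than a full association rule miner.

```python
from collections import defaultdict

# Hypothetical web server log entries: (session cookie, requested path).
log_entries = [
    ("s1", "/p4"), ("s1", "/logo.png"), ("s1", "/p5"), ("s1", "/p10"),
    ("s2", "/p4"), ("s2", "/p5"), ("s2", "/style.css"),
    ("s3", "/p4"), ("s3", "/p5"), ("s3", "/p10"),
    ("s4", "/p10"), ("s4", "/p7"),
]

# Assumed rule for "unwanted" entries: requests for images, styles and scripts
# that a single page view triggers alongside the page itself.
UNWANTED_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js")


def build_sessions(entries):
    """Filter unwanted entries and group the visited pages by session cookie."""
    sessions = defaultdict(set)
    for cookie, path in entries:
        if path.endswith(UNWANTED_SUFFIXES):
            continue
        sessions[cookie].add(path.lstrip("/"))
    return dict(sessions)


def support(sessions, items):
    """Fraction of sessions that contain every page in `items`."""
    items = set(items)
    matching = sum(1 for pages in sessions.values() if items <= pages)
    return matching / len(sessions)


def confidence(sessions, antecedent, consequent):
    """Of the sessions containing the antecedent, the fraction that also contain the consequent."""
    base = support(sessions, antecedent)
    if base == 0:
        return 0.0
    return support(sessions, set(antecedent) | set(consequent)) / base


sessions = build_sessions(log_entries)
rule_support = support(sessions, {"p4", "p5", "p10"})
rule_confidence = confidence(sessions, {"p4", "p5"}, {"p10"})

print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

With this sample data, two of the four sessions contain all three pages, and two of the three sessions that visited p4 and p5 also visited p10.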
    - - e.g. Amazon
        
        Good recommendations of products or services
        
        Optimized web navigation.