Week 8: Text & Web Analytics (Web Mining (KDD for Web Mining (Step 3:…

- - - - Structured vs. unstructured data
      - Data Mining: Structured data in databases
      - Text Mining: Unstructured data, e.g. Word documents, PDF files, text excerpts, XML files
  - - - Increase cross-selling and up-selling by analyzing call-center data
      - Blogs, user reviews of products reveal user sentiments
      - Customer relationship management to increase overall lifetime value of customer
    - - Spam filtering
      - Deception detection
    - - DNA analysis, analysis of gene expression etc
    - - Retrieval of information to answer specific queries
  - - - Syntax ambiguity
      - Imperfect/irregular input (accents, intonation)
      - Speech acts
    - - A laboriously hand-coded database of English words, their definitions, sets of synonyms and various semantic relations between synonym sets
      - A major resource for NLP
      - Needs automation to be completed
    - - A technique used to detect favorable and unfavorable opinions toward specific products and services
      - SentiWordNet
  - - - Collect all relevant unstructured data
      - Digitize, standardize the collection
      - Place the collection in a common place
    - - Goal: create TDM where the cells are filled with the most appropriate indices
    - - Classification (text categorization)
      - Clustering (natural groupings of text)
      - Association
        
        Confidence: % of documents containing A that also include the concepts of C (for a rule A → C)
        
        Support: % of documents that include both A & C
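
The term-document matrix (TDM) and the association measures above can be illustrated with a minimal sketch. The toy corpus, the function names, and the choice of raw term counts as cell values are assumptions made for illustration; other indices (binary, log-scaled, or TF-IDF weights) may be more appropriate in practice.

```python
from collections import Counter

# Hypothetical toy corpus; each string stands in for one document.
docs = [
    "customer complains about late delivery",
    "late delivery leads to refund request",
    "customer praises fast delivery",
    "refund request after late delivery",
]

def build_tdm(documents):
    """Return (terms, matrix) where matrix[i][j] counts term j in document i."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix

def support(terms, matrix, concepts):
    """Fraction of documents that contain every concept in `concepts`."""
    cols = [terms.index(c) for c in concepts]
    hits = sum(all(row[j] > 0 for j in cols) for row in matrix)
    return hits / len(matrix)

def confidence(terms, matrix, antecedent, consequent):
    """Among documents containing the antecedent, the fraction that also contain the consequent."""
    base = support(terms, matrix, antecedent)
    if base == 0:
        return 0.0
    return support(terms, matrix, antecedent + consequent) / base

terms, tdm = build_tdm(docs)
s = support(terms, tdm, ["late", "refund"])       # both A and C
c = confidence(terms, tdm, ["late"], ["refund"])  # C given A
print(f"support={s:.2f} confidence={c:.2f}")  # support=0.50 confidence=0.67
```

Here A = {late} and C = {refund}: two of the four documents contain both terms (support 0.50), and two of the three documents containing "late" also contain "refund" (confidence 0.67).
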
- - - - Authoritative pages
      - Hubs: List of recommended links to authoritative pages
      - Hyperlink-Induced Topic Search (HITS) algorithm
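
The relationship between hubs and authoritative pages can be sketched as an iterative score update in the style of HITS. This is a simplified illustration, not a full implementation: the link graph, the page names, and the fixed iteration count are assumptions.

```python
import math

def normalize(scores):
    """Scale scores so that their squares sum to 1 (avoids unbounded growth)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}

def hits(graph, iterations=50):
    """graph maps page -> list of pages it links to. Returns (authority, hub) dicts."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages that link in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for t in targets:
                new_auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages linked to.
        new_hub = {p: sum(new_auth[t] for t in graph.get(p, [])) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub

# Hypothetical link graph: two list-style pages pointing at content pages.
links = {
    "portal": ["docs", "news"],
    "directory": ["docs"],
    "docs": [],
    "news": [],
}
auth, hub = hits(links)
print(sorted(auth, key=auth.get, reverse=True))
print(sorted(hub, key=hub.get, reverse=True))
```

In this toy graph, "docs" ends up with the highest authority score because both list pages link to it, and "portal" ends up with the highest hub score because it links to both content pages.
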
  - - - data stored in server access logs, referrer logs, agent logs and client-side cookies
      - user characteristics and usage profiles
      - metadata, such as page attributes, content attributes and usage data
  - - - Plausible goals for web-based mining
        
        Improve usability of the web site, e.g. decrease the average number of pages visited by a customer before a purchase transaction
        
        Personalise web pages for customers
        
        Determine products for sale at a web site that tend to be purchased or viewed together
    - - 1st issue: Differentiate individual user sessions
        
        Host addresses are of limited help because multiple users may access a site from the same host
        
        May be able to differentiate different user sessions by combining host addresses with the referring page
        
        Best able to differentiate user sessions if cookies (data files) are allowed to be placed on users' computers to contain session information
      - 2nd issue: Identify unwanted log file entries
        
        A single user page request oftentimes generates multiple log file entries from several types of servers
        
        Must have a technique to identify unwanted log file entries so that they do not become part of the session file
      - 3rd issue: Data transformation
        
        New attributes can be added to user session records to improve prediction of outcome
        
        e.g. average purchase amount of repeat customers, time of most recent transaction
    - - Modelling Techniques: Association Rule Mining & Unsupervised clustering
      - usage profiles for clusters
      - Form clusters of similar user session instances to personalise pages viewed by web site users
    - - Interpret results of a web-based mining session using association rule mining and clustering
    - - Targeted Marketing Communications
      - Webpage Usability Optimisation
      - Personalisation of web pages based on user profiles
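
The three data-preparation issues listed earlier (differentiating user sessions, dropping unwanted log file entries, and adding attributes) can be sketched together as follows. Everything specific here is an assumption for illustration: the pre-parsed log tuples, the list of unwanted file suffixes, the 30-minute inactivity cutoff, and the derived attributes.

```python
from datetime import datetime, timedelta

# Hypothetical, already-parsed log entries: (host, cookie id or None, timestamp, path).
LOG = [
    ("10.0.0.5", "c1", "2024-01-01 10:00:00", "/home"),
    ("10.0.0.5", "c1", "2024-01-01 10:00:01", "/static/logo.png"),
    ("10.0.0.5", "c2", "2024-01-01 10:00:30", "/home"),
    ("10.0.0.5", "c1", "2024-01-01 10:02:00", "/product/42"),
    ("10.0.0.5", "c1", "2024-01-01 11:30:00", "/home"),
]

UNWANTED_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js", ".ico")  # assumed filter list
SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity cutoff

def build_sessions(entries):
    """Group log entries into per-user sessions and attach simple derived attributes."""
    sessions = []
    open_session = {}  # user key -> that user's most recent session
    # Timestamps in this format sort correctly as plain strings.
    for host, cookie, stamp, path in sorted(entries, key=lambda e: e[2]):
        # 2nd issue: skip entries generated by embedded resources.
        if path.lower().endswith(UNWANTED_SUFFIXES):
            continue
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        # 1st issue: prefer the cookie id; the host alone may be shared by several users.
        user = cookie or host
        current = open_session.get(user)
        if current is None or when - current["end"] > SESSION_TIMEOUT:
            current = {"user": user, "start": when, "end": when, "pages": []}
            sessions.append(current)
            open_session[user] = current
        current["pages"].append(path)
        current["end"] = when
    # 3rd issue: add derived attributes to each session record.
    for sess in sessions:
        sess["page_count"] = len(sess["pages"])
        sess["duration_seconds"] = (sess["end"] - sess["start"]).total_seconds()
    return sessions

sessions = build_sessions(LOG)
for sess in sessions:
    print(sess["user"], sess["pages"], sess["duration_seconds"])
```

All five entries share one host address, yet the cookie ids separate them into two users, and the long gap before the last request starts a new session for the first user. Session records of this kind are the input for the association rule mining and clustering steps.
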