Week 8: Text & Web Analytics
Text Mining
Concepts
85-90% of all corporate data is in some unstructured form
Unstructured corporate data is doubling in size every 18 months
Text Analytics: includes information retrieval, information extraction, data mining and web mining
Data Mining VS Text Mining
Both seek novel and useful patterns
Both are semi-automated processes
Difference in the nature of the data
Structured VS Unstructured data
Data Mining: Structured data in databases
Text Mining: Unstructured data, e.g. Word documents, PDF files, text excerpts, XML files
Applications
Marketing
Increase cross-selling and up-selling by analyzing call-center data
Blogs, user reviews of products reveal user sentiments
Customer relationship management to increase overall lifetime value of customer
Security Applications
Spam filtering
Deception detection
Biomedical
DNA analysis, analysis of gene expression etc
Academic
Retrieval of information to answer specific queries
Application Area
Information extraction: identification of key phrases and relationships within text by looking for patterns
Topic Tracking: Based on a user profile and documents, text mining can predict other documents of interest to the user
Summarization: to save time for the reader
Categorization: identifying the main themes of a document and placing it into a predefined set of categories
Clustering: group similar documents together
Concept Linking: connects related documents by identifying their shared concepts
Question Answering: finding best answer to a given question through knowledge-driven pattern matching
Terminology: Unstructured/semistructured data, Corpus, Terms, Concepts, Stemming, Stop Words, Synonyms, Tokenizing, Term dictionary, Word frequency, Part-of-speech tagging, Morphology, Term-by-document matrix, Singular value decomposition
Natural Language Processing (NLP)
Structuring a collection of text
A very important component in text mining
Considers grammatical and semantic constraints as well as context
Syntax ambiguity
Imperfect/irregular input (accents, intonation)
Speech acts
WordNet
A laboriously hand-coded database of English words, their definitions, sets of synonyms and various semantic relations between synonym sets
A major resource for NLP
Needs automation to be completed
Sentiment Analysis
A technique used to detect favorable and unfavorable opinions toward specific products and services
SentiWordNet
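A lexicon such as SentiWordNet attaches polarity information to words. The sketch below shows the general idea of lexicon-based scoring only: the tiny `LEXICON`, its scores, the `NEGATORS` set and the function names are all hypothetical and are not taken from SentiWordNet.

```python
import re

# Hypothetical polarity scores for illustration; NOT values from SentiWordNet.
LEXICON = {"good": 1.0, "great": 1.5, "love": 2.0,
           "bad": -1.0, "poor": -1.5, "hate": -2.0}
NEGATORS = {"not", "never", "no"}


def sentiment_score(text):
    """Sum word polarities, flipping the sign of the word right after a negator."""
    score = 0.0
    negate = False
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in NEGATORS:
            negate = True
            continue
        polarity = LEXICON.get(token, 0.0)
        score += -polarity if negate else polarity
        negate = False  # negation only affects the next token
    return score


def label(text):
    """Map the numeric score to a coarse polarity label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label("I love this great phone"))
print(label("The battery is not good"))
```

This only handles explicit sentiment words and one-step negation; implicit opinions would need richer NLP than a word list.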
Challenges of Text Mining
Part-of-speech tagging: depends not only on the definition of the term but also on the context in which it is used
Text segmentation: e.g. analysis of free-form text found in e-mails and recorded telephone transcripts
Syntactic ambiguities: e.g. grammar ambiguity
Text contains acronyms, abbreviations, misspellings
Imperfect or irregular input, e.g. foreign accents
Speech acts and semantic analysis: understanding the meaning of words
Process
Step 1: Establish the corpus
Collect all relevant unstructured data
Digitize, standardize the collection
Place the collection in a common place
Step 2: Create the Term-by-Document Matrix
Goal: create TDM where the cells are filled with the most appropriate indices
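A minimal sketch of Step 2, assuming raw term counts as the cell indices (other indices, such as binary or log-scaled frequencies, are possible). The three-document corpus, the stop-word list and the function names are hypothetical, and no stemming is applied.

```python
import re
from collections import Counter

# Hypothetical corpus and stop-word list, for illustration only.
corpus = {
    "doc1": "The customer praised the fast delivery",
    "doc2": "The delivery was late and the customer complained",
    "doc3": "Fast support resolved the complaint",
}
STOP_WORDS = {"the", "was", "and", "a", "an", "of"}


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_tdm(docs):
    """Return (terms, matrix); matrix[doc][i] is the raw count of terms[i] in doc."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    terms = sorted(set().union(*counts.values()))
    matrix = {name: [c[t] for t in terms] for name, c in counts.items()}
    return terms, matrix


terms, tdm = build_tdm(corpus)
print(terms)
for name, row in tdm.items():
    print(name, row)
```

Without stemming, "complained" and "complaint" stay separate columns, which is one reason stemming and synonym handling appear in the terminology list.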
Step 3: Extract patterns/knowledge
Classification (text categorization)
Clustering (natural groupings of text)
Association
Confidence: of the documents that include A, the % that also include C
Support: % of all documents that include both A & C
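For a rule A → C, the two measures can be computed as sketched below, taking confidence in its usual sense (the share of documents containing A that also contain C). The document sets and function names are hypothetical.

```python
# Hypothetical documents, each reduced to the set of concepts it mentions.
documents = [
    {"price", "delivery"},
    {"price", "quality"},
    {"price", "delivery", "quality"},
    {"delivery"},
]


def support(docs, a, c):
    """Fraction of all documents that contain both concept a and concept c."""
    return sum(1 for d in docs if a in d and c in d) / len(docs)


def confidence(docs, a, c):
    """Fraction of the documents containing a that also contain c."""
    with_a = [d for d in docs if a in d]
    if not with_a:
        return 0.0
    return sum(1 for d in with_a if c in d) / len(with_a)


s = support(documents, "price", "delivery")
conf = confidence(documents, "price", "delivery")
print(f"support={s:.2f} confidence={conf:.2f}")
```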
Sentiment Analysis
Sentiment: a settled opinion reflective of one's feelings
An integral part of CRM and customer experience management systems
May be categorized by topic
Answer the question: "What do people feel about a certain topic?"
Positive/Negative, Explicit/Implicit
Applications
Voice of customer (VOC) gets data from full set of customer touch points - emails, surveys, call center notes, social media postings
VOC is a key element of customer experience management initiatives
Voice of the market (VOM): understanding aggregate opinions and trends
VOM is about knowing what stakeholders are saying about your products and services
Web Mining
The Web is the largest repository of data
Data is in HTML, XML, text format
Web Usage Mining Applications
Determine the lifetime value of clients
Design cross-marketing strategies across products
Target electronic ads and coupons at user groups based on user access patterns
Predict user behavior based on previously learned rules and users' profiles
Present dynamic information to users based on their interests and profiles
Challenges
The Web is too big for effective data mining
The Web is too complex
The Web is dynamic
The Web is not specific to a domain
The Web has everything
Web mining: the process of discovering intrinsic relationships from Web data (textual, linkage or usage)
Content Mining
Source: unstructured textual content of the Web pages (usually in HTML format)
The extraction of useful information from Web pages (textual content)
Used for competitive intelligence, information/news/opinion collection, sentiment analysis and automated data collection
Structure Mining
Source: the uniform resource locator (URL) links contained in the Web pages
The development of useful information from the links included in the Web documents
Web pages include hyperlinks
Authoritative pages
Hubs: List of recommended links to authoritative pages
Hyperlink-induced topic search (HITS) algorithm
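HITS scores each page twice: as an authority (linked to by good hubs) and as a hub (linking to good authorities). The notes only name the algorithm; the sketch below uses the standard iterative update with L2 normalization, on a hypothetical four-page link graph.

```python
import math

# Hypothetical link graph: page -> pages it links to.
links = {
    "portal": ["news", "docs"],
    "blog": ["docs"],
    "news": ["docs"],
    "docs": [],
}


def normalize(scores):
    """Scale scores to unit L2 norm; leave them unchanged if all are zero."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: (v / norm if norm else 0.0) for k, v in scores.items()}


def hits(graph, iterations=50):
    """Return (authority, hub) score dictionaries after a fixed number of rounds."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for t in targets:
                new_auth[t] += hub[src]
        # Hub update: sum of authority scores of the pages it links to.
        new_hub = {p: sum(new_auth[t] for t in graph.get(p, [])) for p in pages}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


auth, hub = hits(links)
print("top authority:", max(auth, key=auth.get))
print("top hub:", max(hub, key=hub.get))
```

A fixed iteration count keeps the sketch short; a convergence check on the score change would be the more careful stopping rule.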
Usage Mining
Source: the detailed description of a Web site's visits
Extraction of information from clickstream analysis of web server logs generated through Web page visits and transactions
data stored in server access logs, referrer logs, agent logs and client-side cookies
user characteristics and usage profiles
metadata, such as page attributes, content attributes and usage data
Clickstream analysis of web server logs
Clickstream data
KDD for Web Mining
Step 1 : Business Understanding
Plausible goals for web-based mining
Improve usability of the web site, e.g. decrease the average number of pages visited by a customer before a purchase transaction
Personalise web pages for customers
Determine products for sale at a web site that tend to be purchased or viewed together
Step 2: Data Understanding
Step 3: Data Preparation
1st issue: Differentiate individual user sessions
Host addresses are of limited help because multiple users may access a site from the same host
May be able to differentiate different user sessions by combining host addresses with the referring page
Best able to differentiate user sessions if cookies (data files) are allowed to be placed on users' computers to contain session information
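One way the session-identification heuristics could be combined is sketched below: a cookie session id wins when present; otherwise a request joins the most recent session from the same host that has already visited its referring page, and starts a new session if there is none. The entry fields (`host`, `page`, `referrer`, `cookie`) and the sample data are hypothetical, and entries are assumed to be parsed and in chronological order.

```python
# Hypothetical, already-parsed log entries in chronological order.
entries = [
    {"host": "10.0.0.5", "page": "/home", "referrer": None, "cookie": None},
    {"host": "10.0.0.5", "page": "/shoes", "referrer": "/home", "cookie": None},
    {"host": "10.0.0.5", "page": "/books", "referrer": None, "cookie": None},
    {"host": "10.0.0.5", "page": "/cart", "referrer": "/shoes", "cookie": None},
    {"host": "10.0.0.5", "page": "/books/python", "referrer": "/books", "cookie": None},
    {"host": "10.0.0.9", "page": "/home", "referrer": None, "cookie": "c42"},
    {"host": "10.0.0.9", "page": "/cart", "referrer": "/home", "cookie": "c42"},
]


def assign_sessions(log):
    """Group entries into sessions; each session is a list of pages in visit order."""
    sessions = []
    by_cookie = {}  # cookie id -> session index
    by_host = {}    # host -> indices of cookie-less sessions seen from that host
    for e in log:
        cookie = e.get("cookie")
        if cookie is not None:
            # Cookie available: it identifies the session directly.
            if cookie not in by_cookie:
                by_cookie[cookie] = len(sessions)
                sessions.append([])
            idx = by_cookie[cookie]
        else:
            # No cookie: fall back to host address plus referring page.
            candidates = by_host.setdefault(e["host"], [])
            idx = next((i for i in reversed(candidates)
                        if e["referrer"] in sessions[i]), None)
            if idx is None:
                idx = len(sessions)
                sessions.append([])
                candidates.append(idx)
        sessions[idx].append(e["page"])
    return sessions


for session in assign_sessions(entries):
    print(session)
```

In the sample, two users behind the same host are separated by their referrer chains, which host addresses alone could not do.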
2nd issue: Identify unwanted log file entries
A single user page request often generates multiple log file entries from several types of servers
Must have a technique to identify unwanted log file entries so that they do not become part of the session file
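The notes do not say which technique to use. One simple possibility, shown here as an assumption, is to drop requests for embedded resources (images, stylesheets, scripts) by file extension so that only page views reach the session file. The extension list and sample requests are hypothetical.

```python
import os
from urllib.parse import urlparse

# Assumed list of resource types that do not represent a page view.
UNWANTED_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico"}


def is_page_view(url):
    """Return True unless the URL path ends in an unwanted resource extension."""
    path = urlparse(url).path  # strips any query string such as ?v=3
    extension = os.path.splitext(path)[1].lower()
    return extension not in UNWANTED_EXTENSIONS


# Hypothetical requested URLs taken from a server access log.
requests = [
    "/products/shoes.html",
    "/img/logo.PNG",
    "/static/site.css?v=3",
    "/cart",
    "/js/app.js",
]
page_views = [u for u in requests if is_page_view(u)]
print(page_views)
```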
3rd issue: Data transformation
New attributes can be added to user session records to improve prediction of outcome
e.g. average purchase amount of repeat customers, time of most recent transaction
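A sketch of how such derived attributes might be computed from raw transactions. The transaction tuples, customer ids and attribute names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical transactions: (customer id, ISO timestamp, purchase amount).
transactions = [
    ("c1", "2024-03-01T10:15:00", 40.0),
    ("c1", "2024-03-09T18:30:00", 60.0),
    ("c2", "2024-03-05T09:00:00", 25.0),
]


def derive_attributes(rows):
    """Return per-customer derived attributes for use in later modelling."""
    grouped = defaultdict(list)
    for customer, timestamp, amount in rows:
        grouped[customer].append((datetime.fromisoformat(timestamp), amount))
    features = {}
    for customer, items in grouped.items():
        amounts = [amount for _, amount in items]
        features[customer] = {
            "n_purchases": len(items),
            "is_repeat": len(items) > 1,  # more than one transaction
            "avg_purchase": sum(amounts) / len(amounts),
            "last_purchase": max(t for t, _ in items),
        }
    return features


features = derive_attributes(transactions)
for customer, attrs in features.items():
    print(customer, attrs)
```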
Step 4: Modelling
Modelling Techniques: Association Rule Mining & Unsupervised clustering
usage profiles for clusters
Form clusters of similar user session instances to personalise pages viewed by web site users
Step 5: Evaluation
Interpret results of a web-based mining session using association rule mining and clustering
Step 6: Deployment
Targeted Marketing Communications
Webpage Usability Optimisation
Personalisation of web pages based on user profiles