Week 8: Text and Web Analytics
Text Mining Concepts
85-90 percent of all corporate data is in some kind of unstructured form (e.g., text)
Unstructured corporate data is doubling in size every 18 months
Tapping into these information sources is not optional; it is necessary to stay competitive
Answer: text mining
A semi-automated process of extracting knowledge from unstructured data sources
a.k.a. text data mining or knowledge discovery in textual databases
Text Analytics: includes information retrieval, information extraction, data mining and web mining
Data Mining vs Text Mining
Both seek novel and useful patterns
Both are semi-automated processes
Difference is the nature of the data:
Data Mining : Structured data in databases
Text Mining : Unstructured data, e.g. Word documents, PDF files, text excerpts, XML files, and so on
Text Mining Application
Text Mining Application Area
Information extraction: identification of key phrases and relationships within text by looking for patterns
Topic tracking: Based on a user profile and documents, text mining can predict other documents of interest to the user
Summarization: to save time for the reader
Categorization: identifying the main themes of a document and placing it into a predefined set of categories
Clustering: group similar documents together
Concept linking: connects related documents by identifying their shared concepts
Question answering: finding best answer to a given question through knowledge-driven pattern matching
Text Mining Applications
Marketing
Increase cross-selling and up-selling by analyzing call-center data
Blogs, user reviews of products reveal user sentiments
Customer relationship management to increase overall lifetime value of customer
Security Applications
Spam filtering
Deception detection
Biomedical
DNA analysis, gene expression analysis, etc.
Academic
Retrieval of information to answer specific queries
Text Mining Terminology
Unstructured/semistructured data
Corpus (and corpora)
Terms
Concepts
Stemming
Stop words (and include words)
Synonyms (and polysemes)
Tokenizing
Term dictionary
Word frequency
Part-of-speech tagging
Morphology
Term-by-document matrix
Occurrence matrix
Singular value decomposition
Latent semantic indexing
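Several of the terms above (tokenizing, stop words, word frequency, term-by-document matrix) can be illustrated with a minimal sketch. The toy corpus, the stop-word list and the function names are assumptions made up for illustration; real projects would normally add stemming and a larger stop-word list.

```python
import re
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
CORPUS = {
    "doc1": "Text mining extracts knowledge from text.",
    "doc2": "Web mining extracts knowledge from the Web.",
}
STOP_WORDS = {"from", "the"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_by_document_matrix(corpus, stop_words):
    """Return (terms, matrix), where matrix[doc][i] is the count of terms[i]."""
    # Word frequency per document, ignoring stop words.
    counts = {
        doc: Counter(t for t in tokenize(text) if t not in stop_words)
        for doc, text in corpus.items()
    }
    # The term dictionary is the sorted set of all remaining terms.
    terms = sorted(set().union(*counts.values()))
    matrix = {doc: [c[t] for t in terms] for doc, c in counts.items()}
    return terms, matrix


terms, tdm = term_by_document_matrix(CORPUS, STOP_WORDS)
```

Each row of `tdm` is one document and each column is one term from `terms`, which is the structure that singular value decomposition and latent semantic indexing operate on.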
Natural Language Processing (NLP)
Structuring a collection of text
Old Approach: Bag-of-words (classify text-based documents into 2 or more predetermined classes or to cluster them into natural groupings)
New Approach: Natural language processing
A very important component in text mining
A subfield of artificial intelligence and computational linguistics
The study of "understanding" the natural human language
Considers grammatical and semantic constraints as well as context:
Syntax ambiguity
Imperfect or irregular input (accents, intonation)
Speech acts (e.g. pass me the salt, made a pass)
WordNet
A laboriously hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets
A major resource for NLP
Needs automation to be completed
Sentiment Analysis
A technique used to detect favorable and unfavorable opinions toward specific products and services
SentiWordNet
Challenges of Text Mining (NLP)
Part-of-speech tagging: Depends not only on the definition of the term but also on the context in which it is used
Text segmentation: e.g. analysis of free-form text found in e-mails and recorded telephone transcripts
Syntactic ambiguities e.g. grammar ambiguity
Text contains acronyms, abbreviations, misspellings. e.g. customer, cust, customar, csmr
Imperfect or irregular input e.g. foreign accents
Speech acts and semantic analysis: understanding the meaning of words (i.e. book = to reserve vs book = a manual)
Text Mining Process
Step 1: Establish the corpus
Step 2: Create the Term-by-Document Matrix (TDM)
Step 3: Extract patterns/knowledge
Classification (text categorization)
Clustering (natural groupings of text)
Improve search recall & precision
Association (e.g. A => C)
Support : % of documents that include both A & C
Confidence: % of documents that include A that also include C
Trend Analysis
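The support and confidence measures for an association A => C can be sketched as follows. This assumes the standard definitions (support is the share of all documents containing both A and C; confidence is the share of documents containing A that also contain C). The document sets and function names are hypothetical.

```python
def support(docs, a, c):
    """Fraction of all documents that contain both concept a and concept c."""
    both = sum(1 for d in docs if a in d and c in d)
    return both / len(docs)


def confidence(docs, a, c):
    """Fraction of the documents containing a that also contain c."""
    with_a = [d for d in docs if a in d]
    if not with_a:
        return 0.0  # the rule never fires, so report zero confidence
    return sum(1 for d in with_a if c in d) / len(with_a)


# Hypothetical corpus: each document is reduced to the set of concepts it mentions.
DOCS = [{"A", "C"}, {"A"}, {"A", "B", "C"}, {"B"}]

rule_support = support(DOCS, "A", "C")        # 2 of 4 documents
rule_confidence = confidence(DOCS, "A", "C")  # 2 of the 3 documents with A
```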
Sentiment Analysis
What is it?
A settled opinion reflective of one's feelings
An integral part of CRM and customer experience management systems
May be categorized by topic
Answers the question : "What do people feel about a certain topic?"
Positive or negative, explicit or implicit
A lexicon (catalog of words, their synonyms and meanings) is created e.g. WordNet
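A lexicon-based approach can be sketched in a few lines. The miniature lexicon and its scores below are invented for illustration and are not taken from WordNet or SentiWordNet. The sketch handles only explicit sentiment and ignores negation and context.

```python
import re

# Hypothetical miniature lexicon: word -> polarity score.
LEXICON = {
    "good": 1, "great": 2, "love": 2,
    "bad": -1, "poor": -1, "terrible": -2,
}


def sentiment_score(text, lexicon=LEXICON):
    """Sum the polarity scores of all lexicon words found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(t, 0) for t in tokens)


def polarity(text):
    """Map the numeric score to a positive / negative / neutral label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `polarity("Great phone, I love it")` scores 4 and is labeled positive, while a sentence with no lexicon words is labeled neutral.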
Sentiment Analysis Applications
Voice of the customer (VOC) gets data from full set of customer touch points - emails, surveys, call center notes, social media postings
VOC is a key element of customer experience management initiatives
Voice of the market (VOM) : understanding aggregate opinions and trends
VOM is about knowing what stakeholders are saying about your products and services
Web Mining
Overview
Web is the largest repository of data
Data is in HTML, XML, text format
Web usage mining applications
Determine the lifetime value of clients
Design cross-marketing strategies across products
Target electronic ads and coupons at user groups based on user access patterns
Predict user behavior based on previously learned rules and users' profiles
Present dynamic information to users based on their interests and profiles
Web Mining is the process of discovering intrinsic relationships from Web data (textual, linkage, or usage)
Challenges
Challenges (of processing Web data)
The Web is too big for effective data mining
The Web is too complex
The Web is too dynamic
The Web is not specific to a domain
The Web has everything
Opportunities and challenges are great
3 Types of Web Mining
Web Structure Mining
Source: The uniform resource locator (URL) links contained in the Web pages
The development of useful information from the links included in the Web documents
Web pages include hyperlinks
Authoritative pages
Hubs: List of recommended links to authoritative pages
Hyperlink-induced topic search (HITS) algorithm
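The relationship between hubs and authorities can be illustrated with a minimal sketch of the HITS iteration: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The link graph, page names and iteration count are assumptions for illustration.

```python
import math


def hits(links, iterations=20):
    """links maps each page to the pages it links to. Returns (hub, authority)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add up hub scores of all pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub update: add up authority scores of all pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph: "portal" and "blog" recommend other sites.
LINKS = {"portal": ["siteA", "siteB"], "blog": ["siteA"]}
hub, auth = hits(LINKS)
```

In this toy graph `siteA` ends up with the highest authority score (two pages link to it) and `portal` with the highest hub score (it links to both authorities).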
Web Usage Mining
Source: The detailed description of a Web site's visits (sequence of clicks by sessions)
Extraction of information from clickstream analysis of web server logs generated through Web page visits and transactions
Data stored in server access logs, referrer logs, agent logs and client-side cookies
User characteristics and usage profiles
Metadata, such as page attributes, content attributes, and usage data
Clickstream data
Clickstream analysis of web server logs
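A first step in clickstream analysis is usually grouping raw log records into sessions. The sketch below assumes the log has already been parsed into (visitor, timestamp, page) tuples and uses a 30-minute inactivity timeout, which is a common convention rather than something specified in these notes. All names and records are hypothetical.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that ends a session (assumed)

# Hypothetical parsed log records: (visitor_id, unix_timestamp, page).
CLICKS = [
    ("u1", 0, "/home"),
    ("u1", 60, "/products"),
    ("u2", 100, "/home"),
    ("u1", 4000, "/home"),
    ("u1", 4100, "/checkout"),
]


def sessionize(clicks, timeout=SESSION_TIMEOUT):
    """Split each visitor's clicks into sessions (ordered lists of pages)."""
    by_visitor = defaultdict(list)
    for visitor, ts, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        by_visitor[visitor].append((ts, page))

    sessions = []
    for visitor, events in by_visitor.items():
        current = [events[0][1]]
        for (prev_ts, _), (ts, page) in zip(events, events[1:]):
            if ts - prev_ts > timeout:
                # Gap too long: close the current session and start a new one.
                sessions.append((visitor, current))
                current = []
            current.append(page)
        sessions.append((visitor, current))
    return sessions


sessions = sessionize(CLICKS)
```

Each resulting session is a click sequence that can then feed association rule mining or clustering of usage profiles.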
Web Content Mining
Source: Unstructured textual content of the Web Pages (Usually in HTML format)
The extraction of useful information from Web pages (textual content)
Data collection via web crawlers e.g. Googlebot
Used for competitive intelligence, information/news/opinion collection, sentiment analysis, and automated data collection
KDD processes
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Modeling Techniques for web-based mining:
Association Rule Mining
Unsupervised Clustering
Usage profiles for clusters
Form clusters of similar user session instances to personalise pages viewed by web site users