Text and Web Analytics
Text Mining
Text Analytics
includes information retrieval, information extraction, data mining and web mining
Text Mining Applications
- Marketing
- Increase cross-selling and up-selling by analyzing call-center data
- Blogs, user reviews of products reveal user sentiments
- Customer relationship management to increase overall lifetime
value of customer
- Security Applications
- Spam filtering
- Deception detection
- Biomedical
- DNA analysis, analysis of gene expression etc
- Academic
- Retrieval of information to answer specific queries
Text Mining Application Area
- Information Extraction - identification of key phrases and relationships within text by looking for patterns
- Topic tracking - Based on a user profile and the documents the user has viewed, text mining can predict other documents of interest to the user
- Summarization - to save time for the reader
- Categorization - identifying the main themes of a
document and placing it into a predefined set of categories
- Clustering - group similar documents together
- Concept linking - connects related documents by
identifying their shared concepts
- Question answering - finding best answer to a given question through knowledge-driven pattern matching
Web Mining
- Web is the largest repository of data
- Data is in HTML, XML, text format
- Web mining is the process of discovering intrinsic relationships from web data (textual, linkage, or usage)
Web usage mining applications
- Determine the lifetime value of clients
- Design cross-marketing strategies across products.
- Target electronic ads and coupons at user groups based on
user access patterns
- Predict user behavior based on previously learned rules and users' profiles
- Present dynamic information to users based on their interests and profiles
Web Mining Challenges
- The Web is too big for effective data mining
- The Web is too complex
- The Web is too dynamic
- The Web is not specific to a domain
- The Web has everything
KDD for Web Mining
- Business Understanding
- Plausible goals for web-based mining:
- Improve usability of the web site - decrease the average number of pages visited by a customer before a purchase transaction
- Personalize web pages for customers
- Determine products for sale at a web site that tend to be purchased or viewed together
- Data Understanding
Typical Web Server Log File:
- User’s host address
- Date & Time
- Request
- Status
- Bytes
- Referring Page
- Browser Type
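The fields listed above map closely onto the widely used Combined Log Format. The following is a minimal sketch, assuming that format; the sample line, the `parse_log_line` helper, and the field names in the pattern are illustrative assumptions, not taken from any particular server.

```python
import re
from datetime import datetime

# Assumed layout: Combined Log Format
# host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict with the seven fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # Servers write "-" when no body was sent; treat that as zero bytes.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

# Hypothetical log entry for illustration only.
sample = ('192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] '
          '"GET /products/p4.html HTTP/1.1" 200 2326 '
          '"http://example.com/index.html" "Mozilla/5.0"')
record = parse_log_line(sample)
print(record["host"], record["request"], record["status"])
```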
- Data Preparation
Issues with data preparation
- 1st issue: Differentiate individual user session
- Host addresses alone are not sufficient because multiple users may access a site from the same host
- May be able to differentiate different user sessions by combining host addresses with the referring page
- User sessions are best differentiated when cookies (data files
placed on users’ computers to hold session information) are
allowed
- 2nd issue: Identify unwanted log file entries
- A single user page request oftentimes
generates multiple log file entries from several
types of servers (e.g. image servers)
- Must have a technique to identify unwanted log
file entries so that they do not become part of
the session file
- 3rd issue: Data transformation
- New attributes can be added to user session records to improve prediction of outcome
- Examples: Average purchase amount of repeat customers & Time of most recent transaction
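The first two issues can be sketched as follows. This is an illustration under stated assumptions, not a prescribed method: the list of unwanted file suffixes, the 30-minute inactivity timeout, the entry field names, and the fallback from cookie to host plus browser type are all choices made for the example.

```python
from datetime import datetime, timedelta

# Assumptions for this sketch: which requests count as "unwanted" and
# how long a pause ends a session.
UNWANTED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".ico")
SESSION_TIMEOUT = timedelta(minutes=30)

def is_unwanted(entry):
    """Second issue: drop entries for embedded resources such as images."""
    return entry["path"].lower().endswith(UNWANTED_SUFFIXES)

def user_key(entry):
    """First issue: prefer the cookie; fall back to host plus browser type."""
    return entry.get("cookie") or (entry["host"], entry["agent"])

def build_sessions(entries):
    """Group the remaining page requests into per-user sessions."""
    sessions = []
    current = {}    # user key -> index of that user's open session
    last_seen = {}  # user key -> time of that user's previous request
    for entry in sorted(entries, key=lambda e: e["time"]):
        if is_unwanted(entry):
            continue
        key = user_key(entry)
        expired = key in last_seen and entry["time"] - last_seen[key] > SESSION_TIMEOUT
        if key not in last_seen or expired:
            sessions.append({"user": key, "pages": []})
            current[key] = len(sessions) - 1
        sessions[current[key]]["pages"].append(entry["path"])
        last_seen[key] = entry["time"]
    return sessions

# Hypothetical entries: two users behind the same host, told apart by cookie.
t0 = datetime(2023, 10, 10, 13, 0)
base = {"host": "192.0.2.10", "agent": "Mozilla/5.0"}
entries = [
    {**base, "cookie": "c1", "time": t0, "path": "/p4.html"},
    {**base, "cookie": "c1", "time": t0 + timedelta(minutes=1), "path": "/img/logo.png"},
    {**base, "cookie": "c2", "time": t0 + timedelta(minutes=2), "path": "/p10.html"},
    {**base, "cookie": "c1", "time": t0 + timedelta(minutes=5), "path": "/p15.html"},
    {**base, "cookie": "c1", "time": t0 + timedelta(hours=2), "path": "/p4.html"},
]
sessions = build_sessions(entries)
for session in sessions:
    print(session["user"], session["pages"])
```

The third issue (data transformation) would then add derived attributes, such as the number of pages per session, to each session record.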
- Modelling
Modeling Techniques for web-based mining:
- Association Rule Mining
- Unsupervised clustering:
Usage profiles for clusters
Form clusters of similar user session instances to personalise pages viewed by web site users
- Evaluation
Interpret results of a web-based mining session using association rule mining and clustering
- Association Rule generated:
IF P4 & P10 THEN P15
(Confidence: ¾ = 75%, Support: ¾ = 75%)
- Interpretation:
- Webpage Usability Optimisation – If direct links do not exist between the 3 pageviews, the web site indexing structure can be modified to place one or more direct links between the pages
- Summary statistics of Web site activities complement the interpretation and evaluation of data mining results
- Statistics can be produced by Web server log
analyzers (e.g. awstats, webalizer, etc)
How often a Web site is visited
How many times an individual fills a shopping cart but fails to complete the transaction
Which web site products are best and worst sellers
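The confidence and support figures quoted for the rule IF P4 & P10 THEN P15 can be reproduced from a session file. In this sketch the four sessions are hypothetical, chosen only so that the numbers match the 3/4 values above; `rule_metrics` is an illustrative helper name.

```python
def rule_metrics(sessions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Sessions that contain every page on the left-hand side of the rule.
    n_antecedent = sum(1 for pages in sessions if antecedent <= pages)
    # Sessions that contain every page on both sides of the rule.
    n_both = sum(1 for pages in sessions if (antecedent | consequent) <= pages)
    support = n_both / len(sessions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical session file: each session is the set of pages viewed.
sessions = [
    {"P4", "P10", "P15"},
    {"P2", "P4", "P10", "P15"},
    {"P4", "P10", "P15"},
    {"P4", "P7", "P10"},
]
support, confidence = rule_metrics(sessions, {"P4", "P10"}, {"P15"})
print(f"support={support:.0%} confidence={confidence:.0%}")
```

Here all four sessions contain P4 and P10, and three of them also contain P15, which gives both measures the value 3/4.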
Text Mining Process
Step 1: Establish the corpus
- Collect all relevant unstructured data (e.g., textual
documents, XML files, emails, Web pages, short notes,
voice recordings…)
- Digitize, standardize the collection (e.g., all in ASCII text
files)
- Place the collection in a common place (e.g., in a flat file,
or in a directory as separate files)
Step 2: Create the Term-by-Document Matrix
Goal : to create TDM where the cells are filled with the
most appropriate indices
- Should all terms be included?
- Stop words, include words
- Synonyms, homonyms
- Stemming : reduction of words to their roots, e.g. modeling,
modeled, models all reduce to model
- What is the best representation of the indices (values in cells)?
- Raw counts; binary frequencies; log frequencies
- How can we reduce the dimensionality of the TDM?
- Manual - a domain expert goes through it
- Eliminate terms with very few occurrences in very few
documents (Filtering)
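Step 2 can be sketched in plain Python. This is a minimal illustration, not a full pipeline: the stop-word list, the three documents, and the `min_docs` threshold are assumptions, the log weighting 1 + ln(count) is one common choice among several, and stemming and synonym handling are left out.

```python
import math
import re
from collections import Counter

# Assumed stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}

def tokenize(text):
    """Lowercase, split into alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_tdm(docs):
    """Return (terms, matrix) with one row per term and one column per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix

def to_binary(matrix):
    """Binary frequencies: 1 if the term occurs in the document, else 0."""
    return [[1 if v else 0 for v in row] for row in matrix]

def to_log(matrix):
    """Log frequencies: 1 + ln(count) for non-zero counts."""
    return [[1 + math.log(v) if v > 0 else 0.0 for v in row] for row in matrix]

def filter_rare(terms, matrix, min_docs=2):
    """Dimensionality reduction: keep terms found in at least min_docs documents."""
    keep = [i for i, row in enumerate(matrix) if sum(1 for v in row if v) >= min_docs]
    return [terms[i] for i in keep], [matrix[i] for i in keep]

# Hypothetical corpus of three short documents.
docs = [
    "Text mining is the mining of text",
    "Web mining and text mining",
    "The web is dynamic",
]
terms, raw = build_tdm(docs)
binary = to_binary(raw)
logged = to_log(raw)
kept_terms, kept_rows = filter_rare(terms, raw)
for term, row in zip(terms, raw):
    print(f"{term:8s} {row}")
```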
Step 3: Extract patterns / Knowledge
- Classification (text categorization)
- Clustering (natural groupings of text)
- Improve search recall & precision
- Association (e.g. A => C)
- Confidence: % of the documents that include A which also include C
- Support : % of all documents that include both A & C
- Trend Analysis
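For the association rule A => C in Step 3, the two measures can be written out explicitly. The notation is an assumption made for this sketch: $D$ is the set of documents in the corpus, and a document "includes" a concept set when every concept in that set appears in it.

```latex
\[
\operatorname{support}(A \Rightarrow C)
  = \frac{\lvert \{\, d \in D : A \subseteq d \ \text{and}\ C \subseteq d \,\} \rvert}{\lvert D \rvert},
\qquad
\operatorname{confidence}(A \Rightarrow C)
  = \frac{\lvert \{\, d \in D : A \subseteq d \ \text{and}\ C \subseteq d \,\} \rvert}
         {\lvert \{\, d \in D : A \subseteq d \,\} \rvert}.
\]
```

For example, if 100 documents are in the corpus, 40 include A, and 30 include both A and C, then support is 30/100 = 30% and confidence is 30/40 = 75%.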
-
- Unstructured or semistructured data
- Corpus (and corpora)
- Terms
- Concepts
- Stemming
- Stop words
- Synonyms
- Tokenizing
- Term Dictionary
- Word frequency
- Part of speech tagging
- Morphology
- Term-by-document matrix - Occurrence matrix
- Singular value decomposition - Latent semantic indexing
WordNet
- A laboriously hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets
- A major resource for NLP
- Need automation to be completed
Sentiment Analysis
- A technique used to detect favorable and unfavorable opinions toward specific products and services
- SentiWordNet
- A settled opinion reflective of one’s feelings
- An integral part of CRM and customer
experience management systems
- May be categorized by topic
- Answers the question : “What do people feel
about a certain topic?”
- Positive or negative, explicit or implicit
- A lexicon (catalog of words, their synonyms and
meanings) is created eg Wordnet
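A lexicon-based approach can be sketched in a few lines. The tiny lexicon below is hypothetical and stands in for a real resource such as SentiWordNet; the scoring rule (sum of word polarities) is a deliberate simplification that only captures explicit sentiment and ignores negation and context.

```python
import re

# Hypothetical mini-lexicon: word -> polarity (+1 favorable, -1 unfavorable).
LEXICON = {
    "good": 1, "great": 1, "love": 1, "excellent": 1,
    "bad": -1, "poor": -1, "terrible": -1, "hate": -1,
}

def sentiment_score(text):
    """Sum the polarities of all lexicon words found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(token, 0) for token in tokens)

def classify(text):
    """Map the score to a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Hypothetical customer comments.
reviews = [
    "Great phone, I love the camera",
    "Terrible battery and poor support",
    "It arrived on Tuesday",
]
labels = [classify(review) for review in reviews]
print(labels)
```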
Sentiment analysis application
- Voice of the customer (VOC) draws data from the full set of customer touch points – emails, surveys, call center notes, social media postings
- VOC is a key element of customer experience
management initiatives
- Voice of the market (VOM) : understanding
aggregate opinions and trends.
- VOM is about knowing what stakeholders are
saying about your products and services.
- Web Content Mining
Source: Unstructured textual content of the web pages (usually in HTML format)
- The extraction of useful information from Web pages (textual content)
- Data collection via web crawlers eg Googlebot
- Used for competitive intelligence, information/news/opinion collection,
sentiment analysis, and automated
data collection
- Web Structure Mining
Source: the uniform resource locator (URL) links contained in the Web pages
- The development of useful information from the links included in the web documents
- Web pages include hyperlinks
- Authoritative pages
- Hubs : List of recommended links to authoritative pages
- Hyperlink-induced topic search (HITS) algorithm
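The relationship between hubs and authorities can be illustrated with a small HITS sketch. The link graph, the page names, and the fixed iteration count are assumptions made for this example; production implementations usually iterate until the scores converge and work on a query-specific subgraph.

```python
def hits(links, iterations=50):
    """Iteratively compute hub and authority scores.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to this page.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of authority scores of the pages this page links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Hypothetical link graph.
links = {
    "portal": ["shop", "news"],
    "directory": ["shop", "news", "blog"],
    "blog": ["shop"],
}
hub, auth = hits(links)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

In this graph the page with the most outgoing links to well-linked pages scores highest as a hub, and the page that receives the most links from good hubs scores highest as an authority.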
- Web usage Mining
Source: the detailed description of a web site's visits (sequence of clicks by sessions)
- Extraction of information from clickstream analysis of web server logs generated through web page visits and transactions
- data stored in server access logs, referrer logs, agent logs, and client-side cookies
- User characteristics and usage profiles
- metadata, such as page attributes, content attributes, and usage data
- Clickstream data
- Clickstream analysis of web server logs
- Deployment
Possible actions to take based on results of web-based mining:
Targeted Marketing Communications
- Set up online advertising promotions
- Send e-mail to promote products of likely interest to a select group of registered customers
- Modify content of a Web site by grouping products likely to be purchased together
Webpage Usability Optimisation:
- Adapt the indexing structure of a Web site to better reflect the paths followed by typical users and the changes in needs of users over time.
Personalisation of web pages based on user profiles:
- Implement strategy to personalize web pages
- Manual : force users to register at web site
- Data mining to automate personalisation based on actual user behaviour
Data Preparation extracts relevant data from web server logs and creates a session file suitable for data mining