Text and Web Analytics
Text Mining
Concepts
- Unstructured corporate data is doubling in size every 18 months
- Tapping into these information sources is not optional; it is necessary to stay competitive
- Text Mining: A semi-automated process of extracting knowledge from unstructured data sources
- Text Analytics: Includes information retrieval, information extraction, data mining, and web mining
Applications
1. Marketing
- Increase cross-selling and up-selling by analyzing call-centre data
- Blogs and user reviews of products reveal user sentiment
- Customer relationship management to increase the overall lifetime value of each customer
2. Security Applications
- Spam filtering
- Deception detection
3. Biomedical
- DNA analysis, analysis of gene expression, etc.
4. Academic
- Retrieval of information to answer specific questions
Application Areas
1. Information extraction
- Identification of key phrases and relationships within the text by looking for patterns
2. Topic tracking
- Based on a user profile and documents, text mining can predict other documents of interest to the user
3. Summarization
- Saves time for the reader
4. Categorization
- Identify the main themes of a document and place it into a predefined set of categories
5. Clustering
- Group similar documents together
6. Concept linking
- Connects related documents by identifying their shared concepts
7. Question answering
- Find the best answer to a given question through knowledge-driven pattern matching
Terminologies
- Unstructured / Semi-structured data
- Corpus / Corpora
- Terms
- Concepts
- Stemming
- Stop words / Include words
- Synonyms / Polysemes
- Tokenizing
- Term dictionary
- Word frequency
- Part-of-speech tagging
- Morphology
- Term-by-document matrix (Occurrence matrix)
- Singular value decomposition (Latent semantic indexing)
Process
Step 1: Establish the corpus
- Collect all relevant unstructured data
- Digitize, standardize the collection
- Place the collection in a common place
Step 2: Create the Term-by-Document Matrix
- Goal: Create a TDM whose cells are filled with the most appropriate indices
- Term selection: Stop words, Include words; Synonyms, Homonyms; Stemming (reduction of words to their roots)
- Best representation: Raw counts; Binary frequencies; Log frequencies
- Reduce dimensionality of TDM: Manually or by filtering terms
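Step 2 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the stop-word list, the toy suffix-stripping stemmer, the function names, and the rows-are-terms orientation are all assumptions made for the example. The log weighting shown, 1 + ln(tf), is one common choice among several.

```python
import math
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "to", "in", "are"}


def naive_stem(word):
    """Toy suffix stripper standing in for a real stemmer (hypothetical helper)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split into alphabetic tokens, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


def build_tdm(corpus, weighting="raw"):
    """Return (terms, matrix) with one row per term and one column per document."""
    counts = [Counter(tokenize(doc)) for doc in corpus]
    terms = sorted(set().union(*counts))

    def weight(tf):
        if weighting == "binary":
            return 1 if tf > 0 else 0
        if weighting == "log":
            return 1 + math.log(tf) if tf > 0 else 0.0
        return tf  # raw counts

    matrix = [[weight(c[t]) for c in counts] for t in terms]
    return terms, matrix


corpus = ["Mining text data", "Text mining mines the text", "Web data"]
terms, tdm = build_tdm(corpus)
for term, row in zip(terms, tdm):
    print(term, row)
```

Note that the toy stemmer maps "mining" and "mines" to different roots, which is exactly the kind of gap a real stemmer and a synonym list are meant to close.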
Step 3: Extract patterns/knowledge
- Classification (text categorization)
- Clustering (natural groupings of text)
- Association (Confidence/Support)
- Trend Analysis
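Clustering and classification both need a way to compare documents. A common choice, assumed here rather than taken from the notes, is cosine similarity between document vectors (the columns of a term-by-document matrix). The sketch below uses made-up vectors and hypothetical function names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


def most_similar_pair(doc_vectors):
    """Return (similarity, i, j) for the closest pair of documents, or None."""
    best = None
    for i in range(len(doc_vectors)):
        for j in range(i + 1, len(doc_vectors)):
            score = cosine_similarity(doc_vectors[i], doc_vectors[j])
            if best is None or score > best[0]:
                best = (score, i, j)
    return best


# Illustrative document vectors over four terms (invented data).
docs = [[1, 1, 0, 0], [2, 1, 0, 0], [0, 0, 1, 1]]
print(most_similar_pair(docs))
```

Merging the most similar pair repeatedly is the basic move behind agglomerative clustering; a production system would normally use a library implementation instead.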
Web Mining
The process of discovering intrinsic relationships from Web Data (textual, linkage, or usage)
Web Content Mining
- Extraction of useful information from Web pages
- Data collection via web crawlers
- Used for competitive intelligence, information/news/opinion collection, sentiment analysis, and automated data collection
- Source: Unstructured textual content of the Web pages (usually in HTML format)
Web Structure Mining
- Development of useful information from the links included in Web documents
- Source: The URL links contained in the Web pages
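As a rough sketch of structure mining, the snippet below pulls the links out of a few pages with Python's standard `html.parser` and counts how many distinct pages point at each URL. The page contents, the example.com URLs, and the `in_degree` helper are invented for illustration; inbound-link counting is only one simple signal that can be derived from link structure.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def in_degree(pages):
    """pages maps URL -> HTML; returns a Counter of inbound links per target URL."""
    counts = Counter()
    for url, html in pages.items():
        for target in set(extract_links(html, url)):
            if target != url:  # ignore self-links
                counts[target] += 1
    return counts


pages = {
    "http://example.com/a": '<a href="/b">B</a> <a href="c">C</a>',
    "http://example.com/b": '<a href="/c">C</a>',
}
print(in_degree(pages))
```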
Web Usage Mining
- Extraction of information from click-stream analysis of web server logs generated through web page visits and transactions
- Source: Detailed description of a Web site's visits (sequence of clicks by sessions)
KDD for Web Mining
Step 1: Business Understanding
- Plausible goals for web-based mining
- Improve the usability of a website
- Personalize web pages for customers
- Determine products for sale at a website that tend to be purchased or viewed together
Steps 2 & 3: Data Understanding and Data Preparation
Issues with Data Preparation:
- 1st Issue: Differentiate individual user sessions
- Host addresses alone are not reliable because multiple users may access a site from the same host
- May be able to differentiate different user sessions by combining host addresses with the referring page
- User sessions are best differentiated when cookies (data files) can be placed on users' computers to hold session information
- 2nd Issue: Identify unwanted log file entries
- A single user page request oftentimes generates multiple log file entries from several types of servers
- Must have a technique to identify unwanted log file entries so that they do not become part of the session file
- 3rd Issue: Data transformation
- New attributes can be added to user session records to improve prediction of outcome
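The three preparation issues can be sketched together. The log-entry layout (dicts with `host`, `cookie`, `time`, `path`), the list of unwanted file suffixes, and the 30-minute inactivity timeout are assumptions for this example, not details from the notes. The visitor is identified by cookie when one is present and by host address otherwise.

```python
from collections import defaultdict

# Assumed: requests for embedded resources are the "unwanted" entries.
UNWANTED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".ico")
SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that ends a session (assumed)


def is_page_request(path):
    """Issue 2: keep page views, drop images, stylesheets, scripts."""
    return not path.lower().endswith(UNWANTED_SUFFIXES)


def session_record(visitor, hits):
    """Issue 3: add derived attributes (page count, duration) to each session."""
    return {
        "visitor": visitor,
        "pages": [h["path"] for h in hits],
        "page_count": len(hits),
        "duration": hits[-1]["time"] - hits[0]["time"],
    }


def build_sessions(entries):
    """Issue 1: group hits per visitor and split on long gaps."""
    by_visitor = defaultdict(list)
    for entry in entries:
        if is_page_request(entry["path"]):
            visitor = entry.get("cookie") or entry["host"]
            by_visitor[visitor].append(entry)

    sessions = []
    for visitor, hits in by_visitor.items():
        hits.sort(key=lambda e: e["time"])
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            if hit["time"] - prev["time"] > SESSION_TIMEOUT:
                sessions.append(session_record(visitor, current))
                current = []
            current.append(hit)
        sessions.append(session_record(visitor, current))
    return sessions


log = [
    {"host": "10.0.0.1", "cookie": "u1", "time": 0, "path": "/index.html"},
    {"host": "10.0.0.1", "cookie": "u1", "time": 5, "path": "/logo.png"},
    {"host": "10.0.0.1", "cookie": "u2", "time": 10, "path": "/index.html"},
    {"host": "10.0.0.1", "cookie": "u1", "time": 60, "path": "/products.html"},
    {"host": "10.0.0.1", "cookie": "u1", "time": 4000, "path": "/index.html"},
]
for session in build_sessions(log):
    print(session)
```

In this invented log both visitors share one host address, so grouping by host alone would merge them into a single session; the cookie value is what keeps them apart.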
Step 4: Modeling
- Association Rule Mining
- Unsupervised clustering
- Usage profiles for each cluster
- Form clusters of similar user session instances to personalise pages viewed by website users
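To make association rule mining concrete, the sketch below computes support and confidence for page pairs across user sessions. It is restricted to rules with a single item on each side; the thresholds, page names, and the `pair_rules` function are hypothetical, and full algorithms such as Apriori extend the same idea to larger itemsets.

```python
from collections import Counter
from itertools import combinations


def pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) for page pairs seen together.

    support    = sessions containing both pages / all sessions
    confidence = sessions containing both pages / sessions containing lhs
    """
    n = len(sessions)
    item_counts = Counter()
    pair_counts = Counter()
    for pages in sessions:
        items = sorted(set(pages))  # count each page once per session
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


# Invented sessions: each is the list of pages one visitor viewed.
sessions = [
    ["/shoes", "/socks"],
    ["/shoes", "/socks", "/hats"],
    ["/shoes"],
    ["/hats"],
]
for lhs, rhs, support, confidence in pair_rules(sessions):
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Rules that clear both thresholds are candidates for grouping products on a page or for targeted promotions.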
Step 5: Evaluation
- Interpret results of a web-based mining session using association rule mining and clustering
- Summary statistics of website activities complement the interpretation and evaluation of data mining results
- Statistics can be produced by Web server log analyzers
Step 6: Deployment
- Targeted Marketing Communications
- Set up online advertising promotions
- Send e-mail to promote products (to a group of likely interested registered customers)
- Modify the content of the website (by grouping products likely to be purchased together)
- Webpage Usability Optimisation
- Adapt the indexing structure of a website to better reflect the paths followed by typical users and the changes in needs of users over time
- Personalisation of web pages based on user profiles
- Implement strategies to personalize web pages
- Manual: force users to register
- Data mining to automate personalization based on user behaviour