Week 8: Text and Web Analytics

Data Mining versus Text Mining

Data Mining: Structured data in databases

Text Mining: Unstructured data, e.g., Word documents, PDF files, text excerpts, XML files, and so on.

Similarities

Both seek novel and useful patterns

Both are semi-automated processes

Differences: Nature of the data

Text Mining Applications

Security Applications

Biomedical

Marketing

Academic

Text Mining Application Areas

Summarization

Categorization

Topic tracking

Clustering

Information extraction

Concept linking

Question answering

Text Mining Terminology

Concepts

Stemming

Terms

Stop words (and include words)

Corpus (and corpora)

Synonyms (and polysemes)

Unstructured or semistructured data

Tokenizing

Term dictionary

Word frequency

Part-of-speech tagging

Morphology

Term-by-document matrix

Singular value decomposition
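
Several of the terms above (tokenizing, stop words, stemming, word frequency) can be illustrated with a minimal sketch. The stop-word list and the suffix-stripping rule here are assumptions made for illustration only; a real stemmer such as Porter's uses far more elaborate rules.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to"}

def tokenize(text):
    """Tokenizing: split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Stemming: strip a few common suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def term_frequencies(text):
    """Word frequency: tokenize, drop stop words, stem, and count the terms."""
    terms = [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(terms)

freqs = term_frequencies("The miners are mining text and the text is mined")
```

Note that the crude rule reduces both "mining" and "mined" to the same stem, which is the point of stemming: different surface forms are counted as one term.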

Natural Language Processing (NLP)

A very important component in text mining

A subfield of artificial intelligence and computational linguistics

Structuring of a collection of text

The study of understanding the natural human language

Considers grammatical and semantic constraints as well as context

Challenges

Text segmentation

Syntactic ambiguities

Part-of-speech tagging

Imperfect or irregular input

Speech acts and semantic analysis

WordNet

Sentiment Analysis

Needs automation to be completed

Text Mining Process

Step 1: Establish the corpus

Step 2: Create the term-by-document matrix (TDM)

Step 3: Extract patterns/knowledge
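
The three steps of the text mining process can be sketched end to end on a toy example. The three documents are made up for illustration, and cosine similarity between document columns is only one possible choice for the pattern-extraction step; clustering or classification would be equally valid.

```python
import math
import re
from collections import Counter

# Establish the corpus (three made-up documents for illustration).
corpus = {
    "doc1": "text mining finds patterns in text",
    "doc2": "web mining finds patterns in web data",
    "doc3": "stock prices fell sharply today",
}

# Create the term-by-document matrix: rows are terms, columns are documents.
counts = {name: Counter(re.findall(r"[a-z]+", body.lower()))
          for name, body in corpus.items()}
terms = sorted({t for c in counts.values() for t in c})
docs = list(corpus)
tdm = [[counts[d][t] for d in docs] for t in terms]

# Extract patterns: here, cosine similarity between two document columns.
def cosine(doc_a, doc_b):
    a = [row[docs.index(doc_a)] for row in tdm]
    b = [row[docs.index(doc_b)] for row in tdm]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

similar = cosine("doc1", "doc2")    # documents sharing several terms
unrelated = cosine("doc1", "doc3")  # documents sharing no terms
```

In this sketch the two mining-related documents score higher than the unrelated pair, because they share terms such as "mining" and "patterns".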

Web Mining

Process of discovering intrinsic relationships from Web data (textual, linkage, or usage)
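
As a small illustration of mining linkage data, the sketch below counts inbound links per page, a very simple indicator of which pages other pages point to most. The page names and link structure are hypothetical, and real link analysis would use richer measures than a raw count.

```python
from collections import Counter

# Hypothetical link data: each page maps to the pages it links to.
links = {
    "home": ["products", "blog"],
    "blog": ["products", "home"],
    "products": ["home"],
    "about": ["home", "products"],
}

def inbound_link_counts(link_map):
    """Count how many distinct pages link to each target page."""
    counts = Counter()
    for source, targets in link_map.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts

inbound = inbound_link_counts(links)
```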

KDD for Web Mining

Step 1: Business Understanding

Step 2: Data Understanding

Step 3: Data Preparation

Step 4: Modelling

Step 5: Evaluation

Step 6: Deployment