Text Analytics
Text Preparation
Tokenization
(S3 p12-13)
To break a stream of characters into tokens
Done by identifying token delimiters
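A minimal sketch of delimiter-based tokenization, assuming whitespace and common punctuation are the token delimiters (the delimiter set and the `tokenize` helper are illustrative choices, not taken from the slides):

```python
import re

def tokenize(text, delimiters=r"[\s.,;:!?()\"]+"):
    """Split text wherever a run of delimiter characters occurs."""
    # re.split can yield empty strings at the edges, so filter them out
    return [tok for tok in re.split(delimiters, text) if tok]

tokens = tokenize("Text mining, in short: turn raw text into tokens!")
print(tokens)
```

Real tokenizers also handle contractions, hyphens, and abbreviations, which this sketch ignores.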
Case Normalization
(S3 p14)
Text Normalization
(S3 p15-17)
Stemming
Removing suffixes from words to create a so-called root word
Lemmatization
Takes the meaning of a word into account when converting it to its base form (lemma)
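The difference between the two can be sketched as follows. The suffix list and the lemma lookup table are hypothetical toy data; this is not the Porter stemmer or a real lemmatizer, only an illustration of suffix stripping versus dictionary-style lookup:

```python
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative only, not a real stemmer's rule set

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a lexical resource
LEMMA_TABLE = {"better": "good", "ran": "run", "mice": "mouse"}

def naive_lemma(word):
    """Return the dictionary base form if known, else the word itself."""
    return LEMMA_TABLE.get(word, word)

stems = [naive_stem(w) for w in ["jumping", "jumped", "boxes", "cats", "better"]]
lemmas = [naive_lemma(w) for w in ["better", "ran", "mice"]]
print(stems)
print(lemmas)
```

Note that the stemmer leaves "better" unchanged, while the lookup maps it to "good"; that is the kind of case where meaning matters.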
Stopword Removal
(S3 p18-19)
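Case normalization and stopword removal are often applied together. A small sketch, assuming a short hand-picked stopword list (real lists are much longer and language-specific):

```python
# Small illustrative stopword list, assumed for this example
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def remove_stopwords(tokens):
    """Lowercase each token, then drop those found in the stopword list."""
    lowered = (tok.lower() for tok in tokens)
    return [tok for tok in lowered if tok not in STOPWORDS]

filtered = remove_stopwords(["The", "cat", "is", "in", "the", "hat"])
print(filtered)
```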
Indexing
(S3 p22-30)
Term Weighting
Binary
Frequency-based
Generate word cloud
Normalized frequency
tf-idf
(term frequency–inverse document frequency)
Converting text to a Term Document Matrix (TDM)
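A sketch of building a term-document matrix and a tf-idf weight from tokenized documents. The toy corpus is made up, and the weighting shown (term count divided by document length, times the natural log of N over document frequency) is one common variant among several:

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (assumed data)
docs = [
    ["text", "mining", "text"],
    ["data", "mining"],
    ["text", "data", "analytics"],
]

vocab = sorted({term for doc in docs for term in doc})
counts = [Counter(doc) for doc in docs]

# Term-document matrix: one row per term, one column per document
tdm = [[c[term] for c in counts] for term in vocab]

def tf_idf(term, doc_index):
    """Normalized term frequency times inverse document frequency."""
    tf = counts[doc_index][term] / len(docs[doc_index])
    df = sum(1 for c in counts if term in c)
    return tf * math.log(len(docs) / df)

for term, row in zip(vocab, tdm):
    print(term, row)
print(round(tf_idf("analytics", 2), 4))
```

A binary weighting would replace each count in `tdm` with 1 if the term occurs and 0 otherwise.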
Text Categorization
Supervised
Classification
Hand-coded Classifiers
(S5 p16)
Generative Classifiers
Naïve Bayes Model
(S5 p17-20)
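A minimal multinomial Naïve Bayes sketch with add-one (Laplace) smoothing. The training data and the `predict` helper are hypothetical; words not seen in training are simply ignored, which is one possible design choice:

```python
import math
from collections import Counter, defaultdict

# Toy labeled training set (assumed data)
train = [
    (["cheap", "pills", "buy"], "spam"),
    (["buy", "now", "cheap"], "spam"),
    (["meeting", "agenda", "today"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]

class_docs = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {t for tokens, _ in train for t in tokens}

def predict(tokens):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / len(train))  # log prior
        for t in tokens:
            if t not in vocab:
                continue  # skip words never seen in training
            # Add-one smoothing avoids zero probabilities
            score += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict(["cheap", "buy"]))
print(predict(["meeting", "notes"]))
```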
Discriminative Classifiers
(S5 p21-26)
Decision Tree Classifier
Rocchio Classifier
Support Vector Machines (SVMs)
Unsupervised
Clustering
(S5 p56-59)
K-means
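A bare-bones K-means sketch on 2-D points, which stand in for document vectors. The points and the starting centroids are assumed for illustration; practical implementations pick initial centroids more carefully and stop on convergence rather than after a fixed number of iterations:

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign the point to the centroid with the smallest squared distance
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```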
Topic Modeling
(S5 p69-74)
Latent Dirichlet Allocation (LDA)
Hard Classification
A document cannot be in more than one category
Soft Classification
A document may have more than one category
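The hard/soft distinction can be sketched from a set of per-category scores. The scores and the 0.3 threshold are made-up values for illustration:

```python
# Hypothetical category scores for one document
scores = {"sports": 0.55, "politics": 0.40, "tech": 0.05}

# Hard classification: exactly one category, the best-scoring one
hard = max(scores, key=scores.get)

# Soft classification: every category whose score clears an assumed threshold
THRESHOLD = 0.3
soft = [label for label, p in scores.items() if p >= THRESHOLD]

print(hard)
print(soft)
```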
Dimension Reduction
(S5 p61-66)
Information Extraction
Parts of Speech (POS) Tagging
(S6 p24-27)
Rule-based
Brill’s tagger
Stochastic
Hidden Markov Models (HMM)
(S6 p26-27)
Using NLTK
A sequence labeling task
To determine the grammatical category of a term
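To make the sequence-labeling idea concrete, here is a toy rule-based tagger. The lexicon, the suffix rules, and the tag names are hypothetical; this is not Brill's tagger or an HMM, only a sketch of assigning one tag per token:

```python
# Hypothetical mini-lexicon of known words
LEXICON = {"the": "DET", "a": "DET", "dog": "NOUN", "runs": "VERB"}

def tag(tokens):
    """Assign one part-of-speech tag to each token."""
    tagged = []
    for tok in tokens:
        word = tok.lower()
        if word in LEXICON:
            pos = LEXICON[word]
        elif word.endswith("ly"):
            pos = "ADV"  # crude suffix rule
        elif word.endswith("ing") or word.endswith("ed"):
            pos = "VERB"  # crude suffix rule
        else:
            pos = "NOUN"  # default fallback
        tagged.append((tok, pos))
    return tagged

tagged = tag(["The", "dog", "runs", "quickly"])
print(tagged)
```

A stochastic tagger would instead choose the tag sequence with the highest probability given the whole sentence.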
Named Entity Recognition (NER)
(S6 p13-17)
Identifying key information and classifying it into categories
Examples of categories:
Person
Organization
Place / location
Date / time
Using NLTK or spaCy
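A toy illustration of entity recognition using a gazetteer (name list) plus a regular expression for ISO-style dates. The gazetteer entries and the sample sentence are invented; statistical NER models in libraries such as NLTK or spaCy work very differently and generalize to unseen names:

```python
import re

# Hypothetical gazetteer mapping known names to categories
GAZETTEER = {"Alice": "PERSON", "Acme": "ORGANIZATION", "Paris": "LOCATION"}
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # matches dates like 2021-03-15

def find_entities(text):
    """Return (entity, category) pairs found in the text."""
    entities = [(m.group(), "DATE") for m in DATE_PATTERN.finditer(text)]
    for token in re.findall(r"[A-Za-z]+", text):
        if token in GAZETTEER:
            entities.append((token, GAZETTEER[token]))
    return entities

entities = find_entities("Alice joined Acme in Paris on 2021-03-15.")
print(entities)
```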
Shallow Parsing / Chunking
(S6 p28)
To identify phrases in a text
Rule-Based
Using CoreNLP
TokensRegex
(S6 p49-53)
Using NLTK
Stochastic
Using NLTK or spaCy
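Rule-based chunking can be sketched as a pattern over part-of-speech tags. The example below assumes already-tagged input and a single noun-phrase rule (optional determiner, any adjectives, one or more nouns); the tag names and the `np_chunks` helper are illustrative:

```python
import re

def np_chunks(tagged):
    """Find noun phrases matching DET? ADJ* NOUN+ in a tagged sentence."""
    # Encode each tag as one character so a regex can scan the tag sequence
    code = {"DET": "D", "ADJ": "A", "NOUN": "N"}
    tag_string = "".join(code.get(pos, "O") for _, pos in tagged)
    chunks = []
    for m in re.finditer(r"D?A*N+", tag_string):
        # Character offsets in tag_string line up with token positions
        chunks.append(" ".join(tok for tok, _ in tagged[m.start():m.end()]))
    return chunks

tagged_sentence = [
    ("the", "DET"), ("quick", "ADJ"), ("fox", "NOUN"),
    ("jumped", "VERB"), ("over", "ADP"),
    ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"),
]
chunks = np_chunks(tagged_sentence)
print(chunks)
```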
Deep Parsing / Full Parsing
(S6 p29-33)
WordNet
(S6 p35-39)
Text Mining
Task of transforming unstructured text data into structured (numerical) data to answer business questions
Basic Use Cases
(S2 p13-28)
Extract “meaning” from unstructured text
Automatically put text into categories
Improve accuracy in predictive modeling or unsupervised learning
Identify specific or similar/relevant documents
Extract specific information from the text