ANL312 Text Mining and Applications
Business Analytics research
what Business Analytics research involves
Collect Data
Method
Application
sources of information
Other Organisations
Employer?
Self Collect data
Types of info
Primary Literature (PREFERRED)
Secondary Literature
People (Experts in field)
Organisations (Get Data)
Ways to Search
Online
General search engines (e.g. Google, Yahoo, and Bing)
Academic search engines (e.g. Google Scholar and CiteSeerX)
FAQs (Frequently Asked Questions) and forums
Offline
Library (e.g. books and archives)
Strategy
report writing skills
APA
style to credit
guidelines of report writing
text mining
What
Unstructured Data to Structured Data
Terminology
Text
Corpus
Natural Language Processing
Parsing
This is the thing I wanted to learn!!!
IN SPSS MODELER
1. Tokens (lowest)
PHASE 1
Reading Text Files
collections of characters or abbreviations separated by spaces
Terms
PHASE 2: TEXT PARSING
Parts of speech (POS) Tagging
Stemming
Synonyms
Exclude List
tokens after NLP used to perform parts of speech tagging
Categories (TOP)
PHASE 3: Categorising text
Similar concepts
Concepts
PHASE 2: TEXT PARSING (cont.)
Concept and Type Extraction
single or multiword terms after filtering non-informative terms via POS tags
Importance
(Various uses, not exhaustive)
Information to be gleaned
Convert unstructured to structured
Concept categorization
Document classification
Topic modelling
Sentiment analysis
Document clustering
90% of data is not structured!!
WHERE? Examples
Can provide valuable info!!
Challenges
No Standard Way to write text
Abbreviations and Shortforms
Make interpretation harder
Spelling Errors
Synonyms
Redundant Terms
process flow
Data Collection
Data Parsing
Determining the unit of text analysis
Sentence Level
Paragraph Level
Document Level
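A minimal sketch of the three units of analysis, using only the Python standard library. It assumes paragraphs are separated by blank lines and sentences end in ".", "!" or "?"; the function name split_units is made up for illustration.

```python
import re

def split_units(document: str) -> dict:
    """Split one document into paragraph-level and sentence-level units."""
    # Assumption: a blank line marks a paragraph boundary.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    # Assumption: a sentence ends with ., ! or ? followed by whitespace.
    sentences = [
        s.strip()
        for p in paragraphs
        for s in re.split(r"(?<=[.!?])\s+", p)
        if s.strip()
    ]
    return {
        "document": [document.strip()],
        "paragraph": paragraphs,
        "sentence": sentences,
    }

sample = "Text mining is useful. It finds patterns.\n\nMost data is unstructured."
units = split_units(sample)
print(len(units["document"]), len(units["paragraph"]), len(units["sentence"]))
```

Real text has abbreviations such as "e.g." that break this simple sentence rule, so treat it as a starting point only.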
Tokenisation
is a process of breaking up the units of text, for instance sentences or paragraphs, into individual tokens/words.
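A rough sketch of tokenisation with a regular expression. The pattern is an assumption chosen for illustration: it keeps runs of letters and digits and allows one apostrophe inside a word. SPSS Modeler and other tools use their own rules.

```python
import re

def tokenise(text: str) -> list:
    """Break a unit of text (sentence, paragraph, document) into word tokens."""
    # Letters/digits, optionally followed by an apostrophe part (e.g. "isn't").
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)

tokens = tokenise("Text mining isn't hard, is it?")
print(tokens)
```

Punctuation is dropped here; keep it if sentence structure matters for the analysis.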
Removing stop words
Not always needed
If analysing sentence structure, may need to keep item
Removes common words (a, is, on, the, etc.)
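A small sketch of stop word removal. The stop list below is a tiny made-up sample, not a standard list; real projects usually start from a longer list and adjust it to the domain.

```python
# Hypothetical, deliberately short stop list for illustration.
STOP_WORDS = {"a", "an", "is", "on", "the", "of", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop list (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in stop_words]

filtered = remove_stop_words(["The", "cat", "is", "on", "the", "mat"])
print(filtered)
```

Passing an empty set as stop_words keeps every token, which suits analyses that depend on sentence structure.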
Stemming
Maps all variants of word to root word
Example: Promote
(promoting, promotion, promotes)
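A toy suffix-stripping sketch of the idea, not a real stemming algorithm such as Porter's. The suffix list and the minimum stem length of 3 are assumptions made for this example only.

```python
# Toy suffix list, checked in this order (longer suffixes first).
SUFFIXES = ("ing", "ion", "es", "e", "s")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {w: toy_stem(w) for w in ["promote", "promoting", "promotion", "promotes"]}
print(stems)
```

All four variants map to the same stem, "promot". A stem does not have to be a dictionary word; what matters is that the variants collapse to one form.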
Spelling normalisation
Corrects misspelled words
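One possible sketch of spelling normalisation: match each token against a known vocabulary with the standard library's difflib. The vocabulary and the 0.8 similarity cutoff are assumptions for illustration; tokens with no close match are left unchanged.

```python
import difflib

# Hypothetical domain vocabulary of correctly spelled terms.
VOCABULARY = ["promotion", "customer", "service", "product"]

def normalise_spelling(token, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the token itself if none is close."""
    matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

print(normalise_spelling("custmer"))
print(normalise_spelling("xyz"))
```

A lower cutoff corrects more words but also risks changing valid words that are simply not in the vocabulary.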
Case normalisation
Case normalisation converts the entire document to upper or lower case (so "A" and "a" become the same word).
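A short sketch of the effect of case normalisation on the number of distinct tokens. Lower case is chosen here as an assumption; upper case would work the same way.

```python
def normalise_case(tokens):
    """Convert every token to lower case."""
    return [t.lower() for t in tokens]

raw = ["Apple", "apple", "APPLE", "pie"]
distinct_before = set(raw)
distinct_after = set(normalise_case(raw))
print(len(distinct_before), len(distinct_after))
```

Before normalisation the three spellings of "apple" count as three different words; afterwards they count as one.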
Part of speech tagging
Available in SPSS Modeler
Important to filter away non-informational tags
Example: DT (Determiner: an, any, the, this)
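A sketch of filtering by part-of-speech tag. It assumes the tokens have already been tagged by some tool and that the tags follow the Penn Treebank style (DT = determiner, NN = noun, VBZ = verb). The tagged sample is hand-written, and including IN (preposition) and CC (conjunction) in the drop set is an assumption beyond the DT example in these notes.

```python
# Tags treated as non-informational in this example (assumption).
NON_INFORMATIVE_TAGS = {"DT", "IN", "CC"}

# Hand-tagged sample; a real pipeline would get this from a POS tagger.
tagged = [
    ("the", "DT"),
    ("customer", "NN"),
    ("likes", "VBZ"),
    ("this", "DT"),
    ("product", "NN"),
]

def filter_by_tag(tagged_tokens, drop_tags=NON_INFORMATIVE_TAGS):
    """Keep only words whose tag is not in the drop set."""
    return [word for word, tag in tagged_tokens if tag not in drop_tags]

kept = filter_by_tag(tagged)
print(kept)
```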
Notes on sequencing?
Link on sequencing (Link)
3. Text Filtering
Transformation/Vectorisation
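A minimal bag-of-words sketch of vectorisation: each document becomes a row of term counts, which is one common way to turn unstructured text into structured data. It assumes the documents are already tokenised and filtered; other schemes such as TF-IDF weight the counts differently.

```python
from collections import Counter

def vectorise(docs):
    """Build a document-term count matrix from lists of tokens."""
    # Sorted vocabulary gives every term a fixed column position.
    vocab = sorted({tok for doc in docs for tok in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

docs = [["good", "service"], ["bad", "service", "bad"]]
vocab, matrix = vectorise(docs)
print(vocab)
print(matrix)
```

Each row can then be fed to clustering, classification or other modelling steps.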
Functioning Analysis
CRISP-DM framework
Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment