NLP
Text to Features (Feature Engineering on text data)
Syntactic Parsing
Dependency Trees
Part of speech tagging
used for:
Word sense disambiguation
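For instance, a minimal NLTK sketch of tagging plus Lesk-style disambiguation (assumes the punkt, averaged_perceptron_tagger and wordnet data packages are downloaded; sentence and word are illustrative):
```python
import nltk
from nltk.wsd import lesk

sentence = "I went to the bank to deposit money"
tokens = nltk.word_tokenize(sentence)

# Part-of-speech tagging: each token gets a tag such as NN (noun) or VBD (past-tense verb)
print(nltk.pos_tag(tokens))

# The POS tag can then restrict word sense disambiguation, e.g. with the classic Lesk algorithm
sense = lesk(tokens, "bank", pos="n")  # consider only noun senses of "bank"
print(sense, "->", sense.definition())
```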
main approaches to NLP tasks
(how the Entity Identification phase of extraction is performed)
- Rule-based
tend to focus on pattern-matching or parsing
can often be thought of as "fill in the blanks" methods
are low precision, high recall, meaning they can have high performance in specific use cases, but often suffer performance degradation when generalized
- "Traditional" Machine Learning
training data - in this case, a corpus with markup
training a model on parameters, followed by fitting on test data (typical of machine learning systems in general)
feature engineering - word type, surrounding words, capitalized, plural, etc.
inference (applying model to test data) characterized by finding most probable words, next word, best category, etc.
semantic slot filling
- Neural Networks
Algorithms
Bag of Words
⛔ words are not weighted by their importance
⛔ no semantic meaning
⛔ no context (word order is lost)
⛔ stop words add noise
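A minimal bag-of-words sketch with scikit-learn's CountVectorizer (the toy corpus is illustrative); the output makes the drawbacks above visible, raw counts with no weighting and no word order:
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # document-term count matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())  # raw counts: no weighting, no word order, no context
```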
🔥 TF-IDF
Term Frequency - Inverse Document Frequency
- Wij = Fij * log(N / Di), where Fij is the frequency of term i in document j, N is the total number of documents, and Di is the number of documents containing term i
- no need to remove stop words (IDF already down-weights them)
- only punctuation removal and lowercasing are needed
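A hand-rolled sketch of the weighting above (scikit-learn's TfidfVectorizer uses a smoothed variant of this formula plus L2 normalization; the toy corpus is illustrative):
```python
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # Di: document frequency

for j, doc in enumerate(docs):
    tf = Counter(doc)  # Fij: raw term counts in document j
    weights = {t: f * math.log(N / df[t]) for t, f in tf.items()}
    print(j, weights)  # "the" gets weight 0: it appears in every document
```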
Methods
Tokenization
- ⛔ splitting on whitespace breaks multi-word names (san francisco, new york)
- ⛔ stripping punctuation causes problems (e.g., in biomedical datasets)
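A minimal sketch of both pitfalls (the example strings are illustrative):
```python
import re

print("I flew to san francisco".split())
# ['I', 'flew', 'to', 'san', 'francisco'] -> the city name is split in two

text = "IL-6 levels rose; the p53/MDM2 pathway was active"
print(re.sub(r"[^\w\s]", " ", text).split())
# stripping punctuation turns 'IL-6' into 'IL 6' and 'p53/MDM2' into 'p53 MDM2'
```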
Stop Words Removal
(pre-defined list)
(stop words are noisy)
✅ can usually be safely removed
✅ frees up database space
✅ improves processing time
⚠ There is no universal list of stop words.
⛔ stop-word removal can wipe out relevant information and modify the context of a sentence
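A minimal sketch with NLTK's pre-defined English list (assumes the stopwords corpus has been downloaded); note how removing "not" flips the meaning:
```python
from nltk.corpus import stopwords

stops = set(stopwords.words("english"))

def remove_stops(text):
    return [w for w in text.lower().split() if w not in stops]

print(remove_stops("the movie was good"))      # ['movie', 'good']
print(remove_stops("the movie was not good"))  # ['movie', 'good']
# 'not' is in the list, so the negation (and the sentence's meaning) is wiped out
```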
Word normalization
(affixes can create or expand new forms of the same word, called inflectional affixes, or even create new words, called derivational affixes)
Stemming
✅ improves performance
✅ playing => play
✅ "running", "runs" and "ran" => "run"
⛔ over-stemming: news => new
⛔ Caring => car
Lemmatization
✅ takes the context of the word into account, which helps to solve other problems like disambiguation
⛔ demands more computational power
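A minimal sketch contrasting the two with NLTK (the wordnet corpus is needed for the lemmatizer); Porter and Lancaster are two common stemmers, Lancaster being the more aggressive one:
```python
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

porter = PorterStemmer()
lancaster = LancasterStemmer()
lemmatizer = WordNetLemmatizer()

print(porter.stem("playing"))    # play
print(porter.stem("running"))    # run
print(porter.stem("news"))       # new  (over-stemming: "news" is not "new")
print(lancaster.stem("caring"))  # car  (aggressive suffix chopping)

# The lemmatizer consults a vocabulary and the word's part of speech
# instead of chopping suffixes
print(lemmatizer.lemmatize("caring", pos="v"))  # care
print(lemmatizer.lemmatize("news", pos="n"))    # news (left intact)
```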
Topic modeling
(each document consists of a mixture of topics, and each topic consists of a set of words)
Latent Dirichlet Allocation (LDA)
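A minimal LDA sketch with scikit-learn (the toy corpus and number of topics are illustrative):
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cup of coffee tastes great in the morning",
    "espresso and latte are coffee drinks",
    "the football match ended in a draw",
    "the team scored a late goal in the match",
]
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-3:][::-1]  # indices of the three highest-weight words
    print(f"topic {k}:", [vec.get_feature_names_out()[i] for i in top])
```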
Explicit Semantic Analysis (ESA)
(measures how similar in meaning two words or pieces of text are to each other)
Latent Semantic Analysis (LSA)
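A minimal LSA sketch: truncated SVD over a TF-IDF matrix maps documents into a low-dimensional latent space where similarity can be compared (the toy corpus is illustrative):
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the doctor prescribed medicine to the patient",
    "the physician treated the sick patient",
    "the stock market fell sharply today",
]
X = TfidfVectorizer().fit_transform(docs)
Z = TruncatedSVD(n_components=2).fit_transform(X)  # latent document vectors
print(cosine_similarity(Z))  # docs 0 and 1 should land closer than either to 2
```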
Classifiers
Naive Bayes
(Statistical point of view)
(linear classification)
supervised machine learning algorithm
Maximum Entropy
(a model which is as unbiased as possible)
(linear classification)
Support Vector Machines
(‘simple’ linear classification/regression algorithm)
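A minimal sketch fitting all three on a toy sentiment task; Maximum Entropy is implemented here as scikit-learn's LogisticRegression, which is the same model under another name (the data is illustrative):
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

texts = ["great movie", "terrible film", "loved it", "hated it",
         "wonderful acting", "awful plot"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

X = TfidfVectorizer().fit_transform(texts)
for clf in (MultinomialNB(), LogisticRegression(), LinearSVC()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X[:2]).tolist())
```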
✏ Classifier Evaluation
(F-score): used to compare two different classifiers
F = 2pr / (p + r)
p: precision = number of correctly classified examples / total number of classified examples
r: recall = number of correctly classified examples / actual number of examples in the dataset
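A worked example of the formula, with illustrative numbers:
```python
# Suppose a classifier returns 8 positive labels, 6 of which are correct,
# and the data actually contains 10 positives.
p = 6 / 8            # precision = 0.75
r = 6 / 10           # recall    = 0.60
f = 2 * p * r / (p + r)
print(f)             # 0.666... -- the harmonic mean of precision and recall
```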
TRENDS
Semantic Brand Score (SBS)
Brand importance, measured along three dimensions:
Prevalence: number of times a brand is directly mentioned
Diversity: diversity of the words associated with the brand
Connectivity: the brand's ability to bridge connections between other words or groups of words
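A simplified sketch of the three dimensions on a word co-occurrence network, using networkx. Caveats: the published SBS standardizes and weights each dimension; here diversity is plain degree and connectivity is plain betweenness centrality, and the texts and brand name are illustrative:
```python
import itertools
import networkx as nx

texts = [
    "acme coffee tastes great",
    "great service at acme",
    "acme coffee beats rival coffee",
]
brand = "acme"

G = nx.Graph()
for doc in texts:
    words = set(doc.split())
    G.add_edges_from(itertools.combinations(words, 2))  # co-occurrence edges

prevalence = sum(doc.split().count(brand) for doc in texts)  # direct mentions
diversity = G.degree(brand)                                  # distinct associated words
connectivity = nx.betweenness_centrality(G)[brand]           # bridging role

print(prevalence, diversity, connectivity)
```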
Sentence Segmentation
Predicting Parts of Speech for Each Token
(verb, adverb, noun, etc.)
Dependency Parsing
builds a tree that assigns a single parent word to each word in the sentence
and predicts the type of relationship between each word and its parent
Finding Noun Phrases
Named Entity Recognition (NER)
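A minimal spaCy sketch covering these pipeline steps (assumes the small English model is installed: python -m spacy download en_core_web_sm):
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("London is the capital of the United Kingdom. I was born there.")

for sent in doc.sents:                     # sentence segmentation
    print(sent.text)
for token in doc[:5]:                      # POS tags and dependency relations
    print(token.text, token.pos_, token.dep_, token.head.text)
print([chunk.text for chunk in doc.noun_chunks])      # noun phrases
print([(ent.text, ent.label_) for ent in doc.ents])   # named entities
```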