NLP

Text to Features (Feature Engineering on text data)

Syntactic Parsing

Dependency Trees

Part of speech tagging

used for:

Word sense disambiguation
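A minimal POS-tagging sketch with NLTK (the sentence and tagger model are illustrative assumptions); the tag assigned to "book" shifts with its syntactic role, which is one signal word sense disambiguation can use:

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("I will book a flight after I read this book")
print(nltk.pos_tag(tokens))
# "book" after the modal "will" is typically tagged VB (verb);
# the second "book" is typically tagged NN (noun)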

main approaches to NLP tasks
(how the Entity Identification
phase of extraction is performed.)

  1. Rule-based

tend to focus on pattern-matching or parsing

can often be thought of as "fill in the blanks" methods

are high precision, low recall, meaning they can perform well in the specific use cases they target, but often suffer performance degradation when generalized
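A toy sketch of the pattern-matching, fill-in-the-blanks idea (the pattern and sentence below are made-up examples, not from a real system):

import re

# a fixed template: whatever capitalized token follows "acquired"
# fills the "acquired company" slot
pattern = re.compile(r"acquired\s+([A-Z][A-Za-z]+)")
print(pattern.findall("BigCorp acquired SmallStartup last year."))
# ['SmallStartup'] - precise on this template, but recall suffers on
# any phrasing the rule does not anticipate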

  1. "Traditional" Machine Learning

training data - in this case, a corpus with markup

training a model on parameters, followed by fitting on test data (typical of machine learning systems in general)

feature engineering - word type, surrounding words, capitalized, plural, etc.

inference (applying the model to test data), characterized by finding the most probable words, next word, best category, etc.

semantic slot filling

  3. Neural Networks

Algorithms

Bag of Words

⛔ some words are not weighted appropriately
⛔ no semantic meaning
⛔ no context
⛔ stop words add noise
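A minimal bag-of-words sketch using scikit-learn's CountVectorizer (toy documents assumed); the vectors keep only counts, so word order, semantics, and context are lost, as listed above:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vec = CountVectorizer()
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # vocabulary
print(X.toarray())                  # raw counts per document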

🔥 TF-IDF
Term Frequency - Inverse Document Frequency

  • Wij = Fij * log(N / Di), where Fij is the count of term i in document j, N is the total number of documents, and Di is the number of documents containing term i
  • No need to remove stop words
  • Only need to strip punctuation and lowercase the text
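A from-scratch sketch of the formula above (toy corpus assumed); only punctuation stripping and lowercasing are applied, no stop-word removal:

import math
import re
from collections import Counter

docs = ["The cat sat.", "The dog sat!", "The cat ran."]
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

N = len(tokenized)
df = Counter()                       # Di: documents containing term i
for doc in tokenized:
    df.update(set(doc))

for j, doc in enumerate(tokenized):
    tf = Counter(doc)                # Fij: count of term i in document j
    weights = {t: f * math.log(N / df[t]) for t, f in tf.items()}
    print(j, weights)                # "the" gets weight 0: it appears in every document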

Methods

Tokenization

  • ⛔ splitting on spaces breaks multi-word names (san francisco, new york)
  • ⛔ stripping punctuation is a problem for biomedical datasets
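A quick illustration of both pitfalls (the sentence is an assumed example): naive splitting breaks "New York" into two tokens, and stripping punctuation mangles biomedical names like "IL-6":

import re

text = "I flew to New York; IL-6 levels rose."
print(text.split())                    # punctuation stays glued to tokens
print(re.findall(r"[A-Za-z]+", text))  # 'New'/'York' split apart, 'IL-6' becomes 'IL', '6'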

Stop Words Removal
(pre-defined list)
(stop words are noise)

✅ can be safely removed

✅ freeing up database space

✅ improving processing time.

⚠ There is no universal list of stop words.

⛔ Stop word removal can wipe out relevant
information and modify the context of a sentence
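A sketch with NLTK's pre-defined English list (assumed example sentence); "not" is on the list, so removal flips the sentiment context, the failure mode noted above:

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stops = set(stopwords.words("english"))

tokens = "this movie was not good".split()
print([t for t in tokens if t not in stops])  # ['movie', 'good'] - negation lost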

Word normalization

Lemmatization

✅ "running", "runs" and "ran" => "run"

✅ takes the context of the word into account,
which helps solve other problems like disambiguation

⛔ demands more computational power

Stemming

✅ improves performance (cheap, rule-based suffix stripping)

✅ playing => play
⛔ news => new
⛔ caring => car
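The examples above, reproduced with NLTK's PorterStemmer and WordNetLemmatizer (model downloads assumed available):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for w in ["playing", "news", "caring"]:
    print(w, "->", stemmer.stem(w))   # play, new, car - no context used

# the lemmatizer takes a POS hint ("v" = verb) as context
print(lemmatizer.lemmatize("ran", pos="v"))     # run
print(lemmatizer.lemmatize("caring", pos="v"))  # care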

Topic modeling
(assumes each document consists of a mixture of topics
and each topic consists of a set of words)

Latent Dirichlet Allocation (LDA)


⛔ affixes can create new forms of the same word (inflectional affixes), or
even create new words themselves (derivational affixes)
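A small LDA sketch with scikit-learn (toy corpus assumed): each document comes back as a mixture over topics, each topic as a distribution over words:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat chased the mouse",
    "dogs and cats make good pets",
    "the stock market fell today",
    "investors sold shares as markets dropped",
]
X = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))  # one topic-mixture row per document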

Explicit Semantic Analysis (ESA)
(how similar in meaning two words
or pieces of text are to each other)

Latent Semantic Analysis (LSA)
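A minimal LSA sketch (assumed toy corpus): factor a TF-IDF matrix with truncated SVD so documents land in a low-dimensional latent space where proximity approximates similarity in meaning:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat", "cats and dogs", "stocks fell sharply", "markets dropped today"]
X = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
print(lsa.fit_transform(X))  # one 2-d semantic vector per document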

Classifiers

Naive Bayes
(Statistical point of view)
(linear classification)
supervised machine learning algorithm
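A minimal supervised Naive Bayes text classifier (sentences and labels are assumed toy data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train = ["great movie", "loved it", "terrible film", "awful acting"]
labels = ["pos", "pos", "neg", "neg"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train, labels)
print(model.predict(["a truly great movie"]))  # expected: ['pos']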

Maximum Entropy
(a model which is as unbiased as possible)
(linear classification)

Support Vector Machines
(‘simple’ linear classification/regression algorithm)
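The same toy setup with a linear SVM; swapping in LogisticRegression would similarly stand in for a Maximum Entropy classifier:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train = ["great movie", "loved it", "terrible film", "awful acting"]
labels = ["pos", "pos", "neg", "neg"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train, labels)
print(clf.predict(["awful, terrible acting"]))  # expected: ['neg']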

Classifier Evaluation
F-score: used to compare two different classifiers
F = 2pr / (p + r)
p: precision = number of correctly classified examples / total number of classified examples
r: recall = number of correctly classified examples /
the actual number of examples in the test set
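A quick worked check of the formula (the numbers are assumed for illustration):

p, r = 0.8, 0.6
f = 2 * p * r / (p + r)
print(round(f, 3))  # 0.686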

TRENDS

Semantic Brand Score (SBS)

Brand importance

Prevalence: number of times a brand is directly mentioned

Diversity: diversity of the words associated with the brand

Connectivity: the brand's ability to bridge connections between other words or groups of words
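A hedged sketch of the three dimensions on a word co-occurrence graph; using degree for Diversity and betweenness centrality for Connectivity is an assumed operationalization, not a spec from these notes:

import itertools
import networkx as nx

brand = "acme"
docs = [
    "acme shoes are comfortable",
    "acme released new running shoes",
    "marathon runners love acme gear",
]

G = nx.Graph()
for doc in docs:
    for a, b in itertools.combinations(set(doc.split()), 2):
        G.add_edge(a, b)  # words co-occurring in a document

prevalence = sum(doc.split().count(brand) for doc in docs)   # direct mentions
diversity = G.degree(brand)                                  # distinct associated words
connectivity = nx.betweenness_centrality(G)[brand]           # bridging role
print(prevalence, diversity, connectivity)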

Sentence Segmentation

Predicting Parts of Speech for Each Token
(verb, adverb, noun, etc.)

Dependency Parsing
build a tree that assigns a single parent word to
each word in the sentence

predict the type of relationship that exists between each word and its parent

Finding Noun Phrases

Named Entity Recognition (NER)
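One spaCy pass covers the whole pipeline sketched above; the en_core_web_sm model and the example sentence are assumptions:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("London is the capital of the United Kingdom. I visited last May.")

print([sent.text for sent in doc.sents])                      # sentence segmentation
print([(t.text, t.pos_, t.dep_, t.head.text) for t in doc])   # POS tags + dependency parse
print([chunk.text for chunk in doc.noun_chunks])              # noun phrases
print([(ent.text, ent.label_) for ent in doc.ents])           # named entities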