Document similarity - Coggle Diagram
Document similarity
Words
Pipeline
Clean data
Tokenize
Normalise
Stopword removal
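The four pipeline steps above can be sketched in a few lines. This is a minimal illustration, not a fixed recipe: the `preprocess` name, the tag-stripping regex, and the tiny `STOPWORDS` set are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"a", "an", "and", "the", "is", "of", "on", "to", "in"}

def preprocess(text):
    """Clean, tokenize, normalise, and remove stopwords from a string."""
    text = re.sub(r"<[^>]+>", " ", text)          # clean: strip HTML-like tags
    tokens = re.findall(r"[A-Za-z0-9']+", text)   # tokenize: runs of word characters
    tokens = [t.lower() for t in tokens]          # normalise: lowercase
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

print(preprocess("<p>The cat sat on the mat.</p>"))
```

Real pipelines often add stemming or lemmatisation to the normalise step; that is left out here to keep the sketch short.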
Set similarity
Jaccard Similarity
Sørensen-Dice coefficient
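Both set measures compare the sets of distinct words in two documents: Jaccard is |A ∩ B| / |A ∪ B| and Sørensen-Dice is 2|A ∩ B| / (|A| + |B|). A minimal sketch follows; the function names are illustrative, and returning 1.0 for two empty sets is an assumption, not a standard.

```python
def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)

def sorensen_dice(a, b):
    """Sørensen-Dice: twice the intersection over the sum of set sizes."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # same assumption as above
    return 2 * len(a & b) / (len(a) + len(b))

doc1 = ["cat", "sat", "mat"]
doc2 = ["cat", "sat", "hat"]
print(jaccard(doc1, doc2), sorensen_dice(doc1, doc2))
```

For the same pair of sets, Dice is never lower than Jaccard, because it weights the shared items twice.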
One hot encoding
Sparse data
Sparse representation
Use an <unk> token for unknown (out-of-vocabulary) words
Doesn't include punctuation
Doesn't account for how many times each word is used
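A minimal sketch of document-level one hot encoding, assuming the tokens are already preprocessed. The `<unk>` entry at index 0 and the helper names are choices made for this example. The dense form stores one 0/1 slot per vocabulary word; the sparse form stores only the indices that are set.

```python
def build_vocab(docs):
    """Map each distinct token to an index; index 0 is reserved for <unk>."""
    vocab = {"<unk>": 0}
    for tokens in docs:
        for t in tokens:
            if t not in vocab:
                vocab[t] = len(vocab)
    return vocab

def one_hot(tokens, vocab):
    """Dense 0/1 vector: 1 if the word occurs at all, regardless of count."""
    vec = [0] * len(vocab)
    for t in tokens:
        vec[vocab.get(t, vocab["<unk>"])] = 1
    return vec

def one_hot_sparse(tokens, vocab):
    """Sparse form: sorted list of the indices that would be 1."""
    return sorted({vocab.get(t, vocab["<unk>"]) for t in tokens})

vocab = build_vocab([["cat", "sat"], ["cat", "mat"]])
print(one_hot(["cat", "cat", "dog"], vocab))         # "dog" maps to <unk>
print(one_hot_sparse(["cat", "cat", "dog"], vocab))
```

Note that "cat" appearing twice still yields a single 1, which is the count limitation listed above.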
Bag of words/Term frequency
Dense Representation
Sparse Representation
Sublinear TF scaling (apply log to TF)
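A minimal sketch of bag of words counts in sparse and dense form, plus sublinear scaling. The formula 1 + log(tf) is one common convention for "apply log to TF" and is an assumption here; the function names and the fixed `vocab` list are illustrative.

```python
import math
from collections import Counter

def term_frequency(tokens):
    """Sparse bag of words: only words that occur are stored."""
    return Counter(tokens)

def dense_tf(tokens, vocab):
    """Dense bag of words: one count per vocabulary entry, zeros included."""
    counts = Counter(tokens)
    return [counts.get(term, 0) for term in vocab]

def sublinear_tf(tokens):
    """Sublinear scaling, assuming the 1 + log(tf) convention."""
    return {t: 1.0 + math.log(c) for t, c in Counter(tokens).items()}

tokens = ["cat", "cat", "mat"]
print(term_frequency(tokens))
print(dense_tf(tokens, ["cat", "sat", "mat"]))
print(sublinear_tf(tokens))
```

The log dampens repeated words: a term used 100 times does not count 100 times as much as a term used once.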
Document frequency
inverse document frequency
apply log
tf-idf
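A minimal sketch of how document frequency, inverse document frequency, and tf-idf fit together. It assumes the unsmoothed form idf(t) = log(N / df(t)) with raw counts as TF; libraries often use smoothed variants, so exact values will differ. Function names are illustrative.

```python
import math
from collections import Counter

def document_frequency(docs):
    """Number of documents each term appears in (not total occurrences)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def inverse_document_frequency(docs):
    """Unsmoothed idf: log(N / df). Assumption: no smoothing term."""
    n = len(docs)
    return {t: math.log(n / d) for t, d in document_frequency(docs).items()}

def tf_idf(tokens, idf):
    """Weight each raw term count by its idf; unseen terms are skipped."""
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}

docs = [["cat", "sat"], ["cat", "mat"]]
idf = inverse_document_frequency(docs)
print(tf_idf(["cat", "sat", "sat"], idf))
```

A word that appears in every document ("cat" here) gets idf = log(1) = 0, so it contributes nothing to the weighted vector.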
Letter Sequence similarity
bigrams
trigrams
ngrams
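A minimal sketch of letter sequence similarity: split each string into overlapping character n-grams (n = 2 for bigrams, n = 3 for trigrams) and compare the resulting sets. Using Jaccard for the comparison is an assumption for this example, and the function names are illustrative.

```python
def char_ngrams(text, n):
    """Set of overlapping character n-grams; empty if the text is shorter than n."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a, b, n=2):
    """Jaccard similarity over character n-gram sets (an assumed choice)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0  # assumption: two empty n-gram sets count as identical
    return len(ga & gb) / len(ga | gb)

print(sorted(char_ngrams("night", 2)))
print(ngram_similarity("night", "nacht", n=2))
```

Because it works on letters rather than whole words, this approach tolerates misspellings and inflected forms that word-level set similarity treats as unrelated.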
Letter similarity