Finding Similar Items
Shingling
convert documents
into sets
k-gram
short documents: small k, e.g. k=5
long documents: larger k, e.g. k=10
Similarity metric
Jaccard similarity
number of items the two sets share (intersection) /
total number of distinct items across both sets (union)
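A minimal sketch of the Jaccard similarity between two sets, to make the ratio concrete. The function name `jaccard` and the convention of returning 1.0 for two empty sets are assumptions made for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Return |A intersect B| / |A union B| for two sets."""
    # Assumption: two empty sets are treated as identical.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two hypothetical shingle sets sharing 2 of 4 distinct shingles.
s1 = {"ab", "bc", "ca"}
s2 = {"ab", "bc", "cd"}
similarity = jaccard(s1, s2)
```

Here `s1` and `s2` share `ab` and `bc` out of the four distinct shingles `ab`, `bc`, `ca`, `cd`.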
example
k=2, bag=false
abcab
ab
hash(ab)
1
ca
hash(ca)
7
bc
hash(bc)
5
k=2, bag=true
abcab
bc
ca
ab
ab
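The example above can be sketched in a few lines. This is an illustrative helper, not a reference implementation: the name `shingles` and its `bag` parameter are assumptions that mirror the `k` and `bag` settings in the example. The hash values shown in the example (1, 7, 5) are illustrative, so the sketch leaves hashing out.

```python
from collections import Counter


def shingles(text: str, k: int, bag: bool = False):
    """Return the character k-grams of text as a set, or as a bag (multiset) if bag=True."""
    grams = [text[i:i + k] for i in range(len(text) - k + 1)]
    # A set keeps each k-gram once; a Counter keeps how often each one occurs.
    return Counter(grams) if bag else set(grams)


as_set = shingles("abcab", 2)            # ab, bc, ca
as_bag = shingles("abcab", 2, bag=True)  # ab twice, bc once, ca once
```

With `bag=False` the repeated `ab` collapses into one element; with `bag=True` its count of 2 is kept.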
Min Hashing
convert sets
into signatures
while preserving
similarity
pipeline
encode sets as bit vectors (one column per set)
find similar columns by comparing small signatures
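A rough sketch of how a min-hash signature can be built and compared. Everything here is an assumption for illustration: the function names, the use of random linear hash functions `(a*x + b) mod p` in place of explicit row permutations, and `zlib.crc32` as a stable way to turn shingles into integers. The input set is assumed to be non-empty.

```python
import random
import zlib

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, used as the hash modulus


def minhash_signature(items, num_hashes=100, seed=0):
    """Return a list of num_hashes minimum hash values for a non-empty set of strings."""
    rng = random.Random(seed)  # same seed gives the same hash functions for every set
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(num_hashes)]
    # Map each shingle to an integer id that is stable across runs.
    ids = [zlib.crc32(x.encode("utf-8")) for x in items]
    # For each hash function, keep the smallest hashed id in the set.
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimate_similarity(sig1, sig2):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


sig_a = minhash_signature({"ab", "bc", "ca"})
sig_b = minhash_signature({"ab", "bc", "cd"})
estimate = estimate_similarity(sig_a, sig_b)
```

The fraction of agreeing positions is only an estimate of the Jaccard similarity; it becomes more reliable as `num_hashes` grows, at the cost of a longer signature.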
Locality-Sensitive Hashing
focus on pairs of signatures likely to be similar
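One common way to pick out likely-similar pairs is the banding technique: split each signature into bands and treat two documents as a candidate pair if they agree on every row of at least one band. The sketch below assumes that approach, which the outline itself does not spell out; the function name, the `bands` and `rows` parameters, and the toy signatures are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, bands, rows):
    """Return pairs of document ids whose signatures match in at least one band.

    signatures maps a document id to a signature of length bands * rows.
    """
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            # Documents with an identical slice for this band share a bucket.
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        for docs in buckets.values():
            for pair in combinations(sorted(docs), 2):
                candidates.add(pair)
    return candidates


# Toy signatures: d1 and d2 agree on the first band only.
toy = {"d1": [1, 2, 3, 4], "d2": [1, 2, 9, 9], "d3": [7, 8, 5, 6]}
pairs = lsh_candidate_pairs(toy, bands=2, rows=2)
```

Only the candidate pairs then need a full signature comparison, which avoids comparing every pair of documents. More bands with fewer rows make the filter more permissive; fewer bands with more rows make it stricter.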