Provost CH 6
Similarity
Data Mining
Grouping things by similarity or searching for the right sort of similarity
May want to retrieve similar things directly
Classifying and regression applications
Similar things into clusters
How can a customer be like a movie?: "Taste dimensions"
Both classification trees and linear classifiers establish boundaries between regions of differing classifications
Need a basic method for measuring similarity:
Euclidean Distance
Pythagorean theorem
Essentially, we can compute the overall distance by combining the distances along the individual dimensions (the individual features)
Extends to n dimensions = general Euclidean distance
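A minimal sketch of the general Euclidean distance, assuming each object is a list of numeric features of equal length. The function name euclidean_distance is made up for illustration.

```python
import math

def euclidean_distance(a, b):
    """General Euclidean distance between two equal-length numeric feature vectors."""
    if len(a) != len(b):
        raise ValueError("both points need the same number of dimensions")
    # Square the difference along each dimension, sum, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two dimensions: the familiar 3-4-5 right triangle from the Pythagorean theorem.
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
# The same function works unchanged with more dimensions.
print(euclidean_distance([1, 2, 2], [0, 0, 0]))  # 3.0
```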
Heterogeneous Attributes:
Adds more complication
When a categorical attribute has multiple values, a simple numeric encoding will not be good enough
Nearest-neighbor-based systems often have variable-scaling front ends for this reason
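One possible way to prepare heterogeneous attributes before measuring distance, sketched under assumptions: min-max scaling for numeric columns and one 0/1 column per category for categorical ones. Both choices, the helper names, and the customer data are illustrative and not taken from the chapter.

```python
def min_max_scale(values):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Turn a categorical column into one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]

# Hypothetical customers: age in years, income in dollars, housing status.
ages = [25, 40, 55]
incomes = [30000, 60000, 90000]
housing = ["rent", "own", "other"]

scaled_ages = min_max_scale(ages)        # [0.0, 0.5, 1.0]
scaled_incomes = min_max_scale(incomes)  # [0.0, 0.5, 1.0]
encoded_housing = one_hot(housing)       # columns ordered: other, own, rent

# One numeric feature vector per customer, ready for a distance function.
rows = [[a, i] + h for a, i, h in zip(scaled_ages, scaled_incomes, encoded_housing)]
print(rows[0])  # [0.0, 0.0, 0.0, 0.0, 1.0]
```

Without the scaling step, a difference of a few thousand dollars in income would swamp any difference in age.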
Clusters
Contrast with supervised segmentation: finding groups of objects that differ with respect to some target characteristic of interest
Do my customers fall into different groups?
Process of finding natural groupings: unsupervised segmentation, or simply clustering
Hierarchical clustering
group points by their similarity
Clusters overlap only when one cluster contains other clusters
Builds from the level of individual points up to a single cluster that holds all the points
Bottom graph is a dendrogram
Shows explicitly the hierarchy of the clusters
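A rough sketch of bottom-up hierarchical clustering, assuming single-link distance (closest pair of points) as the way to compare clusters. That linkage choice, the function names, and the sample points are illustrative assumptions. The recorded merge distances correspond to the heights at which branches would join in a dendrogram.

```python
import math

def single_link(c1, c2):
    """Distance between two clusters = distance between their closest pair of points."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j], single_link(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical 2-D points: two tight pairs and one outlier.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
merges = agglomerate(points)
for left, right, height in merges:
    print(left, "+", right, "at distance", round(height, 2))
```

The last merge produces the single cluster that contains every point.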
Nearest neighbor clustering
centroids
Called k-means clustering
Centroid = arithmetic mean (average) of the values along each dimension for the instances in the cluster
k = the number of clusters to find in the data
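A minimal k-means sketch, assuming the starting centroids are supplied by the caller (real implementations usually pick them at random and rerun several times). The function names and sample points are illustrative.

```python
import math

def centroid(cluster):
    """Arithmetic mean along each dimension of the points in a cluster."""
    return tuple(sum(dim) / len(cluster) for dim in zip(*cluster))

def k_means(points, initial_centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    centroids = list(initial_centroids)  # k = len(initial_centroids)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        new_centroids = [centroid(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data with two obvious groups, so k = 2.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, initial_centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```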
Prior to clustering: data preparation
The "Dictionary of Distances" by Deza and Deza
Lists tons of distance functions
Edit distance or the Levenshtein metric
how many single-character edits (insertions, deletions, or substitutions of letters or numbers) it would take to change one string into another
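Edit distance can be computed with a standard dynamic-programming table. The sketch below keeps only one row of the table at a time; the function name levenshtein is an illustrative choice.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions turning a into b."""
    # previous[j] = edit distance between the first i-1 characters of a and the first j of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning the first i characters of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("flaw", "lawn"))       # 2
```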
Neighbors
Nearest neighbor reasoning
uses nearest neighbors for predictive analysis
No simple answer to how many neighbors should be used (the k in k-NN); it depends on the problem
Weighted voting, or similarity-moderated voting
each neighbor's influence is scaled by its similarity to the new example
Can conduct cross-validation or other nested holdout testing on the training set for a variety of different values of k, searching for the one that performs best
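A sketch of similarity-moderated voting combined with a search over k. It assumes a vote weight of 1 / (1 + distance) and leave-one-out cross-validation as the holdout scheme; both are illustrative choices among several reasonable ones, and the training data is made up.

```python
import math
from collections import defaultdict

def knn_predict(train, query, k):
    """Each of the k nearest neighbors votes for its class with weight 1 / (1 + distance)."""
    neighbors = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        votes[label] += 1.0 / (1.0 + math.dist(features, query))
    return max(votes, key=votes.get)

def leave_one_out_accuracy(train, k):
    """Hold out each training example in turn and predict it from the rest."""
    hits = 0
    for i, (features, label) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        hits += knn_predict(rest, features, k) == label
    return hits / len(train)

def choose_k(train, candidates):
    """Return the candidate k with the best leave-one-out accuracy (first one wins ties)."""
    return max(candidates, key=lambda k: leave_one_out_accuracy(train, k))

# Hypothetical training set: (features, class label).
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"),
    ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"), ((9, 9), "B"),
]
print(knn_predict(train, (1.5, 1.5), k=3))
for k in (1, 3, 5):
    print(k, leave_one_out_accuracy(train, k))
print(choose_k(train, [1, 3, 5]))
```

On data this cleanly separated every k scores the same, so the search only becomes informative on noisier data.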
Classification
target variables
classes
weight of the variable on the data
Can choose the weights so that variables have more predictive ability
It's important not just to classify a new example but also to estimate its probability of belonging to each class
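One simple way to turn nearest neighbors into a probability estimate is to report the fraction of the k neighbors that fall in each class. The sketch below assumes unweighted neighbors and made-up data; the function name is illustrative.

```python
import math
from collections import Counter

def knn_class_probabilities(train, query, k):
    """Estimate class probabilities as the share of the k nearest neighbors in each class."""
    neighbors = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    counts = Counter(label for _, label in neighbors)
    return {label: count / len(neighbors) for label, count in counts.items()}

# Hypothetical training set: (features, did the customer respond?).
train = [
    ((1, 1), "yes"), ((2, 1), "yes"),
    ((3, 3), "no"), ((6, 6), "no"), ((7, 7), "no"),
]
probs = knn_class_probabilities(train, (2, 2), k=3)
print(probs)  # two of the three nearest neighbors are "yes"
```

With small k these estimates are coarse: k = 3 can only produce 0, 1/3, 2/3, or 1.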
Issues:
In some fields of work nearest neighbor methods will be acceptable, and in others they won't
Justification of a specific decision vs. intelligibility of the entire model
i.e., explaining a single case by its neighbors is enough for some uses, while others need the model as a whole to be understood
curse of dimensionality
irrelevant variables
asking what variables are relevant
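A toy illustration of how an irrelevant variable can mislead a distance calculation. The data and the assumption that only the first feature is relevant are made up for the example.

```python
import math

def nearest(points, query, dims):
    """Name of the point closest to the query, using only the listed feature indexes."""
    def dist(p):
        return math.sqrt(sum((p[d] - query[d]) ** 2 for d in dims))
    return min(points, key=lambda name: dist(points[name]))

# Feature 0 is assumed relevant; feature 1 is assumed to be irrelevant noise.
points = {"a": (1.1, 10.0), "b": (3.0, 0.5)}
query = (1.0, 0.0)

print(nearest(points, query, dims=[0]))     # a: closest on the relevant feature
print(nearest(points, query, dims=[0, 1]))  # b: the irrelevant feature dominates the distance
```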