Ch 6 Similarity and Clusters
Fundamental concepts: Calculating similarity of objects described by data; Using similarity for prediction; Clustering as similarity-based segmentation
Exemplary techniques: Searching for similar entities; Nearest neighbor methods; Clustering methods; Distance metrics for calculating similarity
Similarity and Distance
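Euclidean distance: a common measure of the distance between two points, computed as the square root of the sum of squared differences across their attributes
Used to compare feature vectors: the smaller the distance between two instances, the more similar they are, so the similarity of one pair of instances can be compared to that of another pair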
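A minimal sketch of the distance calculation described above, assuming each instance is a tuple of numeric attributes. The function name and the three example instances are hypothetical, chosen only for illustration.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of attributes")
    # Square the difference on each attribute, sum, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical instances described by two numeric attributes each.
person_a = (23, 2)
person_b = (40, 10)
person_c = (25, 3)

d_ab = euclidean_distance(person_a, person_b)
d_ac = euclidean_distance(person_a, person_c)

# The pair with the smaller distance is the more similar pair.
print("a-b:", d_ab, "a-c:", d_ac)
```

Here the pair (a, c) comes out closer than the pair (a, b), so a and c would be treated as the more similar instances.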
Nearest Neighbor reasoning
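A data analysis approach that finds the instances most similar to a new instance (its nearest neighbors) and uses their known target values to make a prediction.
Important to recognize whether category values are mutually exclusive or not
How many neighbors should be used?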
Should they have equal weights in the combining function?
Curse of dimensionality: When too many attributes are factored into the distance calculation, irrelevant attributes can swamp the relevant ones and make instance similarities misleading.
One way to mitigate is proper feature selection: the judicious determination of features that should be included in the data mining model
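Scaling concerns: attributes measured in different units contribute unequally to the distance; a difference of $10 of income is not comparable to a difference of 10 years of life.
Estimating a class probability from the neighbors, as opposed to only reporting the majority class, often gives more useful predictions because it also shows how strongly the neighbors agree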
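The questions above (how many neighbors, equal or unequal weights, probability versus majority) can be sketched in a few lines. This is an illustrative sketch, not a reference implementation: the function names, the inverse-squared-distance weighting, and the toy data are all assumptions, and the features are assumed to be on comparable scales already.

```python
import math
from collections import defaultdict

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_class_probabilities(query, examples, k=3, weighted=False):
    """Estimate class probabilities for `query` from its k nearest neighbors.

    examples: list of (feature_vector, label) pairs.
    weighted=False gives every neighbor an equal vote.
    weighted=True weights each vote by 1 / (distance^2 + small constant),
    one possible weighting scheme among many (an assumption of this sketch).
    """
    if not 1 <= k <= len(examples):
        raise ValueError("k must be between 1 and the number of examples")
    # Keep the k examples closest to the query.
    neighbors = sorted(examples, key=lambda ex: euclidean_distance(query, ex[0]))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        d = euclidean_distance(query, features)
        votes[label] += 1.0 / (d * d + 1e-9) if weighted else 1.0
    total = sum(votes.values())
    # Normalize the votes so they sum to 1 and can be read as probabilities.
    return {label: v / total for label, v in votes.items()}

def knn_predict(query, examples, k=3, weighted=False):
    """Majority (or weighted-majority) class among the k nearest neighbors."""
    probs = knn_class_probabilities(query, examples, k, weighted)
    return max(probs, key=probs.get)

# Hypothetical labeled instances.
examples = [
    ((1.0, 1.0), "yes"),
    ((1.2, 0.8), "yes"),
    ((2.0, 2.0), "no"),
    ((3.0, 3.0), "no"),
    ((3.2, 2.9), "no"),
]
query = (1.4, 1.4)

equal_probs = knn_class_probabilities(query, examples, k=3)
weighted_probs = knn_class_probabilities(query, examples, k=3, weighted=True)
prediction = knn_predict(query, examples, k=3)
print(prediction, equal_probs, weighted_probs)
```

With k=3 and equal weights, two of the three nearest neighbors are "yes", so the estimate is 2/3 "yes" rather than a bare "yes". Changing k or turning on weighting changes the estimate, which is why both choices matter.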
Centroid: the center of a cluster (the mean of its members), used to represent the cluster
Hierarchical clustering: creates a collection of ways to group the points at different levels of granularity, not just one set of clusters, which allows you to see the data "landscape"
K-means: the "means" are the centroids of the clusters, each calculated by averaging all the x values and then averaging all the y values of the cluster's points to get its coordinates (and likewise for any further dimensions). K is the number of clusters one wishes to find in the data.
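A minimal sketch of the k-means idea described above: assign each point to its nearest centroid, then recompute each centroid as the per-coordinate average of its cluster, and repeat. The function names, the random initialization from k data points, the fixed seed, and the toy points are assumptions made for illustration. It uses `math.dist`, which needs Python 3.8 or later.

```python
import math
import random

def centroid(points):
    """Average each coordinate across the points (mean x, mean y, ...)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iters=100, seed=0):
    """Return (centroids, clusters) after iterating assignment and update steps."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points picked at random.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            centroid(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the centroids no longer move
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```

The result depends on the starting centroids in general, so in practice the procedure is often run several times with different initializations.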