Ch 6 Similarity and Clusters
Fundamental concepts: Calculating similarity of objects described by data; Using similarity for prediction; Clustering as similarity-based segmentation
Exemplary techniques: Searching for similar entities; Nearest neighbor methods; Clustering methods; Distance metrics for calculating similarity
Similarity and Distance
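Euclidean distance: the most common measure of distance between two points, computed as the square root of the sum of squared differences across their attributes
Used to compare instances represented as feature vectors: the smaller the distance, the more similar the pair, so the similarity of one pair of instances can be compared with that of another pair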
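As an illustration of the distance calculation, here is a minimal sketch in Python. The function name euclidean_distance and the sample points are assumptions made for this example, not part of the original notes.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared attribute differences."""
    if len(a) != len(b):
        raise ValueError("both instances need the same number of attributes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical instances described by two numeric attributes each.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

A smaller result means the two instances are more similar under this measure.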
Nearest Neighbor reasoning
Predicts the target value of a new instance from the most similar instances (its nearest neighbors) in the training data.
Important to recognize whether category values are mutually exclusive.
Important issues:
How many neighbors should be used?
Should they have equal weights in the combining function?
Curse of dimensionality: factoring too many attributes into the distance calculation, especially irrelevant ones, can confuse and mislead similarity judgments.
One way to mitigate this is proper feature selection: the judicious determination of which features should be included in the data mining model.
Scaling concerns: attributes measured in different units contribute unequally to the distance. A difference of $10 in income is not comparable to a difference of 10 years of age.
Estimating a class probability from the neighbors, rather than only reporting the majority class, is often more useful because it conveys how confident the prediction is.
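The issues above (number of neighbors, weighting, scaling, probability estimates) can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the helper names fit_min_max and knn_class_probabilities, the min-max rescaling, the inverse-squared-distance weighting, and the income/age data are all choices made for this example, not something the notes prescribe.

```python
import math
from collections import defaultdict

def fit_min_max(rows):
    """Return a function that rescales each attribute to [0, 1] using the training ranges."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid division by zero
    def scale(row):
        return [(v - lo) / s for v, lo, s in zip(row, lows, spans)]
    return scale

def knn_class_probabilities(train_x, train_y, query, k=3, weighted=True):
    """Estimate class probabilities for query from its k nearest neighbors."""
    by_distance = sorted(
        ((math.dist(x, query), y) for x, y in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )
    scores = defaultdict(float)
    for d, label in by_distance[:k]:
        # Weighted: closer neighbors count more. Unweighted: one vote each.
        scores[label] += 1.0 / (d * d + 1e-9) if weighted else 1.0
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

# Hypothetical data: (income in dollars, age in years) and a yes/no target.
raw = [(30000, 25), (32000, 60), (90000, 30), (95000, 58)]
labels = ["no", "no", "yes", "yes"]
scale = fit_min_max(raw)
train = [scale(r) for r in raw]
query = scale((88000, 29))

print(knn_class_probabilities(train, labels, query, k=3, weighted=False))
print(knn_class_probabilities(train, labels, query, k=3, weighted=True))
```

Without rescaling, the income column would dominate the distance simply because its numbers are larger. With equal votes the three neighbors split two to one; with distance weighting the very close neighbor dominates the estimate.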
Clustering
Centroid: the center of a cluster, used to represent the cluster
Hierarchical clustering: creates a collection of groupings at different levels of similarity rather than a single set of clusters, which lets you see the "landscape" of the data
K-means: the "means" are the cluster centroids, each calculated by averaging the x values and, separately, the y values of the points in the cluster (one average per dimension). K is the number of clusters one wishes to find in the data.
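The centroid calculation and the k-means idea can be sketched as follows. This is a minimal, simplified version of the usual assign-then-average loop; the function names, the naive choice of the first k points as starting centroids, and the sample points are assumptions made for this example.

```python
import math

def centroid(points):
    """Average each coordinate separately (x values, y values, ...)."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def k_means(points, k, max_iter=100):
    # Assumption: start from the first k points; real uses often pick random starts.
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its points.
        new_centroids = [
            centroid(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so stop
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

On this data the loop settles after a few passes with one centroid near each group. The result can depend on the starting centroids, which is one reason a different initialization may be preferable in practice.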