Ch.6: Similarity, Neighbors, and Clusters
Similarity
Uses
Retrieval (server information)
Classification and Regression
Clustering (unsupervised segmentation)
Recommendations (Netflix and Amazon)
Medicine and Law (precedent)
Similarity and Distance
Similarity can be measured by the distance between two objects
Euclidean Distance: straight-line distance computed with the Pythagorean theorem
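The Euclidean distance can be sketched in a few lines. The function name and the two sample instances below are illustrative assumptions, not taken from the chapter.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of features")
    # Square the difference along each dimension, sum, then take the root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical instances described by two numeric features each.
person_a = (23, 2)
person_b = (26, 6)
print(euclidean_distance(person_a, person_b))  # 5.0
```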
Nearest Neighbor Reasoning
Ex: Whiskey Analytics
how to describe scotch as a feature vector
Color, nose, body, palate, finish
For Predictive Modeling
Use a similarity measure to find the existing examples most similar to a new instance
Use a combining function on those neighbors' known targets to predict the new instance's target value
Classification
The neighbors' known target values (classes) are consulted
Probability Estimation
Assign a score to a new instance (e.g., the fraction of its neighbors in the class)
Regression
Predict a numeric target from the nearest neighbors' values (e.g., their average or median)
How many neighbors and how much influence?
Nearest Neighbor algorithms are commonly referred to as k-NN
k is # of neighbors used
Need to consider how close the neighbors are to the instance
Solved by using weighted voting
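A minimal sketch of weighted voting, assuming each neighbor's vote is weighted by the reciprocal of its squared distance (one common choice, not the only one). The function name, the small `eps` guard, and the toy data are assumptions made for illustration; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import defaultdict

def knn_weighted_vote(train, query, k):
    """train: list of (feature_vector, label) pairs.
    Returns the label with the largest total weight among the k nearest
    neighbors, where weight = 1 / (distance^2 + eps)."""
    eps = 1e-9  # avoids division by zero when a neighbor coincides with the query
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        d = math.dist(features, query)
        votes[label] += 1.0 / (d * d + eps)
    return max(votes, key=votes.get)

# Toy data: one close "yes" neighbor and two distant "no" neighbors.
train = [((1.0, 1.0), "yes"), ((5.0, 5.0), "no"), ((6.0, 5.0), "no")]
print(knn_weighted_vote(train, (2.0, 2.0), k=3))  # yes
```

With k = 3, an unweighted majority vote on this toy data would return "no"; weighting by closeness lets the single nearby neighbor dominate.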
Overfitting and Complexity Control
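A k-NN classifier with a smaller k has a more erratic decision boundary
If k = n, the model has very little complexity (it always predicts the overall majority class or average)
If k = 1, the model is extremely complex (every training example gets its own region)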
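The two extremes of k can be seen on a tiny example. This is a sketch with an unweighted majority vote; the function name and the one-dimensional toy data are assumptions.

```python
import math
from collections import Counter

def knn_majority(train, query, k):
    """Unweighted majority vote among the k nearest neighbors."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Three "A" examples and a single "B" example on a line.
train = [((0.0,), "A"), ((1.0,), "A"), ((2.0,), "A"), ((3.0,), "B")]

print(knn_majority(train, (2.9,), k=1))           # B: follows the single closest point
print(knn_majority(train, (2.9,), k=len(train)))  # A: always the overall majority class
```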
Issues with NN methods
Intelligibility
the intelligibility of an entire model
justification of a specific decision
Dimensionality and Domain Knowledge
scaling of numeric attributes
similarity can be misled by the presence of too many irrelevant attributes
Fix using feature selection
or tuning similarity function manually
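Scaling matters because an attribute with a large numeric range can dominate a distance calculation. One simple option is min-max scaling, sketched below; the function name and the (age, income) rows are hypothetical.

```python
def min_max_scale(rows):
    """Rescale each column of a list of numeric rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append(tuple(
            # A constant column carries no information, so map it to 0.0.
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ))
    return scaled

# Hypothetical (age, income) rows: income would swamp age without scaling.
rows = [(25, 40000), (35, 60000), (45, 100000)]
scaled = min_max_scale(rows)
print(scaled[1][0])  # 0.5
```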
Computational efficiency
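Training is cheap, but the classification step is expensive (each prediction searches the stored examples)
A problem when predictions need to be completed very quickly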
Important Technical Details (Similarities and Neighbors)
Heterogeneous Attributes
difficult if there are multiple values for a categorical attribute
Other distance functions
Manhattan Distance
sums the differences along the different dimensions
Jaccard Distance
Treats the two objects as sets of characteristics
Cosine Distance
often used in text documents to measure similarity of the two documents
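The three distance functions above can be sketched as follows. The function names and sample inputs are assumptions; `cosine_distance` assumes neither vector is all zeros.

```python
import math

def manhattan_distance(a, b):
    """Sum of the absolute differences along each dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard_distance(set_a, set_b):
    """1 minus (size of intersection / size of union) of two sets."""
    union = set_a | set_b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

print(manhattan_distance((23, 2), (26, 6)))                 # 7
print(jaccard_distance({"a", "b", "c"}, {"b", "c", "d"}))   # 0.5
print(cosine_distance((1, 0), (0, 1)))                      # 1.0
```

Cosine distance ignores vector length, which is why it suits documents of different sizes: a word-count vector and the same vector doubled are at distance (approximately) zero.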
Combining Functions
Majority Vote Classification
Majority Scoring Function
Similarity Moderated Classification
Similarity Moderated Scoring
Similarity Moderated Regression
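A sketch of the similarity-moderated variants, assuming each neighbor's contribution is weighted by 1 / distance² and the weights are normalized to sum to one. The function names, the `eps` guard, and the toy data are illustrative assumptions.

```python
import math

def similarity_weights(train, query, k):
    """Return (weight, target) pairs for the k nearest neighbors."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # eps guards against division by zero for an exact match.
    return [(1.0 / (math.dist(f, query) ** 2 + 1e-9), t) for f, t in nearest]

def moderated_score(train, query, k, positive):
    """Weighted share of the neighbors that belong to the `positive` class."""
    pairs = similarity_weights(train, query, k)
    total = sum(w for w, _ in pairs)
    return sum(w for w, t in pairs if t == positive) / total

def moderated_regression(train, query, k):
    """Weighted average of the neighbors' numeric targets."""
    pairs = similarity_weights(train, query, k)
    total = sum(w for w, _ in pairs)
    return sum(w * t for w, t in pairs) / total

labeled = [((1.0, 1.0), "yes"), ((5.0, 5.0), "no"), ((6.0, 5.0), "no")]
numeric = [((0.0,), 10.0), ((2.0,), 20.0)]
print(moderated_score(labeled, (2.0, 2.0), k=3, positive="yes"))
print(moderated_regression(numeric, (1.0,), k=2))
```

The majority-vote and majority-scoring functions are the special case where every neighbor's weight is 1.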
Clustering
The idea of finding groupings in the data
Hierarchical Clustering
Lowest level: each point is a cluster by itself (trivial clusters)
Highest level: the most general, a single cluster containing all points
Linkage Function applied to the distance between two clusters
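A minimal sketch of bottom-up (agglomerative) hierarchical clustering, assuming single linkage, i.e. the distance between two clusters is the distance between their closest pair of points. Other linkage functions can be passed in. All names and the sample points are assumptions, and the brute-force search is only suitable for small data sets.

```python
import math

def single_linkage(cluster_a, cluster_b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def agglomerate(points, target_clusters, linkage=single_linkage):
    """Start with every point alone, then repeatedly merge the two closest
    clusters until only `target_clusters` remain (assumed to be >= 1)."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(agglomerate(points, target_clusters=2))
```

Recording each merge instead of stopping at a fixed count would give the full hierarchy, from single points up to one all-inclusive cluster.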
Clustering around centroids (NN Revised)
Focus on the center of the cluster
k-means clustering algorithm
The "means" are the centroids: each is the average of the values of the points in its cluster
k is simply the number of clusters that one would like to find in the data
Common concern is how to determine a good value for k
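A minimal sketch of the k-means loop, assuming the caller supplies the initial centroids (so k is simply their count) and that `max_iterations` is at least 1. The function name and sample data are assumptions; practical implementations usually choose starting centroids randomly and run several times.

```python
import math

def k_means(points, centroids, max_iterations=100):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its assigned points."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(old)  # keep a centroid that attracted no points
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)  # [(1.0, 1.5), (9.0, 9.5)]
```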
Understanding the Results
Need to understand what the centroids mean vs the data values themselves
Solving a business problem vs data exploration
want to perform unsupervised segmentation to find groups that naturally occur
Direct tradeoff in where effort is expended in the data mining process
For problems that were not defined precisely up front, the difference has to be made up in the evaluation stage