Similarity/Neighbors CH 6
Similarity and Distance
Once an object can be represented as data, we can begin to talk more precisely about the similarity between objects, or alternatively the distance between objects.
the closer two objects are in the space defined by the features, the more similar they are
instances near each other are treated similarly for some purpose
we need a basic method for measuring similarity or distance
we can compute the overall distance by combining the distances along the individual dimensions (features)
This is called the Euclidean distance
useful for comparing the similarity of one pair of instances to that of another pair
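As a sketch in Python (the feature values are invented for illustration), the Euclidean distance between instances A = (a1, …, an) and B = (b1, …, bn) is sqrt((a1-b1)² + … + (an-bn)²):

```python
import numpy as np

def euclidean_distance(a, b):
    """Combine the per-dimension differences: sqrt((a1-b1)^2 + ... + (an-bn)^2)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

# Two instances described by the same numeric features (values purely illustrative)
instance_a = [63.0, 175.8]
instance_b = [45.0, 135.5]
print(euclidean_distance(instance_a, instance_b))  # ~44.1
```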
Nearest Neighbor
we could use this measure to find the companies most similar to our best corporate customers, or the online consumers most similar to our best retail customers
IBM does this to help direct its sales force. Online advertisers do this to target ads. These most-similar instances are called nearest neighbors
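A minimal sketch of such a lookup, assuming purely synthetic customer data with invented features (annual spend, visits per month): compute the distance from a reference instance to every candidate and keep the closest ones.

```python
import numpy as np

def nearest_neighbors(reference, candidates, k=3):
    """Return the indices and distances of the k candidates closest to the reference."""
    diffs = np.asarray(candidates, dtype=float) - np.asarray(reference, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    order = np.argsort(distances)[:k]
    return order, distances[order]

best_customer = [1200.0, 8.0]  # our best retail customer (made up)
prospects = [[300.0, 2.0], [1150.0, 7.0], [900.0, 9.0], [2500.0, 1.0]]
idx, dist = nearest_neighbors(best_customer, prospects, k=2)
print(idx, dist)  # the two prospects most similar to the best customer
```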
Predictive modeling
We also can use the idea of nearest neighbors to do predictive modeling in a different way
given a new example whose target variable we want to predict, we scan through all the training examples and choose several that are the most similar
Then we predict the new example's target value based on the nearest neighbors' (known) target values
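A sketch of that basic procedure (the training data, feature names, and choice of k are all made up): scan the training examples, keep the k most similar, and derive the prediction from their known targets.

```python
import numpy as np
from collections import Counter

def knn_predict(x_new, X_train, y_train, k=3):
    """Predict x_new's target from the targets of its k nearest training examples."""
    diffs = np.asarray(X_train, dtype=float) - np.asarray(x_new, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    neighbor_idx = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in neighbor_idx]
    return Counter(neighbor_targets).most_common(1)[0][0]  # simplest combination: majority vote

# Hypothetical training set: (age, income) -> responded to offer?
X_train = [[35, 50_000], [22, 20_000], [63, 190_000], [59, 170_000]]
y_train = ["No", "No", "Yes", "Yes"]
print(knn_predict([61, 180_000], X_train, y_train, k=3))  # -> "Yes"
```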
Classification
Following the basic procedure introduced above, the nearest neighbors (in this example, three of them) are retrieved and their known target variables (classes) are consulted
Calculating scores from neighbors
Majority vote classification
Majority scoring function
Similarity-moderated classification
Similarity-moderated scoring
Similarity-moderated regression
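Hedged sketches of these combining functions, assuming the neighbors' classes, numeric targets, and distances have already been retrieved (all values made up). Majority vote picks the most common class; the majority scoring function reports the fraction of neighbors in the class of interest; the similarity-moderated versions weight each neighbor's contribution by its closeness (one common choice is weight = 1/distance²), and the same weights give a similarity-moderated regression as a weighted average of the neighbors' numeric targets.

```python
import numpy as np
from collections import Counter

# Retrieved neighbors of a new example (classes, numeric targets, distances: illustrative)
neighbor_classes = ["Yes", "No", "Yes"]
neighbor_values  = np.array([12.0, 30.0, 18.0])
neighbor_dists   = np.array([1.0, 2.0, 4.0])

# Majority vote classification: predict the most common class among the neighbors
majority_class = Counter(neighbor_classes).most_common(1)[0][0]

# Majority scoring function: fraction of neighbors voting for the class of interest
score_yes = neighbor_classes.count("Yes") / len(neighbor_classes)

# Similarity-moderated classification/scoring: closer neighbors count more
weights = 1.0 / neighbor_dists ** 2
is_yes = np.array([c == "Yes" for c in neighbor_classes], dtype=float)
weighted_score_yes = (weights * is_yes).sum() / weights.sum()
weighted_class = "Yes" if weighted_score_yes >= 0.5 else "No"

# Similarity-moderated regression: weighted average of the neighbors' numeric targets
weighted_prediction = (weights * neighbor_values).sum() / weights.sum()

print(majority_class, score_yes, round(weighted_score_yes, 3),
      weighted_class, round(weighted_prediction, 2))
```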
Technical details
Heterogeneous Attributes
Non-numeric attributes are more difficult to handle
The general principle at work is that care must be taken that the similarity/distance computation is meaningful for the application
encode categorical attributes as binary (0/1) variables when applicable
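One hedged sketch of this (attribute names and values invented): scale numeric attributes to a comparable range and encode the categorical attribute as a 0/1 variable before computing distances.

```python
import numpy as np

# Raw instances with heterogeneous attributes: (age in years, income in dollars, gender)
raw = [(35, 50_000, "M"), (22, 95_000, "F"), (63, 190_000, "F")]

ages    = np.array([r[0] for r in raw], dtype=float)
incomes = np.array([r[1] for r in raw], dtype=float)

def min_max(x):
    """Rescale a numeric attribute to [0, 1] so no single range dominates."""
    return (x - x.min()) / (x.max() - x.min())

# Binary encoding of the categorical attribute
is_female = np.array([1.0 if r[2] == "F" else 0.0 for r in raw])

X = np.column_stack([min_max(ages), min_max(incomes), is_female])
print(X)  # every attribute now contributes to the distance on a comparable scale
```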
Distance Functions
Euclidean distance
Manhattan distance
Jaccard distance
Cosine distance
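Rough sketches of these four functions for a pair of feature vectors (Jaccard shown for binary vectors, where it compares the sets of attributes the two instances have):

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def jaccard_distance(a, b):
    # For binary vectors: 1 - |intersection| / |union| of the "on" attributes
    a, b = a.astype(bool), b.astype(bool)
    return 1.0 - (a & b).sum() / (a | b).sum()

def cosine_distance(a, b):
    # 1 - cosine similarity; ignores vector length and compares direction only
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 0.0, 1.0, 1.0])
y = np.array([1.0, 1.0, 0.0, 1.0])
print(euclidean(x, y), manhattan(x, y), jaccard_distance(x, y), cosine_distance(x, y))
```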
How many neighbors and how much influence
There is no simple answer to how many neighbors should be used
Odd numbers are convenient for breaking ties for majority vote classification with two-class problems
nearest-neighbor methods often use weighted voting or similarity-moderated voting
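For example, scikit-learn's KNeighborsClassifier exposes this choice directly through its weights parameter: 'uniform' gives a plain majority vote, while 'distance' weights each neighbor's vote by the inverse of its distance. The tiny two-class dataset below is made up.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Unweighted majority vote among the k nearest neighbors
uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X, y)
# Similarity-moderated voting: each neighbor's vote is weighted by 1/distance
weighted = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)

new_point = np.array([[3.2, 4.5]])
print(uniform.predict_proba(new_point), weighted.predict_proba(new_point))
```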
Geometric Interpretation, Overfitting, and Complexity Control
in terms of overfitting and its avoidance, the k in a k-NN classifier is a complexity parameter
How to choose k?
we can conduct cross-validation or other nested holdout testing on the training set, for a variety of different values of k
Then, when we have chosen a value of k, we build a k-NN model from the entire training set
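A sketch of that procedure with scikit-learn (the dataset and candidate values of k are arbitrary stand-ins): cross-validate each k on the training set, pick the best, then refit on all the training data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in training set

# Estimate accuracy for a range of k values using 5-fold cross-validation
candidate_ks = [1, 3, 5, 7, 9, 15, 25]
cv_scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in candidate_ks
}

best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, "-> chosen k:", best_k)

# Having chosen k, build the final k-NN model from the entire training set
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)
```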
Issues
Intelligibility
the justification of a specific decision
Whether such justifications are adequate depends on the application
intelligibility of an entire model
if model intelligibility and justification are critical, nearest-neighbor methods should be avoided
Dimensionality and domain knowledge
numeric attributes may have vastly different ranges
unless they are scaled appropriately, the effect of one attribute with a wide range can swamp the effect of another with a much smaller range
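A small sketch of this swamping effect (values invented): income, measured in dollars, dominates the Euclidean distance completely until both attributes are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# (age, income): income's range is thousands of times wider than age's
X = np.array([[25.0, 50_000.0], [60.0, 51_000.0], [26.0, 90_000.0]])

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Unscaled: the 26-year-old looks far from the 25-year-old because of income alone
print(euclidean(X[0], X[1]), euclidean(X[0], X[2]))   # ~1000.6 vs ~40000.0

# Standardized (zero mean, unit variance per attribute): age matters again
Z = StandardScaler().fit_transform(X)
print(euclidean(Z[0], Z[1]), euclidean(Z[0], Z[2]))
```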
Computational efficiency
the computational cost of a nearest-neighbor method is borne at prediction/classification time rather than at training time
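A sketch illustrating that asymmetry with scikit-learn (dataset sizes arbitrary): fitting does little beyond storing, and possibly indexing, the training data, while every prediction has to search it.

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50_000, 20))
y_train = (X_train[:, 0] > 0).astype(int)
X_query = rng.normal(size=(1_000, 20))

model = KNeighborsClassifier(n_neighbors=5)

t0 = time.perf_counter()
model.fit(X_train, y_train)   # cheap: stores (and possibly indexes) the training data
t1 = time.perf_counter()
model.predict(X_query)        # expensive: searches the stored data for every query
t2 = time.perf_counter()

print(f"fit: {t1 - t0:.3f}s  predict: {t2 - t1:.3f}s")
```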