Provost Chapter 6: Up to Clustering
Similarity And Distance
Euclidean distance
Purpose: compute the overall distance between two instances
The most common geometric distance measure
Not limited to two dimensions
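As a concrete illustration (mine, not the chapter's), a minimal NumPy sketch of Euclidean distance between two feature vectors of any dimensionality:

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean (L2) distance: square root of the sum of squared
    per-attribute differences. Works in any number of dimensions."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

# Two instances described by three numeric attributes.
print(euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 5.0
```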
Nearest Neighbor
Whiskey Example
Create numeric features that will summarize the information in the tasting notes
Whiskey description example
• Color: gold
• Nose: fresh and sea
• Body: firm, medium, and light
• Palate: sweet, fruity, and clean
• Finish: full
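One way to realize this (the descriptor vocabulary below is illustrative, not the book's exact feature set) is to encode each tasting-note term as a 0/1 feature:

```python
# Hypothetical vocabulary of tasting-note descriptors; each feature is a
# 0/1 flag for whether the descriptor appears in the whiskey's notes.
FEATURES = ["gold", "fresh", "sea", "firm", "medium", "light",
            "sweet", "fruity", "clean", "full"]

def to_vector(descriptors):
    """Turn a set of descriptors into a numeric feature vector."""
    return [1 if f in descriptors else 0 for f in FEATURES]

example = to_vector({"gold", "fresh", "sea", "firm", "medium",
                     "light", "sweet", "fruity", "clean", "full"})
other = to_vector({"firm", "full"})  # made-up comparison whiskey
```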
Predictive Modeling
Choose the target variable we want to predict, scan through all the training examples, and choose the several that are most similar to the new instance whose value we want to predict (a combined sketch for the three tasks follows below)
Classification
Probability Estimation
Compute class-probability estimates using the nearest neighbors, e.g., the fraction of neighbors belonging to each class
Regression: predict the average (or median) of the nearest neighbors' target values
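A minimal sketch (assuming NumPy; the data is made up) doing all three tasks from the same neighbor retrieval:

```python
import numpy as np

def nearest_neighbors(X, y, query, k=3):
    """Return the target values of the k training instances
    closest (by Euclidean distance) to the query instance."""
    dists = np.sqrt(((X - query) ** 2).sum(axis=1))
    return y[np.argsort(dists)[:k]]

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y_class = np.array([0, 0, 0, 1, 1])             # class labels
y_value = np.array([1.0, 2.0, 1.5, 9.0, 10.0])  # numeric target

q = np.array([0.5, 0.5])
nbrs = nearest_neighbors(X, y_class, q, k=3)

# Classification: majority vote among the neighbors.
pred_class = np.bincount(nbrs).argmax()
# Probability estimation: fraction of neighbors in each class.
probs = np.bincount(nbrs, minlength=2) / len(nbrs)
# Regression: average of the neighbors' numeric target values.
pred_value = nearest_neighbors(X, y_value, q, k=3).mean()
```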
How Many Neighbors and How Much Influence?
Nearest-neighbor algorithms: k-NN
If we increase k to the maximum possible (so that k = n), the entire dataset is used for every prediction; this simply predicts the majority class in the entire dataset
An odd number of neighbors breaks ties in two-class problems
Weighted voting, or similarity-moderated voting
Each neighbor's contribution to the vote is scaled by its similarity to the new instance (sketched below)
Weighted scoring: the same similarity weighting applied when averaging numeric values
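A sketch of similarity-moderated voting; the 1/(distance squared) weight is a common choice I am assuming here, not a formula prescribed by the chapter:

```python
import numpy as np

def weighted_vote(X, y, query, k=3, eps=1e-9):
    """Each of the k nearest neighbors votes for its class, with the
    vote scaled by similarity (here 1 / squared distance) to the query."""
    dists = np.sqrt(((X - query) ** 2).sum(axis=1))
    idx = np.argsort(dists)[:k]
    weights = 1.0 / (dists[idx] ** 2 + eps)  # closer => more influence
    scores = {}
    for label, w in zip(y[idx], weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

# Weighted scoring is the regression analogue: a similarity-weighted
# average of neighbor values, e.g. np.average(y_num[idx], weights=weights).
```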
Geometric Interpretation, Overfitting, and Complexity Control
Boundaries
Instance neighborhoods implicitly carve the instance space into regions
These regions form boundaries tailored to the specific data used for training
Overfitting/Avoidance
With k = n, we do not allow much complexity: every prediction is the dataset-wide majority (or average)
With k = 1, we get an extremely complex model whose boundaries fit the training data exactly (prone to overfitting)
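To see the complexity knob directly (a sketch assuming scikit-learn is available; the two-moons data is synthetic):

```python
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# k = 1: maximally complex boundaries; fits the training set perfectly.
print(KNeighborsClassifier(n_neighbors=1).fit(X, y).score(X, y))  # 1.0
# k = n: every query sees the whole dataset; with balanced classes the
# prediction collapses to a single (tie-broken) class, accuracy ~0.5.
print(KNeighborsClassifier(n_neighbors=len(X)).fit(X, y).score(X, y))
```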
Issues with Nearest-Neighbor Methods
Intelligibility
Two aspects
The justification of a specific decision
Intelligibility of an entire model
Dimensionality and domain knowledge
Scaling Problems
Numeric attributes may have vastly different ranges
Unless ranges are scaled appropriately, the effect of one attribute with a wide range can swamp the effect of another with a much smaller range
Fixes: scale attribute ranges to be comparable; use feature selection (often guided by domain knowledge) to drop irrelevant attributes
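A minimal sketch of the range fix (min-max scaling here; any normalization that makes ranges comparable serves the same purpose):

```python
import numpy as np

def min_max_scale(X):
    """Rescale each attribute (column) to [0, 1] so a wide-range
    attribute (e.g. income in dollars) cannot swamp a narrow-range
    one (e.g. age in years) in the distance computation."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

# age, income: without scaling, income differences dominate entirely.
X = np.array([[25, 50_000], [40, 120_000], [33, 80_000]])
print(min_max_scale(X))
```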
Computational efficiency
The cost of a nearest-neighbor method is incurred at the prediction/classification step rather than during training (training amounts to storing the instances)
Some Important Technical Details Relating to Similarities and Neighbors
Heterogeneous Attributes
Distance functions must combine numeric and categorical attributes sensibly, and attributes may need explicit weighting
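A sketch of one common convention (an assumption on my part, not necessarily the chapter's): squared numeric differences plus 0/1 penalties for categorical mismatches:

```python
def mixed_distance(a, b, numeric_idx, categorical_idx):
    """Combine scaled numeric differences with 0/1 categorical
    mismatches into a single distance. Assumes numeric attributes
    are already scaled to comparable ranges."""
    d = sum((a[i] - b[i]) ** 2 for i in numeric_idx)
    d += sum(1 for i in categorical_idx if a[i] != b[i])
    return d ** 0.5

# age, income (both scaled), gender, residence
x1 = (0.30, 0.55, "M", "urban")
x2 = (0.35, 0.50, "F", "urban")
print(mixed_distance(x1, x2, numeric_idx=(0, 1), categorical_idx=(2, 3)))
```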
Other Distance functions
Euclidean distance (L2 norm)
Manhattan distance (L1 norm)
Jaccard distance
Cosine distance
Ignores the magnitude of the vectors
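Minimal NumPy sketches of the alternative distances (Jaccard shown for 0/1 vectors):

```python
import numpy as np

def manhattan(a, b):          # L1 norm: sum of absolute differences
    return np.abs(a - b).sum()

def jaccard_distance(a, b):   # a, b: 0/1 vectors; 1 - |A∩B| / |A∪B|
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 1.0 - inter / union

def cosine_distance(a, b):    # 1 - cosine of angle; ignores magnitude
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1, 0, 1, 1])
b = np.array([1, 1, 0, 1])
print(manhattan(a, b), jaccard_distance(a, b), cosine_distance(a, b))
```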
Calculating Scores from Neighbors
Majority Vote Classification
c(x) = argmax_{c ∈ classes} score(c, neighbors_k(x))
Majority scoring function
score(c, N) = Σ_{y ∈ N} [class(y) = c]
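A direct transcription of the two formulas into Python (the neighbor set is assumed to come from a retrieval step such as the earlier sketch):

```python
def score(c, neighbors):
    """Majority scoring: score(c, N) = number of y in N with class(y) = c."""
    return sum(1 for y in neighbors if y == c)

def classify(neighbors, classes):
    """c(x) = argmax over c in classes of score(c, neighbors_k(x))."""
    return max(classes, key=lambda c: score(c, neighbors))

print(classify(["pos", "pos", "neg"], classes=["pos", "neg"]))  # -> pos
```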