Provost Chapter 6
Distance
The closer the feature vectors are, the more similar the objects
Can use basic geometry
Represent objects as points in a two-dimensional space
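The geometric idea above can be sketched with the Euclidean (straight-line) distance between two feature vectors. This is a minimal illustration: the customer names and the two features (age, years at current address) are hypothetical and not taken from the notes.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical customers described by (age, years at current address).
alice = (23, 2)
bob = (40, 10)
carol = (25, 3)

# The smaller the distance, the more similar the two objects.
print(euclidean_distance(alice, carol))
print(euclidean_distance(alice, bob))
```

Under this measure Alice is closer to Carol than to Bob, so Carol would be treated as the more similar customer.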
Nearest-Neighbor
The most similar data point is the nearest neighbor
groups of similar data are neighborhoods
geometric shapes
with new data use the nearest-neighbor to predict
calculate scores from neighborhoods
Give a probability estimate
how to choose k
use cross validation
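One way to picture choosing k by cross-validation is leave-one-out: hold out each labelled example in turn, predict it from the rest, and keep the k with the best accuracy. The sketch below assumes Euclidean distance and simple majority voting; the function names and the toy data are illustrative only.

```python
import math
from collections import Counter

def knn_predict(query, examples, k):
    """Majority label among the k examples nearest to query."""
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(examples, k):
    """Fraction of examples predicted correctly when each is held out in turn."""
    hits = 0
    for i, (features, label) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        hits += knn_predict(features, rest, k) == label
    return hits / len(examples)

def choose_k(examples, candidates):
    """Candidate k with the highest leave-one-out accuracy (first one on ties)."""
    return max(candidates, key=lambda k: leave_one_out_accuracy(examples, k))

# Hypothetical labelled data: two well-separated groups of three points.
examples = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
    ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"),
]
print(choose_k(examples, [1, 3, 5]))
```

On this toy data a k that is too large (k = 5) lets the other group outvote the true neighbors, which is the kind of failure cross-validation is meant to expose.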
heterogeneous attributes (non-numeric, different scales)
Manhattan distance
Jaccard distance
cosine distance
text similarity
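The three alternative distances named above can be sketched in a few lines each. These are the standard textbook definitions written as plain Python helpers; the function names are illustrative, and treating two empty sets as identical in the Jaccard case is an assumption made here.

```python
import math

def manhattan_distance(a, b):
    """Sum of absolute differences along each dimension (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard_distance(s, t):
    """1 minus the share of items the two sets have in common."""
    s, t = set(s), set(t)
    union = s | t
    if not union:
        return 0.0  # assumption: two empty sets count as identical
    return 1.0 - len(s & t) / len(union)

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

print(manhattan_distance((1, 2), (4, 6)))
print(jaccard_distance({"a", "b", "c"}, {"b", "c", "d"}))
print(cosine_distance((1, 2), (2, 4)))
```

Cosine distance depends only on direction, which is why it suits text similarity: a long document and a short one with the same word proportions come out as near-identical.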
Clustering
Finding natural groupings in data, not specifically based on a target characteristic
application of similarity
groups points by similarity
Hierarchical clustering
collection of ways to group the points
see the landscape of data similarity
merge nodes together
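The merging step can be sketched as a bottom-up (agglomerative) procedure: start with every point as its own cluster and repeatedly merge the two closest clusters. The sketch assumes Euclidean distance and single linkage (cluster distance is the closest pair of points), which is one choice among several; the sample points are hypothetical.

```python
import math

def single_linkage(c1, c2):
    """Distance between two clusters: the closest pair of points."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: single_linkage(clusters[ab[0]],
                                                        clusters[ab[1]]))
        distance = single_linkage(clusters[i], clusters[j])
        history.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return history

points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
for left, right, distance in agglomerate(points):
    print(left, "+", right, "at distance", round(distance, 2))
```

Reading the merge history from top to bottom gives the "landscape": tight pairs merge first at small distances, and the outlier joins last at a large one.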
Centroid
cluster center
start with the number of clusters
experiment with k values until the desired result
average of the values for each feature
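Centroid-based clustering can be sketched as a k-means loop: assign each point to its nearest centroid, then recompute each centroid as the per-feature average of its points, and repeat until nothing changes. Seeding the centroids with the first k points is an assumption made here to keep the example deterministic; practical implementations usually seed randomly.

```python
import math

def centroid(points):
    """Cluster center: the average of the values for each feature."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def k_means(points, k, iterations=100):
    """Return (centers, clusters) after alternating assignment and update steps."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: seed the centroids with the first k points.
    centers = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centers, clusters = k_means(points, k=2)
print(centers)
print(clusters)
```

Rerunning with different k values and inspecting the resulting groups is the experimentation step the notes describe.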
How to understand
elements can be represented by their names
use additional metrics such as "best sellers"
Step back and look at the overall problem, not just data exploration.
You can use supervised learning to generate descriptions
Use supervised learning to find differences between clusters
Fundamental Concepts
Calculating similarity of objects
Using similarity for prediction
Clustering as similarity-based segmentation
Exemplary techniques
Searching for similar entities
Nearest-neighbor methods
problems
computational efficiency
Dimensionality and domain knowledge
intelligibility
Distance metrics for calculating similarity
Similarity
share common characteristics
Data mining is based on grouping by similarity and searching for the "right" sort of similarity
Classification and regression
Types of classification
Majority voting
weighted voting
Weighted scoring
use regression with distance
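The voting and scoring variants above can be compared side by side in a short sketch. It assumes Euclidean distance and weights each neighbor by the inverse of its squared distance, which is one common choice rather than the only one; the function names and the toy data are illustrative.

```python
import math
from collections import Counter, defaultdict

EPSILON = 1e-9  # assumption: avoids division by zero for exact matches

def nearest_neighbors(query, examples, k):
    """Return the k (distance, target) pairs closest to query."""
    scored = [(math.dist(query, features), target) for features, target in examples]
    return sorted(scored, key=lambda pair: pair[0])[:k]

def majority_vote(neighbors):
    """Each neighbor gets one vote regardless of distance."""
    return Counter(target for _, target in neighbors).most_common(1)[0][0]

def weighted_scores(neighbors):
    """Distance-weighted class scores, normalised to sum to 1."""
    weights = defaultdict(float)
    for distance, target in neighbors:
        weights[target] += 1.0 / (distance ** 2 + EPSILON)
    total = sum(weights.values())
    return {target: w / total for target, w in weights.items()}

def weighted_average(neighbors):
    """Regression: distance-weighted mean of numeric targets."""
    weights = [1.0 / (d ** 2 + EPSILON) for d, _ in neighbors]
    return sum(w * t for w, (_, t) in zip(weights, neighbors)) / sum(weights)

# Hypothetical labelled examples: (features, class).
examples = [((1, 1), "yes"), ((1, 2), "yes"),
            ((5, 5), "no"), ((6, 5), "no"), ((6, 6), "no")]
neighbors = nearest_neighbors((2, 2), examples, k=5)
print(majority_vote(neighbors))
print(weighted_scores(neighbors))
```

With k = 5 the three distant "no" examples win the plain majority vote, while the weighted scores favor "yes" because the two "yes" examples are much closer to the query. The normalised scores can also be read as rough probability estimates.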