Provost Ch.6
similarity and distance
once an object can be represented as data, we can begin to talk more precisely about the similarity between objects, or alternatively the distance between objects
in order to do so, we need a basic method for measuring similarity or distance
we can compute the overall distance by combining the distances along the individual dimensions (the individual features in our setting): square each difference, sum the squares, and take the square root
this is called the Euclidean distance between two points, and it's probably the most common geometric distance measure
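a minimal sketch of the Euclidean distance calculation, to make the per-feature combination concrete. the function name `euclidean_distance` and the two example people (age, years at current address) are made up for illustration and do not come from the chapter.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    # Square each per-feature difference, sum the squares, take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical people described by (age, years at current address).
person_a = (23, 2)
person_b = (40, 10)
distance_ab = euclidean_distance(person_a, person_b)  # sqrt(17**2 + 8**2)
```

a smaller value means the two objects are more similar under this measure.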
clustering
the basic idea is that we want to find groups of objects, where the objects within a group are similar, but the objects in different groups are not so similar
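one way to see this idea in action is a toy centroid-based clustering loop in the style of k-means. this is a sketch under assumptions, not the chapter's own procedure: the function name `kmeans`, the naive choice of the first k points as starting centroids, and the six sample points are all hypothetical.

```python
import math

def kmeans(points, k, iterations=10):
    """Toy k-means: returns (centroids, assignments)."""
    # Naive initialization (an assumption): use the first k points as centroids.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical 2-D objects forming two visibly separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
```

points that end up with the same assignment are close to each other and far from the other group, which is the within-group versus between-group contrast described above.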
similarity
data mining procedures often are based on grouping things by similarity or searching for the "right" sort of similarity
example
modern retailers such as Amazon and Netflix use similarity to provide recommendations of similar products or from similar people
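a small sketch of the "similar people" idea: rank other customers by how close their ratings are to a target customer's. the customer names, the ratings, and the helper `most_similar` are hypothetical, and real recommender systems are considerably more involved than this.

```python
import math

# Hypothetical ratings (1-5) for the same four products, one entry per customer.
ratings = {
    "ana":   (5, 4, 1, 1),
    "ben":   (4, 5, 2, 1),
    "chloe": (1, 1, 5, 4),
}

def most_similar(target, others):
    """Return the names in `others`, sorted from nearest to farthest from `target`."""
    return sorted(others, key=lambda name: math.dist(ratings[target], ratings[name]))

neighbors = most_similar("ana", ["ben", "chloe"])
```

products liked by the nearest customer, and not yet seen by the target, would then be natural candidates to recommend.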
geometric interpretation, overfitting, and complexity control
although a nearest-neighbor classifier creates no explicit boundary, there are implicit regions created by instance neighborhoods.
these regions can be calculated by systematically probing points in the instance space, determining each point's classification, and constructing the boundary where classifications change.
boundaries are not lines, nor are they even any recognizable geometric shape; they are erratic and follow frontiers between training instances of different classes
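the probing procedure can be sketched as follows, assuming a 1-nearest-neighbor classifier and a regular grid of probe points. the training instances, the grid spacing, and the function names `classify_1nn` and `boundary_cells` are illustrative assumptions.

```python
import math

# Hypothetical labelled training instances: ((x, y), class).
training = [((1.0, 1.0), "A"), ((2.0, 1.5), "A"), ((4.0, 4.0), "B"), ((5.0, 3.5), "B")]

def classify_1nn(point):
    """Class of the single nearest training instance."""
    return min(training, key=lambda inst: math.dist(point, inst[0]))[1]

def boundary_cells(xs, ys):
    """Grid points whose right-hand or upper neighbor receives a different class."""
    cells = []
    for i, x in enumerate(xs[:-1]):
        for j, y in enumerate(ys[:-1]):
            here = classify_1nn((x, y))
            if here != classify_1nn((xs[i + 1], y)) or here != classify_1nn((x, ys[j + 1])):
                cells.append((x, y))
    return cells

grid = [k * 0.5 for k in range(13)]  # probe coordinates 0.0, 0.5, ..., 6.0
frontier = boundary_cells(grid, grid)
```

the points collected in `frontier` trace where the classification flips; a finer grid gives a closer approximation of the implicit boundary.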
probability estimation
it is often important not just to classify a new example but to estimate its probability, that is, to assign a score to it, because a score gives more information than just a Yes/No decision
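one simple way to get such a score from nearest neighbors is the fraction of the k nearest training instances that belong to the class of interest. the sketch below assumes that approach; the data, the function name `knn_probability`, and the choice k=3 are hypothetical.

```python
import math

# Hypothetical training data: ((feature1, feature2), label).
training = [
    ((1.0, 1.0), "Yes"), ((1.5, 2.0), "Yes"), ((2.0, 1.0), "No"),
    ((6.0, 6.0), "No"), ((7.0, 5.5), "No"),
]

def knn_probability(point, k, positive="Yes"):
    """Fraction of the k nearest training instances labelled `positive`."""
    nearest = sorted(training, key=lambda inst: math.dist(point, inst[0]))[:k]
    return sum(1 for _, label in nearest if label == positive) / k

# Two of the three nearest neighbors are "Yes", so the score is 2/3.
score = knn_probability((1.2, 1.2), k=3)
```

with small k the score can only take a few distinct values, so it is a coarse estimate rather than a calibrated probability.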
Heterogeneous attributes
if attributes are numeric and are directly comparable, the distance calculation is indeed straightforward
when examples contain complex, heterogeneous attributes things become more complicated
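one possible way to handle mixed attributes, shown here only as a sketch: rescale each numeric attribute to a common range so that no single attribute dominates, and count a categorical attribute as 0 when the values match and 1 when they differ. the attribute names, the assumed value ranges, and the function `mixed_distance` are all hypothetical and are not taken from the chapter.

```python
import math

# Assumed value ranges used to rescale numeric attributes to [0, 1].
NUMERIC_RANGES = {"age": (18, 90), "income": (0, 200_000)}
# Categorical attributes contribute 0 on a match and 1 on a mismatch.
CATEGORICAL = {"residence"}

def mixed_distance(a, b):
    """Euclidean-style distance over rescaled numeric and 0/1 categorical differences."""
    total = 0.0
    for name, (low, high) in NUMERIC_RANGES.items():
        diff = (a[name] - b[name]) / (high - low)
        total += diff ** 2
    for name in CATEGORICAL:
        total += 0.0 if a[name] == b[name] else 1.0
    return math.sqrt(total)

p = {"age": 30, "income": 50_000, "residence": "rent"}
q = {"age": 48, "income": 90_000, "residence": "own"}
d = mixed_distance(p, q)
```

without the rescaling step, the income difference (tens of thousands) would swamp the age difference (tens), which is the kind of complication heterogeneous attributes introduce.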
fundamental concepts
calculating similarity of objects described by data; using similarity for prediction; Clustering as similarity-based segmentation
exemplary techniques
searching for similar entities; Nearest neighbor methods; Clustering methods; Distance metrics for calculating similarity