Provost Chapter - Clustering
To group similar items/people into clusters
Via distance
Euclidean distance
Draw a right triangle, find the longest side via the Pythagorean theorem
By assigning a distance between points based on relevant attributes:
This allows us to find items similar to the original point (based on attribute values)
It is the most common distance metric, but about a dozen others are used in data mining
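The "right triangle" picture above can be sketched directly; this is a minimal stdlib-only example (the point values are made up for illustration):

```python
import math

def euclidean(a, b):
    # Sum the squared differences over each attribute, then take the square root
    # (the Pythagorean theorem generalized to any number of dimensions).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean((0, 0), (3, 4)))  # 5.0 — the hypotenuse of a 3-4-5 right triangle
```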
Neighbors
Uses a set number of nearest neighbors (NN)
The number is chosen by the data scientist (an odd number avoids voting ties, etc.)
Shorthand = k
This parameter sets the model's complexity
Has shortcomings when it comes to unique cases/fields
House loans: denying an applicant because they are similar to other families
EX: Used by Netflix and Amazon
Has no strict "learning" about any one customer/base
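Putting the pieces together, a k-NN classifier is just "find the k closest training points, take a majority vote." A minimal sketch with hypothetical loan data (the numbers are invented for illustration):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote of its k nearest training points.
    `train` is a list of (features, label) pairs."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Sort training examples by distance to the query, keep the k closest
    nearest = sorted(train, key=lambda fl: dist(fl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy loan data: (income, debt) -> decision; k is odd to break ties
train = [((50, 10), "approve"), ((60, 5), "approve"),
         ((20, 40), "deny"), ((25, 35), "deny"), ((55, 8), "approve")]
print(knn_predict(train, (52, 9), k=3))  # "approve"
```

Note there is no training step at all: the model is the stored data, which is why the notes say it does no strict "learning" about any one customer.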
Edit distance
Used for text
Shows how alike two strings are
Great for typos, street addresses
Has regex-like correction options
Also used when the order of characters in the strings is important
Biology
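The standard way to compute edit distance is the Levenshtein dynamic program: count the fewest insertions, deletions, and substitutions needed to turn one string into another. A minimal sketch (the street-address example is invented):

```python
def edit_distance(s, t):
    # Levenshtein distance via a single rolling row of the DP table
    m, n = len(s), len(t)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1,                    # deletion
                                   d[j - 1] + 1,                # insertion
                                   prev + (s[i-1] != t[j-1]))   # substitution
    return d[n]

print(edit_distance("123 Main St", "123 Mian St"))  # 2 — a transposed-letter typo
```

Low distances flag likely typos; the same idea (with domain-specific costs) underlies sequence alignment in biology.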
Neighbor formulas
Clustering
Dendrogram
The groups nest like the layers of a jawbreaker
The layers sit inside one another; deeper = narrower classification
An advantage of hierarchical clustering is that it allows the data analyst to see the landscape
Can form around nodes
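The jawbreaker picture corresponds to agglomerative clustering: start with every point as its own cluster and repeatedly merge the closest pair; the merge history is exactly what a dendrogram draws. A stdlib-only single-linkage sketch (the points are made up):

```python
import math

def single_linkage(points):
    """Greedy agglomerative clustering. Returns the list of merged groups,
    from the tightest (deepest dendrogram layer) to the root containing all points."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest point-to-point distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(dist(a, b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = single_linkage([(0, 0), (0, 1), (5, 5), (5, 6)])
print(merges[-1])  # the final merge holds all points — the dendrogram's root
```

Cutting the merge sequence at different depths gives the analyst the "landscape" view: few broad clusters near the root, many narrow ones near the leaves.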
Using Supervised Learning to Generate Cluster Descriptions
Assign a name (label) to each cluster
Use binary (0 or 1) feature checks
A tree of such checks sorts items into the clusters
EX: iPhone word articles
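The idea is to relabel each item by its cluster membership (1 = in the cluster, 0 = not) and train a classifier whose rules then *describe* the cluster. A minimal sketch using a one-level "decision stump" instead of a full tree; the article data and feature names are invented for illustration:

```python
def best_stump(examples):
    """Find the single binary feature that best separates in-cluster (label 1)
    from out-of-cluster (label 0) examples — a one-check cluster description.
    `examples` is a list of (feature_dict, label) pairs."""
    features = examples[0][0].keys()
    def accuracy(f):
        # Predict "in cluster" exactly when feature f is 1; score the matches
        return sum(feats[f] == label for feats, label in examples) / len(examples)
    return max(features, key=accuracy)

# Hypothetical news articles: 1 = belongs to the cluster we want to describe
data = [({"iphone": 1, "sports": 0}, 1),
        ({"iphone": 1, "sports": 1}, 1),
        ({"iphone": 0, "sports": 1}, 0),
        ({"iphone": 0, "sports": 0}, 0)]
print(best_stump(data))  # "iphone" — the cluster is "articles mentioning iPhone"
```

A real pipeline would use a full decision tree over many word features, but the output is the same kind of human-readable description.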
Analysts typically wind up spending less time in the business understanding phase
... and more time in the evaluation stage and in defining the next cycle