Provost Chapter 6: Clustering and Beyond
Clustering Intro
Basic Definition: Finding natural groupings in the data
Expanded
Find groups of objects (consumers, businesses, whiskeys, etc.) where the objects within a group are similar, but objects in different groups are not so similar
Hierarchical Clustering
Groups the points by their similarity
Dendrogram
Figure that explicitly shows the hierarchy of the clusters
y axis: represents the distance between the clusters
Advantages
Allows the data analyst to see the groupings before deciding on the number of clusters to extract
Linkage function
For hierarchical clustering, we need a distance function between clusters, considering individual instances to be the smallest clusters
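As a rough illustration of the linkage idea above, here is a minimal sketch of bottom-up (agglomerative) clustering in plain Python. The function names, the sample points, and the choice of Euclidean distance are all assumptions made for this example and are not taken from the chapter. Each recorded merge distance corresponds to the height at which two branches would join on the dendrogram's y axis.

```python
import math
from itertools import combinations

def single_linkage(c1, c2):
    """Cluster distance = distance between the closest pair of instances."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Cluster distance = distance between the farthest pair of instances."""
    return max(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points, linkage=single_linkage):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[p] for p in points]  # each instance starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        height = linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

def n_clusters_at(merges, n_points, cut):
    """Number of clusters left if the dendrogram is cut at distance `cut`."""
    return n_points - sum(1 for _, _, height in merges if height <= cut)

# Hypothetical 2-D instances, chosen only to make the grouping obvious.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
merges = agglomerate(points)
for left, right, height in merges:
    print(f"merge {left} + {right} at distance {height:.2f}")
print("clusters when cutting at 3.0:", n_clusters_at(merges, len(points), cut=3.0))
```

Passing `linkage=complete_linkage` instead changes only how the distance between two clusters is measured. The merge loop stays the same.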
Nearest Neighbors Revisited: Clustering Around Centroids
Basic Understanding
Method that focuses on the clusters themselves by representing each cluster by its “cluster center,” or centroid
Centroid: geometric center of a group of instances
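A small sketch of how a centroid could be computed, assuming each instance is a tuple of numeric feature values. The function name and the sample data are illustrative only.

```python
def centroid(instances):
    """Average the values along each dimension across all instances."""
    n = len(instances)
    return tuple(sum(values) / n for values in zip(*instances))

# Three hypothetical 2-D instances.
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]
print(centroid(cluster))
```

The centroid is not required to be one of the original instances. It is simply the per-dimension average.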
k-means
“means” are the centroids, represented by the arithmetic means (averages) of the values along each dimension for the instances in the cluster
k in k-means is the number of clusters that you'd like to find in the data.
Run k-means many times, starting with different random centroids each time, because a single run may not find the best clustering
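The procedure described above can be sketched in plain Python as follows. This is a minimal illustration under several assumptions: Euclidean distance, initial centroids sampled from the data points, and the run with the lowest distortion (sum of squared distances to the assigned centroid) kept as the winner. All function and parameter names are made up for this example.

```python
import math
import random

def centroid(instances):
    """Arithmetic mean along each dimension."""
    n = len(instances)
    return tuple(sum(values) / n for values in zip(*instances))

def assign(points, centroids):
    """For each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]

def distortion(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from a random start."""
    centroids = rng.sample(points, k)  # assumption: start from k data points
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # Keep the old centroid if a cluster happens to become empty.
            new_centroids.append(centroid(members) if members else centroids[c])
        new_labels = assign(points, new_centroids)
        centroids = new_centroids
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return centroids, labels, distortion(points, centroids, labels)

def kmeans(points, k, restarts=10, seed=0):
    """Run k-means several times and keep the run with the lowest distortion."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(restarts))
    return min(runs, key=lambda run: run[2])

# Two hypothetical, well-separated groups of 2-D instances.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels, best_distortion = kmeans(points, k=2)
print(centroids, labels, best_distortion)
```

Fixing `seed` makes the restarts reproducible. In practice the number of restarts is a judgment call that trades running time against the risk of a poor starting position.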
How to determine a good value for k
Two options
Experiment with increasing values of k and graph various metrics (sometimes obliquely called indices) of the quality of the resulting clusterings
Experiment with different k values and see which ones generate good results
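One way the "increasing values of k" experiment might look, sketched in plain Python. The quality metric used here is distortion (sum of squared distances to the nearest centroid), which is only one possible choice. The compact k-means helper, the number of restarts, and the sample data are all assumptions for illustration.

```python
import math
import random

def kmeans_distortion(points, k, restarts=50, seed=0):
    """Lowest distortion found by k-means over several random starts."""
    rng = random.Random(seed)
    best = math.inf
    for _ in range(restarts):
        centroids = rng.sample(points, k)  # assumption: start from k data points
        for _ in range(100):
            # Assign every point to its nearest centroid.
            groups = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
                groups[nearest].append(p)
            # Recompute each centroid as the mean of its group.
            updated = [
                tuple(sum(v) / len(g) for v in zip(*g)) if g else centroids[c]
                for c, g in enumerate(groups)
            ]
            if updated == centroids:
                break
            centroids = updated
        total = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
        best = min(best, total)
    return best

# Three hypothetical, well-separated groups of 2-D instances.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]

distortions = {k: kmeans_distortion(points, k) for k in range(1, 6)}
for k, value in distortions.items():
    print(f"k={k}: distortion={value:.2f}")
```

Distortion generally keeps shrinking as k grows, so the lowest value alone is not a useful target. A common heuristic is to look for the k after which the improvement levels off, which for this toy data should be k = 3.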
Understanding the Results of Clustering
Whiskey Example
In this case, the names of the data points are meaningful in and of themselves and convey meaning to an expert in the field
Using Supervised Learning to Generate Cluster Descriptions
Basic idea
Use the cluster assignments to label examples. Each example is given the label of the cluster it belongs to, and these labels can be treated as class labels
Create differential descriptions
Two ways to set up the classification task
k classes, one per cluster
k separate learning tasks, each comparing one cluster against all the others
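A minimal sketch of the labeling idea, in plain Python. It assumes one reading of the two setups: a single task with one class per cluster, or one binary task per cluster (that cluster versus everything else). The feature names, the data, and the single-threshold rule learner are all hypothetical stand-ins, used only to show how a differential description might fall out of a supervised learner.

```python
def k_class_task(instances, cluster_ids):
    """One task with k classes: each instance is labeled with its cluster."""
    return list(zip(instances, cluster_ids))

def one_vs_rest_tasks(instances, cluster_ids):
    """k binary tasks: members of one cluster are True, all others False."""
    return {
        c: [(x, cid == c) for x, cid in zip(instances, cluster_ids)]
        for c in sorted(set(cluster_ids))
    }

def best_stump(task, feature_names):
    """Most accurate single-feature threshold rule for a binary task."""
    best = None
    for f, name in enumerate(feature_names):
        values = sorted({x[f] for x, _ in task})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between adjacent values
            for op in (">", "<="):
                correct = sum(
                    ((x[f] > t) if op == ">" else (x[f] <= t)) == y
                    for x, y in task
                )
                if best is None or correct > best[0]:
                    best = (correct, f"{name} {op} {t}")
    return best[1], best[0] / len(task)

# Hypothetical feature scores and cluster assignments from an earlier clustering.
feature_names = ("smoky", "sweet")
instances = [(4, 1), (3, 1), (4, 2), (1, 3), (0, 4), (1, 4)]
cluster_ids = [0, 0, 0, 1, 1, 1]

labeled = k_class_task(instances, cluster_ids)
for cluster, task in one_vs_rest_tasks(instances, cluster_ids).items():
    rule, accuracy = best_stump(task, feature_names)
    print(f"cluster {cluster}: {rule} (accuracy {accuracy:.2f})")
```

A real analysis would more likely use a decision tree or another interpretable classifier, but the shape is the same: the learned rule describes what distinguishes one cluster from the others.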
Stepping Back: Solving a Business Problem Versus Data Exploration