8 Cluster Analysis:
Basic Concepts and Algorithms
8.1 Concepts
Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships.
8.1.2 Different Types of Clusterings
Hierarchical (nested)
Clusters can have subclusters, forming a tree;
Partitional (unnested)
Each object only belongs to one cluster;
Exclusive
Each object only belongs to one cluster;
Overlapping
An object can simultaneously belong to more than one cluster;
Fuzzy
Every object belongs to every cluster with a membership weight between 0 and 1;
Complete
Every object is assigned to a cluster;
Partial
Only some of the objects are clustered;
8.1.3 Different Types of Clusters
Well-Separated
Prototype-Based (center-based)
Graph-Based (contiguity-based)
Density-Based
Shared-Property (conceptual clusters)
8.2 K-means
8.2.1 The Basic K-Means Algorithm
- select K initial centroids;
- Repeat:
assign each point to its closest centroid, forming K clusters;
recompute the centroid of each cluster;
until the centroids no longer change.
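The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's own code; it assumes points are given as tuples of floats and picks the initial centroids at random:

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Basic K-means (the steps above): pick K initial centroids at
    random, then alternate assignment and update steps until the
    centroids stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its old centroid -- see 8.2.2).
        new_centroids = [
            tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: centroids did not change
            break
        centroids = new_centroids
    return centroids, clusters
```

Because the initial centroids are random, different seeds can give different final clusterings (see "Choosing Initial Centroids" below).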
Assigning Points to the Closest Centroid
Euclidean (\(L_2\)) distance;
Cosine similarity for documents;
Manhattan (\(L_1\)) distance for Euclidean data;
Jaccard measure for documents.
Centroids and Objective Functions
Data in Euclidean Space
goal: minimise the sum of the squared error (SSE), also called scatter
\( SSE = \displaystyle \sum_{i=1}^K \sum_{x \in C_i} dist(c_i,x)^2 \)
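The SSE formula translates directly into code; a small sketch, assuming clusters and their centroids are given as parallel lists:

```python
import math

def sse(clusters, centroids):
    """Sum of squared errors: for each cluster C_i, add the squared
    Euclidean distance of every point x to its centroid c_i."""
    return sum(
        math.dist(c, x) ** 2
        for c, cluster in zip(centroids, clusters)
        for x in cluster
    )
```

For example, one cluster `[(0, 0), (0, 2)]` with centroid `(0, 1)` contributes \(1^2 + 1^2 = 2\).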
Document Data
Proximity function: cosine;
Objective function: maximise the sum of the cosine similarities, called cohesion:
\( \text{Total Cohesion} = \displaystyle \sum_{i=1}^K \sum_{x \in C_i} cosine(x,c_i) \)
Choosing Initial Centroids
The choice of initial centroids affects the final clustering result.
random
repeated random
sample points and hierarchically cluster them to get initial centroids;
bisecting K-means
postprocessing
8.2.2 Other Issues
Handling empty clusters: choose a replacement centroid, e.g. the point farthest from any current centroid, or a point from the cluster with the highest SSE
Outliers: remove or handle them via preprocessing and postprocessing
Reducing the SSE with postprocessing
by increasing the number of clusters:
split a cluster
introduce a new cluster centroid
by decreasing the number of clusters while trying to minimise the increase in total SSE:
disperse a cluster
merge two clusters
Updating centroids incrementally
8.2.3 Bisecting K-means
an extension of basic K-means;
start with one cluster containing all points, then successively bisect a chosen cluster until K clusters are obtained;
split criterion: may be based on SSE and/or cluster size;
often
the resulting clusters are refined
with basic K-means, using their centroids as the initial centroids.
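A self-contained sketch of the idea, assuming Euclidean points as tuples and SSE alone as the split criterion (the chapter allows size as well); the 2-means subroutine here is a bare Lloyd's loop:

```python
import math
import random

def two_means(points, rng):
    """Bisect one cluster with basic 2-means (Lloyd's algorithm)."""
    centroids = rng.sample(points, 2)
    for _ in range(100):
        halves = ([], [])
        for p in points:
            # bool index: False -> first half, True -> second half
            halves[math.dist(p, centroids[0]) > math.dist(p, centroids[1])].append(p)
        new = tuple(
            tuple(sum(v) / len(h) for v in zip(*h)) if h else centroids[i]
            for i, h in enumerate(halves)
        )
        if new == tuple(centroids):
            break
        centroids = list(new)
    return [h for h in halves if h]

def cluster_sse(cluster):
    """SSE of a single cluster about its own centroid."""
    c = tuple(sum(v) / len(cluster) for v in zip(*cluster))
    return sum(math.dist(c, x) ** 2 for x in cluster)

def bisecting_kmeans(points, k, seed=0):
    """Repeatedly pick the cluster with the largest SSE and bisect it
    with 2-means until K clusters are obtained."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=cluster_sse)
        clusters.remove(worst)
        clusters.extend(two_means(worst, rng))
    return clusters
```

A final refinement pass of basic K-means, seeded with these clusters' centroids, is omitted here for brevity.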
8.2.4 K-means and Different Types of Clusters
K-means may fail to find some natural clusters;
it is better suited for globular or well-separated clusters;
this can be mitigated by increasing the number of clusters (finding subclusters);
8.2.5 Strengths and Weaknesses
Strengths
: Simple, efficient, bisecting is less susceptible to initialization problems;
Weaknesses
:
not suited for non-globular clusters;
not suited for clusters with different sizes and densities;
susceptible to outliers;
restricted to data with notion of a center (centroid);
8.3 Agglomerative Hierarchical Clustering
Agglomerative (Dendrogram, Nested cluster diagram)
Divisive
8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm
repeatedly merge the two most proximate clusters until only one cluster remains
Defining Proximity between Clusters
single link (MIN)
complete link (MAX)
group average
Ward's method
centroid methods
Time complexity: \( O(m^2 \log m ) \)
Space complexity: \( O(m^2) \)
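As an illustration of the merge loop, here is a naive single-link (MIN) version, stopping at k clusters rather than building the full dendrogram; this straightforward form is \(O(m^3)\), and the \(O(m^2 \log m)\) bound quoted above requires a sorted structure over the pairwise distances:

```python
import math

def single_link_agglomerative(points, k):
    """Naive single-link agglomerative clustering: start with each
    point as its own cluster and repeatedly merge the two clusters
    whose closest pair of points is nearest, until k clusters remain."""
    clusters = [[p] for p in points]

    def min_dist(a, b):
        # single link (MIN): proximity of the closest pair across clusters
        return min(math.dist(x, y) for x in a for y in b)

    while len(clusters) > k:
        i, j = min(
            ((i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: min_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))  # merge j into i
    return clusters
```

Swapping `min` for `max` in `min_dist` gives complete link (MAX); averaging gives group average.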
8.3.3 The Lance-Williams Formula for Cluster Proximity
General expression for the proximity between a merged cluster \( R = A \cup B \) and another cluster \( Q \):
\( p(R,Q) = \alpha_A p(A,Q) + \alpha_B p(B,Q) \\ + \beta p(A,B) + \gamma |p(A,Q) - p(B,Q)| \)
coefficients
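The standard schemes are instances of this formula under particular coefficient choices; for example (standard values, not listed in this outline):

\( \text{single link (MIN)}: \alpha_A = \alpha_B = \tfrac{1}{2},\ \beta = 0,\ \gamma = -\tfrac{1}{2} \)

\( \text{complete link (MAX)}: \alpha_A = \alpha_B = \tfrac{1}{2},\ \beta = 0,\ \gamma = \tfrac{1}{2} \)

since \( \tfrac{1}{2}(a+b) - \tfrac{1}{2}|a-b| = \min(a,b) \) and \( \tfrac{1}{2}(a+b) + \tfrac{1}{2}|a-b| = \max(a,b) \).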
8.3.4 Key Issues in Hierarchical Clustering
Lack of global objective function;
Ability to handle different cluster sizes: weighted/unweighted;
Merging decisions are final;
8.4 DBSCAN
(Density-based spatial clustering of applications with noise)
8.4.1 Center-Based Approach
radius \( Eps \)
threshold
MinPts
Core points, Border points, Noise points
8.4.2 The DBSCAN Algorithm
selecting the parameters: look for the sharp change (knee) in the sorted k-dist graph
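A sketch of the algorithm using the center-based density definitions above; a minimal illustration assuming Euclidean points as tuples, with noise labelled -1:

```python
import math
from collections import deque

def dbscan(points, eps, min_pts):
    """DBSCAN sketch: a core point has at least min_pts points (itself
    included) within radius eps; clusters grow outward from core points,
    border points join a core point's cluster, the rest is noise (-1)."""
    n = len(points)
    # Precompute Eps-neighbourhoods (O(m^2) -- the expensive part in general).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    next_cluster = 0
    for i in range(n):
        if labels[i] is not None or len(neighbours[i]) < min_pts:
            continue  # already assigned, or not a core point
        # Breadth-first expansion of a new cluster from core point i.
        labels[i] = next_cluster
        queue = deque(neighbours[i])
        while queue:
            j = queue.popleft()
            if labels[j] is not None:
                continue
            labels[j] = next_cluster
            if len(neighbours[j]) >= min_pts:  # only core points keep expanding
                queue.extend(neighbours[j])
        next_cluster += 1
    return [-1 if label is None else label for label in labels]
```

Note how the \(O(m^2)\) neighbourhood computation is what makes DBSCAN computationally expensive without spatial indexing.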
8.4.3 Strengths and weaknesses
Strengths
:
resistant to noise;
arbitrary shapes and sizes;
Weaknesses
: has trouble with
widely varying densities;
high-dimensional data;
and can be computationally expensive.
8.5 Cluster Evaluation
Unsupervised (internal indices)
cohesion (compactness, tightness)
separation (isolation)
Supervised (external indices)
Relative
8.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation
\( \text{overall validity} = \displaystyle \sum_{i=1}^K w_i validity(C_i) \)
Graph-Based
\( cohesion(C_i) = \displaystyle \sum_{\substack{x \in C_i \\ y \in C_i}} proximity(x,y) \)
\( separation(C_i, C_j) = \displaystyle \sum_{\substack{x \in C_i \\ y \in C_j}} proximity(x,y) \)
Table 8.6 Table of graph-based cluster evaluation measures
Prototype-Based
\( cohesion(C_i) = \displaystyle \sum_{x \in C_i} proximity(x,c_i) \)
\( separation(C_i, C_j) = proximity(c_i, c_j) \)
\( separation(C_i) = proximity(c_i,c) \), where \( c \) is the overall centroid
The Silhouette Coefficient
ranges over \([-1, 1]\); values closer to 1 are better
\( \text{total silhouette coefficient} = \text{mean}(s_i) \)
\( s_i = (b_i - a_i) / \max(a_i, b_i) \)
\( \text{where}: \)
\( s_i: \text{the silhouette coefficient of the } i^{th} \text{ object}; \)
\( a_i: \text{its average distance to all other objects in its own cluster}; \)
\( b_i: \text{the minimum, over the other clusters, of its average distance to that cluster's objects;} \)
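The definitions of \(a_i\) and \(b_i\) map directly into code; a small sketch, assuming every cluster has at least two objects:

```python
import math
from statistics import mean

def silhouette(clusters):
    """Total silhouette coefficient: for each object, a is its average
    distance to the other objects in its own cluster, b the smallest
    average distance to the objects of any other cluster."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for x in cluster:
            a = mean(math.dist(x, y) for y in cluster if y != x)
            b = min(
                mean(math.dist(x, y) for y in other)
                for cj, other in enumerate(clusters) if cj != ci
            )
            scores.append((b - a) / max(a, b))
    return mean(scores)
```

Two tight, well-separated clusters give a value close to 1; overlapping clusters pull it toward 0 or below.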
8.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix
Measuring cluster validity via
Correlation
Judging a clustering visually by its similarity matrix
8.5.4 Unsupervised Evaluation of Hierarchical Clustering
cophenetic distance
Cophenetic Correlation Coefficient (CPCC)
8.5.5 Determining the Correct Number of Clusters
look for a knee in the plot of SSE versus number of clusters, or a peak in the silhouette coefficient;
8.5.7 Supervised Measures of Cluster Validity
Classification-oriented
: entropy, purity, precision, recall, F-measure;
Similarity-oriented