Module 5 - Clustering pt.1
[Unsupervised Learning]
Measurement of clustering
To evaluate a clustering, we use similarity metrics
A similarity measure quantifies how alike two data objects are
Similarity Metrics
Euclidean distance
: the length of a line segment between the two points
Correlation coefficient
: summarizes the strength and direction of the relationship between two variables
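The two metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `euclidean_distance` and `pearson_correlation` are hypothetical, and the notes do not say which correlation coefficient is meant, so the Pearson coefficient is assumed here.

```python
import math

def euclidean_distance(p, q):
    """Length of the straight line segment between points p and q."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pearson_correlation(x, y):
    """Pearson correlation coefficient (assumed variant) of two variables."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)

# Usage: distance between two 2-D points, correlation of two variables
d = euclidean_distance((0, 0), (3, 4))
r = pearson_correlation([1, 2, 3], [2, 4, 6])
```

A smaller Euclidean distance means two objects are more alike, while a correlation close to 1 or -1 means two variables move together strongly.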
Cluster Analysis
Definition
: Grouping a set of data objects into clusters
Application :
As a stand-alone tool to get insight into data distribution (descriptive analysis)
As a preprocessing step for other algorithms
The main idea of cluster analysis is to arrange the data points into clusters of similar objects
Example Use Cases:
It is widely used in image processing, data analysis, and pattern recognition.
It helps marketers find distinct groups in their customer base and characterize those groups by their purchasing patterns.
It is used in biology to derive animal and plant taxonomies and to identify genes with similar functionality.
It also helps in information discovery by grouping documents on the web.
Clustering Problem
Problem:
How to define a mapping or characteristic that groups the data into new categories (clusters)
Good Clustering Characteristic:
High intra-class similarity (objects within the same cluster are alike)
Low inter-class similarity (objects in different clusters are dissimilar)
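One way to check cluster quality numerically is to compare distances within clusters against distances between clusters: members of the same cluster should sit close together, and members of different clusters should sit far apart. The sketch below assumes Euclidean distance as the (dis)similarity measure; the helper names and the toy data are hypothetical.

```python
import math
from itertools import combinations

def mean_intra_distance(clusters):
    """Average pairwise distance between points that share a cluster."""
    dists = [math.dist(p, q)
             for cluster in clusters
             for p, q in combinations(cluster, 2)]
    return sum(dists) / len(dists)

def mean_inter_distance(clusters):
    """Average distance between points that belong to different clusters."""
    dists = [math.dist(p, q)
             for a, b in combinations(clusters, 2)
             for p in a
             for q in b]
    return sum(dists) / len(dists)

# Toy data (illustrative only): two well-separated clusters of 2-D points
clusters = [
    [(0, 0), (0, 1)],
    [(10, 10), (10, 11)],
]
intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
```

For a good clustering, `intra` is small relative to `inter`. Because distance is the inverse of similarity, a small intra-cluster distance corresponds to high similarity inside each cluster.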
Quality of Clustering:
Result
: depends on both the similarity measure used by the method and its implementation
Method
: measured by its ability to discover some or all of the hidden patterns
Similarity
Definition
: “The quality or state of being similar; likeness; resemblance; as, a similarity of features”.
Difference between Classification and Clustering
Classification
Uses a training dataset
There are labels for the training data
Classification is the problem of identifying to which of a set of categories a new observation belongs
Uses an algorithm to categorize new data according to observations from the training set
The aim is to find which of the predefined classes a new object belongs to
Example algorithms:
Decision Tree
Naive Bayes
Clustering
Does not use a training dataset
There are no labels
Clustering is the partitioning of a set of data into a set of classes called "clusters"
Uses statistical concepts (e.g. Euclidean distance, correlation coefficient); the dataset is split into subsets with similar features
The aim is to group a set of objects in order to find whether there is any relationship between them
Example algorithms:
K-means
K-medoids
Expectation maximization
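To make the clustering side concrete, here is a minimal sketch of K-means, one of the clustering algorithms named in these notes. It is written under simplifying assumptions: the caller supplies the initial centroids (real implementations usually pick them randomly or with a seeding heuristic), Euclidean distance is the similarity measure, and the function name `k_means` and the toy points are hypothetical. No labels or training set are involved, which is the key contrast with classification.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids

# Toy data (illustrative only): unlabeled 2-D points and two starting centroids
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centroids = k_means(points, centroids=[(0, 0), (10, 10)])
```

The returned `labels` list gives the cluster index of each point. The result depends on the starting centroids, so in practice the algorithm is often run several times with different initializations.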