Data Mining - Coggle Diagram
Data Mining
Clustering
K-means
Number of clusters, K, must be specified
Details
‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
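The nodes above can be sketched as a minimal K-means loop. This is an illustrative sketch, not a reference implementation: it assumes Euclidean distance as the closeness measure, picks the initial centroids at random from the data, and the names `euclidean` and `kmeans` are hypothetical. Each pass over the assignment step costs roughly n * K * d operations, which is where the O(n * K * I * d) bound comes from.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means: K must be specified up front by the caller."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centroids, labels = kmeans(data, k=2)
    print(centroids, labels)
```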
Evaluating K-means
Given two clusterings (two sets of clusters), we can choose the one with the smallest error
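One way to make this comparison concrete is sketched below. It assumes the error is the sum of squared errors (SSE), the measure most commonly paired with K-means; the surviving text does not name the measure, and the function name `sse` and the sample data are hypothetical.

```python
def sse(points, centroids, assignment):
    """Sum of squared Euclidean distances from each point to its centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[j]))
        for p, j in zip(points, assignment)
    )


if __name__ == "__main__":
    points = [(0, 0), (0, 2), (10, 0), (10, 2)]
    # Clustering A groups the left pair and the right pair.
    sse_a = sse(points, [(0, 1), (10, 1)], [0, 0, 1, 1])
    # Clustering B groups the bottom pair and the top pair.
    sse_b = sse(points, [(5, 0), (5, 2)], [0, 1, 0, 1])
    print("prefer A" if sse_a < sse_b else "prefer B")
```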
Selecting initial points
PROBLEM: If there are K ‘real’ clusters then the chance of selecting one centroid from each cluster is small.
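A rough calculation shows how small this chance is. The sketch below assumes that all K clusters have the same size and that each of the K initial centroids is drawn independently and uniformly from the data; under those assumptions the probability is K! / K^K. The function name is hypothetical.

```python
import math


def prob_one_centroid_per_cluster(k):
    """P(one initial centroid lands in each of k equal-size clusters)."""
    return math.factorial(k) / k ** k


if __name__ == "__main__":
    for k in (2, 5, 10):
        print(k, prob_one_centroid_per_cluster(k))
```

For K = 10 this gives about 0.00036, so a single random initialization almost never places one centroid in every cluster.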
SOLUTIONS
Bisecting K-means
Clustering results are refined by using the centroids from bisecting K-means as the initial centroids for the basic K-means algorithm
Strengths
Quite efficient, even though multiple runs are often performed
DBSCAN
Strengths
Uses a density-based definition of a cluster, so it is relatively resistant to noise and can handle clusters of arbitrary shapes and sizes
Weaknesses
Has trouble with high-dimensional data, as density (distance) is more difficult to define for such data
Can be expensive when the computation of nearest neighbors requires computing all pairwise proximities, as is usually the case for high-dimensional data
Clustering
Types of clustering
Hierarchical clustering
Each node (cluster) in the tree is the union of its children (sub-clusters) and the root node is the cluster containing all the objects
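The union property can be illustrated with a tiny tree structure. This is only a sketch: the class name `ClusterNode`, its fields, and the sample objects are hypothetical, and leaves are assumed to hold a single object each.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterNode:
    children: list = field(default_factory=list)
    item: object = None  # set only on leaf nodes (single objects)

    def members(self):
        """Objects in this cluster: the union of the children's members."""
        if not self.children:
            return {self.item}
        result = set()
        for child in self.children:
            result |= child.members()
        return result


if __name__ == "__main__":
    ab = ClusterNode([ClusterNode(item="a"), ClusterNode(item="b")])
    root = ClusterNode([ab, ClusterNode(item="c")])
    print(ab.members())    # the sub-cluster
    print(root.members())  # the root contains every object
```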
Reinforcement Learning
Markov Decision Processes (MDP)
Qπ(s, a) is the action-value function of the policy π: the expected return from performing action a in state s and following π afterwards.
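Written out under the usual discounted-return convention, the definition takes the form below. The discount factor γ and the per-step reward r are assumptions here, since the surviving text does not define them.

```latex
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_0 = s,\ a_0 = a \right]
```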
Association rule mining
Definitions
Support count (σ)
σ({Milk, Bread, Diaper}) = 2
Support
s({Milk, Bread, Diaper}) = 2/5
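Both quantities can be computed directly from a transaction list. The sketch below is illustrative: the five transactions are an assumed market-basket table chosen to reproduce the numbers above (the original table did not survive), and the function names are hypothetical.

```python
# Assumed example data: five market-basket transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


if __name__ == "__main__":
    itemset = {"Milk", "Bread", "Diaper"}
    print(support_count(itemset, transactions))
    print(support(itemset, transactions))
```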
Association Rule
Example:
{Milk, Diaper} → {Beer}
Apriori principle
If an itemset is infrequent, then all of its supersets are infrequent
If an itemset is frequent, then all of its subsets are frequent
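The principle is what lets a level-wise search discard candidates without counting them. The following is a minimal sketch of that idea, not a tuned Apriori implementation: the transaction table is assumed example data, the minimum support count of 3 is an arbitrary choice, and all function names are hypothetical.

```python
from itertools import combinations

# Assumed example data: five market-basket transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def has_infrequent_subset(candidate, frequent_smaller):
    """True if any (k-1)-subset of the k-item candidate is not frequent."""
    k = len(candidate)
    return any(
        frozenset(sub) not in frequent_smaller
        for sub in combinations(candidate, k - 1)
    )


def frequent_itemsets(transactions, min_count):
    """Level-wise search for all itemsets with support count >= min_count."""
    items = {i for t in transactions for i in t}
    level = {
        frozenset([i]) for i in items
        if sum(i in t for t in transactions) >= min_count
    }
    result = set(level)
    k = 2
    while level:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: a candidate with an infrequent subset cannot be frequent.
        candidates = {c for c in candidates if not has_infrequent_subset(c, level)}
        # Count step: only the surviving candidates are checked against the data.
        level = {
            c for c in candidates
            if sum(c <= t for t in transactions) >= min_count
        }
        result |= level
        k += 1
    return result


if __name__ == "__main__":
    for itemset in sorted(frequent_itemsets(transactions, 3), key=lambda s: (len(s), sorted(s))):
        print(sorted(itemset))
```

In this data {Bread, Beer} is infrequent at the chosen threshold, so {Bread, Diaper, Beer} is pruned without a pass over the transactions.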