Learning for Parallel and Distributed Clustering Algorithms

Platforms for Parallel and Distributed Computing

GPU, MapReduce, Hadoop, Spark, CloudStack, MPI, OpenMP

Parallel & Distributed System Validations

Speedup, scale-up, number of clusters vs. number of machines, runtime, time comparison for varying numbers of objects, distance threshold & number of nodes
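As a rough illustration, the first two validation metrics are runtime ratios; a minimal sketch with hypothetical timings (all numbers here are placeholders, not measurements from the source):

```python
# Sketch: computing speedup and scale-up from measured runtimes.
# All runtime values below are hypothetical placeholders.

def speedup(t_serial, t_parallel):
    """Speedup = serial runtime / parallel runtime on p machines."""
    return t_serial / t_parallel

def scaleup(t_base, t_scaled):
    """Scale-up: runtime on 1 machine with the base data vs. runtime on
    p machines with p-times the data; the ideal value is 1.0."""
    return t_base / t_scaled

# Hypothetical measurements (seconds)
print(speedup(120.0, 20.0))   # 8 machines → 6.0
print(scaleup(120.0, 150.0))  # 8x data on 8 machines → 0.8
```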

Apt Distance and Similarity Measures for Parallel and Distributed Clustering

Apt Evaluation Indicators for Parallel & Distributed Algorithms

Popular Algorithms Implemented in Parallel and Distributed Form for Large Data

Algorithms Meeting All Criteria for Parallel and Distributed Implementation

K-means, BIRCH, CLARA, CURE, DBSCAN, and WaveCluster
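Of these, k-means is the canonical example of why clustering distributes well: the assignment step is embarrassingly parallel over data partitions, and the centroid update is a reduction over partial sums. A minimal single-process sketch of one MapReduce-style iteration (the data, partitioning, and function names are illustrative, not from the source):

```python
# Sketch of one MapReduce-style k-means iteration: each partition is
# "mapped" to (cluster_id, point) pairs, then the pairs are "reduced"
# into new centroids. Data and centroids are illustrative only.

def assign(partition, centroids):
    """Map phase: emit (nearest-centroid index, point) per point."""
    out = []
    for x in partition:
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
        out.append((dists.index(min(dists)), x))
    return out

def update(assignments, k, dim):
    """Reduce phase: average the points assigned to each centroid."""
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for cid, x in assignments:
        counts[cid] += 1
        for j, v in enumerate(x):
            sums[cid][j] += v
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]

# Two "machines", each holding one partition of the data
partitions = [[(0.0, 0.0), (1.0, 0.0)], [(9.0, 9.0), (10.0, 9.0)]]
centroids = [(0.0, 0.0), (9.0, 9.0)]
mapped = [pair for p in partitions for pair in assign(p, centroids)]
print(update(mapped, k=2, dim=2))  # → [[0.5, 0.0], [9.5, 9.0]]
```

In a real deployment each `assign` call runs on a separate worker and only the partial sums travel over the network, which is why k-means appears in nearly every MapReduce/Spark clustering implementation.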

Distance Measures

Standardized Euclidean, Cosine Distance, Pearson Correlation Distance

Similarity Measures

Jaccard Similarity, measures for data of mixed type

Internal

Dunn Index, Silhouette Coefficient

External

Rand Index, F Index, Jaccard Index, Confusion Matrix
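The distance and similarity measures listed above can each be sketched in a few lines; these are minimal plain-Python reference implementations (the function names are mine, and the standard-deviation vector for the standardized Euclidean distance is assumed precomputed per dimension):

```python
import math

def standardized_euclidean(x, y, std):
    """Euclidean distance with each dimension scaled by its std. dev."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, std)))

def cosine_distance(x, y):
    """1 - cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)

def pearson_distance(x, y):
    """1 - Pearson correlation coefficient of two vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def jaccard_similarity(a, b):
    """|A & B| / |A | B| for two sets (binary/categorical data)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(cosine_distance((1, 0), (0, 1)))           # orthogonal → 1.0
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # → 0.5
```

All four decompose over data partitions, which is what makes them apt for parallel and distributed settings.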

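The external indicators all derive from pairwise agreement counts between a predicted clustering and ground-truth labels; a minimal sketch under that standard formulation (function names and data are illustrative):

```python
from itertools import combinations

def pair_counts(pred, truth):
    """Count point pairs by agreement: a = same cluster in both labelings,
    b = same in pred only, c = same in truth only, d = different in both."""
    a = b = c = d = 0
    for i, j in combinations(range(len(pred)), 2):
        same_p = pred[i] == pred[j]
        same_t = truth[i] == truth[j]
        if same_p and same_t: a += 1
        elif same_p: b += 1
        elif same_t: c += 1
        else: d += 1
    return a, b, c, d

def rand_index(pred, truth):
    a, b, c, d = pair_counts(pred, truth)
    return (a + d) / (a + b + c + d)

def jaccard_index(pred, truth):
    a, b, c, _ = pair_counts(pred, truth)
    return a / (a + b + c)

def f_index(pred, truth):
    a, b, c, _ = pair_counts(pred, truth)
    p, r = a / (a + b), a / (a + c)  # pairwise precision and recall
    return 2 * p * r / (p + r)

pred  = [0, 0, 1, 1]  # illustrative labelings
truth = [0, 0, 1, 1]
print(rand_index(pred, truth))     # perfect agreement → 1.0
print(jaccard_index(pred, truth))  # → 1.0
print(f_index(pred, truth))        # → 1.0
```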
Characteristics a Clustering Algorithm Must Possess for Parallel & Distributed Implementation

Low time complexity, highly scalable, support for large & high-dimensional data,
suited to arbitrary data sets, insensitive to noise & outliers, order insensitive

e.g., CURE, STING

Challenges for Parallel and Distributed Implementation

Work-efficient algorithms often do not exhibit massively parallel or distributed behavior; algorithms with ample parallel or distributed behavior may have questionable numerical stability and suffer catastrophic load imbalance due to highly non-uniform data distribution
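The load-imbalance challenge can be made concrete: overall runtime is gated by the most heavily loaded partition, so with skewed data one straggler dominates. A minimal sketch of a common imbalance metric, maximum load over mean load (the partition sizes below are hypothetical):

```python
def imbalance_factor(partition_sizes):
    """Max partition load divided by mean load. 1.0 means perfectly
    balanced; runtime roughly scales with the max-loaded partition."""
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean

# Hypothetical: 4 workers over uniform vs. highly non-uniform data
print(imbalance_factor([100, 100, 100, 100]))  # uniform → 1.0
print(imbalance_factor([10, 10, 10, 370]))     # skewed → 3.7
```

A factor of 3.7 means the skewed run is almost four times slower than a balanced run over the same total data, which is why non-uniform data distributions are singled out as a core challenge.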