Learning for Parallel and Distributed Clustering Algorithms
Platforms for Parallel and Distributed Computing
GPU, MapReduce, Hadoop, Spark, CloudStack, MPI, OpenMP
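As a minimal sketch of the data-parallel pattern these platforms share (partition the data, process partitions concurrently, combine partial results, as in MapReduce or an OpenMP reduction), the following uses Python's standard-library process pool; the helper `partial_sum` and the chunking scheme are illustrative assumptions, not part of any of the listed platforms' APIs:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Hypothetical per-worker task: reduce one data partition locally.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000))
    # Split the data into 4 strided chunks, one per worker process.
    chunks = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, chunks)   # "map" phase
    total = sum(partials)                          # "reduce" phase
    print(total)
```

The same split/process/combine shape underlies distributed clustering: each node clusters or summarizes its partition, and a combine step merges the partial results.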
Parallel & Distributed System Validations
Speedup, Scaleup, Number of clusters vs. number of machines, Runtime, Runtime comparison across numbers of objects, Distance threshold & number of nodes
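Speedup and scaleup, the first two validation metrics above, can be computed from measured runtimes using their standard definitions (the timing values below are hypothetical examples, not results from the surveyed systems):

```python
def speedup(t_one_machine, t_p_machines):
    """Speedup on p machines: runtime on one machine divided by
    runtime on p machines for the SAME data set; ideal value is p."""
    return t_one_machine / t_p_machines

def scaleup(t_small_one_machine, t_large_p_machines):
    """Scaleup: runtime for 1x data on 1 machine divided by runtime
    for p-times the data on p machines; a value near 1.0 is ideal."""
    return t_small_one_machine / t_large_p_machines

# Hypothetical measured runtimes in seconds:
print(speedup(120.0, 18.0))    # same data, 1 vs. 8 machines
print(scaleup(120.0, 135.0))   # 8x data on 8 machines
```

Both ratios are dimensionless, so any consistent time unit works.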
Apt Distance and Similarity Measures for Parallel and Distributed Clustering
Apt Evaluation Indicators for Parallel & Distributed algorithms
Popular Algorithms Implemented Parallel and Distributed for Large Data
Algorithms Meeting all Criteria for Parallel and Distributed Implementation
K-means, BIRCH, CLARA, CURE, DBSCAN, and WaveCluster
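Of the algorithms listed, K-means parallelizes most directly because one iteration decomposes into an independent assignment step per data partition followed by a global centroid update. A minimal sketch of that one iteration in MapReduce style follows; the function names and the list-of-tuples data layout are illustrative assumptions:

```python
import math

def assign(points, centroids):
    """Map step: each worker assigns the points of its partition to the
    nearest centroid, emitting (centroid_index, point) pairs."""
    pairs = []
    for p in points:
        idx = min(range(len(centroids)),
                  key=lambda k: math.dist(p, centroids[k]))
        pairs.append((idx, p))
    return pairs

def update(pairs, k, dim):
    """Reduce step: recompute each centroid as the mean of the points
    assigned to it (pairs may be gathered from all workers)."""
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for idx, p in pairs:
        counts[idx] += 1
        for d in range(dim):
            sums[idx][d] += p[d]
    new_centroids = []
    for i in range(k):
        if counts[i]:
            new_centroids.append([s / counts[i] for s in sums[i]])
        else:
            new_centroids.append(sums[i])  # empty cluster: keep zeros
    return new_centroids
```

Only the small (index, point) pairs or per-partition partial sums need to cross the network between the two steps, which is why K-means scales well on MapReduce, Spark, and MPI alike.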
Distance Measures
Standardized Euclidean Distance, Cosine Distance, Pearson Correlation Distance
Similarity Measures
Jaccard Similarity, for data of mixed type
Internal
Dunn Indicator, Silhouette Coefficient
External
Rand Indicator, F Indicator, Jaccard Indicator, Confusion Matrix
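Three of the measures above can be sketched in a few lines each, using their standard definitions: cosine distance between vectors, Jaccard similarity between sets, and the Rand indicator as an external evaluation of two clusterings (the function names are mine; this is an illustrative sketch, not the survey's implementation):

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def jaccard_similarity(s, t):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)

def rand_indicator(labels_a, labels_b):
    """Rand indicator: fraction of point pairs on which two labelings
    agree (both place the pair together, or both place it apart)."""
    n = len(labels_a)
    agree = pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            pairs += 1
            if (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j]):
                agree += 1
    return agree / pairs
```

All three decompose over pairs or partitions, so they can themselves be computed in parallel over large clusterings.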
Characteristics Possessed by Clustering Algorithm for Parallel & Distributed Implementation
Low time complexity, highly scalable, support for large & high-dimensional data,
suitable for arbitrary data sets, insensitive to noise & outliers, order insensitive
e.g., CURE, STING
Challenges for Parallel and Distributed Implementation
Work-efficient algorithms often do not exhibit massive parallel or distributed behavior; algorithms with ample parallel or distributed behavior may have questionable numerical stability; highly non-uniform data distributions can cause catastrophic load imbalance
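The load-imbalance challenge can be made concrete with a simple ratio: the largest partition's size over the mean partition size, which bounds how long the slowest node works relative to the average. This metric and the partition sizes below are an illustrative sketch, not a measure defined in the source:

```python
def imbalance_factor(partition_sizes):
    """Load-imbalance factor: max partition size / mean partition size.
    1.0 means perfectly balanced; larger values mean the busiest node
    gates the whole parallel step."""
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean

# Roughly uniform data hashes into near-equal partitions:
print(imbalance_factor([250, 251, 249, 250]))  # close to 1.0
# Highly non-uniform data concentrates on one node:
print(imbalance_factor([940, 20, 20, 20]))     # far above 1.0
```

In the skewed case the parallel step runs nearly as long as a sequential one, since three of the four nodes sit idle while the overloaded node finishes.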