Clustering (K-means (Limitations (Works well for simple clusters that are…

- - - - n_init
        
        Reruns the algorithm with n different initialisations and returns the best output (the run with the lowest within-cluster sum of squares)
      - method
        
        By setting the initialisation method (the init parameter in sklearn) to 'k-means++' (the default), the initial centres are selected smartly (i.e. better than random). This has the additional benefit of decreasing runtime (fewer steps to reach convergence).
  - - - Its average complexity is O(knT), where k, n and T are the number of clusters, samples and iterations, respectively.
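The two parameters above can be tried out with a short sketch. The synthetic data from make_blobs, the choice of 3 clusters and the random_state values are illustrative assumptions, not part of the original notes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (assumption): 300 points around 3 well-separated centres.
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [6, 6], [0, 6]], cluster_std=0.6, random_state=0
)

# init="k-means++" picks spread-out starting centres; n_init=10 reruns the
# algorithm from 10 initialisations and keeps the run with the lowest inertia.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)

# inertia_ is the within-cluster sum of squares of the best run;
# recomputing it by hand shows what is being minimised.
wcss = ((X - km.cluster_centers_[labels]) ** 2).sum()
print(km.inertia_, wcss, km.n_iter_)
```

Switching to init="random" with n_init=1 is a simple way to see how much a single unlucky initialisation can change the result.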
- - - - Algorithm
        
        A sequence of merges is performed: at each stage the two most similar clusters are merged into a new cluster. This process is repeated until some stopping condition is met
        
        In sklearn the stopping condition is usually the target number of clusters (n_clusters); a distance_threshold can be given instead
        
        You can choose how to determine the most similar clusters by specifying one of several possible linkage criteria
        
        Each point starts in its own cluster of one item
        
        Works as an iterative bottom-up approach
      - Linkage criteria
        
        Ward's method
        
        Least increase in total variance (around cluster centroids)
        
        Works well on most datasets and is the usual method of choice
        
        Average linkage
        
        Average pairwise distance between the points of two clusters
        
        Complete linkage
        
        Maximum distance between the points of two clusters
        
        If you expect the sizes of the clusters to be very different, it's worth trying the average and complete linkage criteria as well
      - Hierarchy
        
        The algorithm automatically arranges the data into a hierarchy as a side effect, reflecting the order in which, and the cluster distance at which, each data point is assigned to successive clusters
        
        This hierarchy can be usefully visualized with a dendrogram, which works even with high-dimensional data
        
        Note that you can tell how far apart the merged clusters are by the length of each branch in the tree. This property of a dendrogram can help us figure out the right number of clusters.
        
        In general we want clusters whose items are highly similar to each other but far apart from other clusters. The longer the branch along the Y axis, the greater the distance between the two clusters that were merged
        
        Sklearn doesn't provide the ability to plot dendrograms, but scipy does. Scipy handles clustering a bit differently
        
        Making use of this hierarchy is typically most useful when the underlying data itself follows some kind of hierarchical process, so the tree is easily interpreted. E.g. genetic and other biological data, where the levels represent stages of mutation or evolution
      - Agglomerative clustering comes in somewhat better than divisive clustering, at O(n^2 log(n)) (though special cases of O(n^2) are available for single and complete (maximum) linkage agglomerative clustering)
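The agglomerative notes above can be sketched in both libraries. The toy data, the 3-cluster cut and the variable names are assumptions for illustration. On the sklearn side the sketch stops merging at n_clusters and swaps the linkage criterion; on the scipy side it keeps the full merge history, which is what a dendrogram is drawn from.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Toy data (assumption): 60 points around 3 centres.
X, _ = make_blobs(
    n_samples=60, centers=[[0, 0], [5, 5], [0, 5]], cluster_std=0.5, random_state=1
)

# sklearn: merging stops once n_clusters remain; linkage picks the criterion.
sk_labels = {}
for method in ("ward", "average", "complete"):
    model = AgglomerativeClustering(n_clusters=3, linkage=method)
    sk_labels[method] = model.fit_predict(X)

# scipy: linkage() returns the whole hierarchy, one row per merge:
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster].
Z = linkage(X, method="ward")

# no_plot=True only computes the tree layout; drop it (with matplotlib
# installed) to draw the dendrogram, where branch length is merge distance.
tree = dendrogram(Z, no_plot=True)

# Cut the tree into at most 3 flat clusters (scipy labels start at 1).
scipy_labels = fcluster(Z, t=3, criterion="maxclust")
print(Z[-3:, 2])  # distances of the last three merges
```

A large jump in the last few merge distances is the numeric counterpart of a long dendrogram branch, and it suggests cutting the tree just before that merge.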
    - - Starts with the entire dataset as one cluster, which is iteratively split, one point at a time, until each point forms its own cluster.
      - Divisive clustering with exhaustive search is O(2^n)
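One way to see where an exponential cost can come from, assuming the O(2^n) figure refers to exhaustively searching for the best split: a cluster of n points can be divided into two non-empty groups in 2^(n-1) - 1 ways. The helper below is hypothetical and counts them by brute force for small n.

```python
from itertools import combinations


def count_bipartitions(n):
    """Count the ways to split n labelled points into two non-empty groups."""
    points = range(n)
    seen = set()
    for size in range(1, n):
        for group in combinations(points, size):
            rest = tuple(p for p in points if p not in group)
            # A split {group, rest} is unordered, so store it as a frozenset.
            seen.add(frozenset((group, rest)))
    return len(seen)


for n in range(2, 11):
    # Brute-force count matches the closed form 2^(n-1) - 1.
    assert count_bipartitions(n) == 2 ** (n - 1) - 1

print(count_bipartitions(4))  # 7
```

Each extra point roughly doubles the number of candidate splits, which is why practical divisive methods use heuristics rather than an exhaustive search.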