Dimensionality Reduction & PCA
Principal Component Analysis (PCA) 3.4
eigenvalues / eigenvectors
in general
a simple standard method for dimensionality reduction
Principal idea:
determine a low-dimensional linear subspace that contains the largest share of the variance
Approach: maximize the variance of the data after projection onto a vector v->
Method: determine the zero crossings of all partial derivatives of F(w->)
Total variance of the data distribution
iterative application to the perpendicular subspace
Procedure
Compute the estimated covariance matrix C^ of the data set X
Compute the eigenvalues lambdaj and eigenvectors u->j of C^
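A minimal NumPy sketch of this procedure (the data matrix X and all variable names are illustrative):
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # placeholder data set X, shape (N, d)

Xc = X - X.mean(axis=0)                # center the data
C = np.cov(Xc, rowvar=False)           # estimated covariance matrix C^
lam, U = np.linalg.eigh(C)             # eigenvalues lambdaj and eigenvectors u->j
order = np.argsort(lam)[::-1]          # sort by decreasing variance
lam, U = lam[order], U[:, order]

q = 2
Y = Xc @ U[:, :q]                      # projection onto the q leading eigenvectors
```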
additional remarks
a. each data vector x->a can be decomposed via the eigenvector basis (eigenvalue decomposition), where the coefficients yaj are given by the projections onto the eigenvectors (projection indices)
b. the yaj are centered (i. e. mean 0) and pairwise uncorrelated, and the eigenvalues lambdaj are the variances of the components:
c. the matrix C^ can be represented by its spectral decomposition C^ = sum_j lambdaj u->j (u->j)^T
Interpretation
the eigenvector decomposition describes each data point (vector) by a new parameter vector y->a (to be continued)
the y->a are obtained by a linear transformation from the x->a. However, the features are now pairwise uncorrelated. (yet not independent!!!)
the eigenvalues lambdaj equal the variance of the respective component yj
relevance for dimensionality reduction: using the q largest eigenvalues (i. e. projection onto the span of the corresponding eigenvectors) then minimizes the mean squared error (MSE) among all linear projections onto a q-dimensional linear subspace (Karhunen-Loeve expansion)
Dimensionality Reduction - in general 3.
Geometry of high-dimensional spaces 3.1
3.1.1 High-dimensional volume
the volume ratio decreases exponentially with increasing epsilon
in high-dimensional spaces, almost the complete volume is concentrated in a thin film at the surface
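A tiny sketch of this effect: the volume share of a ball that lies outside a surface shell of relative thickness epsilon is (1 - epsilon)^d, which vanishes exponentially with d (numbers are illustrative):
```python
eps = 0.05
for d in (1, 2, 10, 100, 1000):
    inner = (1 - eps) ** d             # volume ratio of the inner, shrunken ball
    print(d, 1 - inner)                # share of the volume in the thin surface shell
```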
Example: high-dimensional normal distribution
according to the density function, the highest density is around the mean
with increasing dimensionality, P0 (the probability mass between one standard deviation below and above the mean) decreases
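One way to see this numerically, interpreting P0 as the probability mass of a standard d-dimensional Gaussian within one standard deviation of the mean (so that ||x||^2 follows a χ² distribution with d degrees of freedom):
```python
from scipy.stats import chi2

for d in (1, 2, 5, 10, 20):
    P0 = chi2.cdf(1.0, df=d)           # P(||x|| <= 1) for a standard normal in d dimensions
    print(d, P0)
```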
3.1.2. Obtainable sampling density
a dense sampling is at best possible for d <= 5 .. 8
in high-dimensional spaces, two samples along each dimension are already a lot: 2^d
High-dimensional spaces are extremely unsuitable for exploration and for finding regularities
this phenomenon is called the curse of dimensionality
Embedding vs. Intrinsic Dimensionality 3.2
q-dimensional manifold in a d-dimensional embedding space
most of the d-dimensional data do not fill a d-dimensional volume at all!
the structure is a distorted part of Rq embedded in Rd
distinguish between manifolds and subspaces
embedding space dimensionality d
number of parameters that are used to describe an item in the data set
depends upon representation of data
intrinsic data dimensionality q
the minimal number of independently variable parameters that determine the structure
problem specific
summary
prevalence of a hidden structure often manifests as
q << d
goal of dimensionality reduction is to reach a new representation with d' dimensions, d' < d, which comes closer to q
model
(to be continued)
Fractal Dimension 3.3
Correlation Dimension dc
one way to define a dimensionality measure for a set M is to start from the statistic of distances between pairs of data points in a random set of N data points
Hausdorff Dimension dH
A different way to define a dimension measure is by counting the minimal number N(epsilon) of pairwise disjoint hypercubes of edge length epsilon that are required to obtain a complete covering of M
dC and dH provide orientation for the choice of parameters in machine learning, e. g. topology of a self-organizing map, etc.
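A hypothetical sketch of estimating dC from the pair-distance statistics (the data set, radii and sample size are made up; the slope of log C(r) vs. log r approximates dC):
```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
M = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 10))    # ~3-dimensional structure in R^10

dists = pdist(M)                                            # all pairwise distances
radii = np.logspace(np.log10(np.percentile(dists, 1)),
                    np.log10(np.percentile(dists, 20)), 10)
C = np.array([(dists < r).mean() for r in radii])           # correlation sum C(r)
slope, _ = np.polyfit(np.log(radii), np.log(C), 1)
print("estimated correlation dimension:", slope)            # close to 3 for this data
```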
3.5. PCA via Singular Value Decomposition (SVD)
allows an efficient computation of the PCA, so to speak: PCA in a single execution step
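Sketch (assuming a centered data matrix of shape (N, d)): the right singular vectors are the principal axes, and the eigenvalues follow from the singular values via lambdaj = sj^2 / (N - 1):
```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigenvalues = s**2 / (len(X) - 1)      # variances of the principal components
components = Vt                        # rows = principal axes u->j
scores = Xc @ Vt.T                     # projected data (equals U * s)
```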
3.6. Generalization: Principal Axis as Optimization problem
3.6.1 The Mean as Principal Point
3.6.2 Principal Axes
given
centered data (with mean vector (x->) = 0->)
wanted
axis that minimizes the squared distance to the data
Step 1: Minimize
Step 2: Minimize E with respect to w->
3.6.3. Principal Curve and Principal Surface
the generalization to nonlinear manifolds is considerably more difficult
wanted: curve f->: R -> Rd, f->(y), which minimizes the Mean Squared Error (MSE)
Problem 1: Parametrization of f->(y)
Problem 2: Limitation of the curve flexibility to avoid trivial interpolations
Problem 3: shape of the optimization landscape: Minimization necessary by all parameters
Principal Curve by Hastie and Stuetzle (1989)
was introduced as a curve which fulfills the so-called self-consistency condition
algorithm
1.
2.
3.
4.
3.6.4. Principal Surfaces in general
are the result of a generalization of the above principal curve optimization
Computational Procedure
Stochastic approximation
Statistics
2.1 Detection of systematic deviations
empirical means
Basic question: to evaluate whether two probability densities PA, PB are equal or different (on the basis of limited samples)
variance of a linear combination
standard deviation
Student-t-distribution
critical tc
one-sided test
two-sided test
cumulative distribution function
level of significance α
null hypothesis H0 and alternative hypothesis
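An illustrative two-sample Student-t test with SciPy (synthetic samples; the significance level is a free choice):
```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, size=50)      # sample A
b = rng.normal(0.3, 1.0, size=60)      # sample B with a shifted mean

t, p = ttest_ind(a, b)                 # two-sided test of equal means
alpha = 0.05                           # level of significance
print(t, p, "reject H0:", p < alpha)
```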
Linear Correlation 2.5
Geometric interpretation as cos (angle between vectors)
Simplest measure: Linear Correlation Coefficient r
given
2.5.1. Statistical Interpretation of r
Assumption A1: let x, y be normally distributed and N large
Assumption A2: under the (often valid) assumption of a two-dimensional Gaussian distribution
2.3. Tests to discriminate distributions
methods
for discrete distributions -> χ²-test
for continuous distributions -> Kolmogorov-Smirnov test
there are two variants with respect to the reference: a given distribution (expectation, norm, reference) or the distribution of another dataset (e. g. data under a different condition)
2.3.1 χ²-test
given
Case 1: Comparison of a dataset with a given distribution
Case 2: Comparison of two datasets
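Sketch of both cases with SciPy (the bin counts are made up; Case 2 is realized here by stacking the two count vectors into a contingency table):
```python
import numpy as np
from scipy.stats import chisquare, chi2_contingency

# Case 1: observed bin counts vs. a given reference distribution
observed = np.array([18, 25, 22, 35])
expected = np.array([25, 25, 25, 25])
stat1, p1 = chisquare(observed, f_exp=expected)

# Case 2: two datasets, compared via their binned counts
counts_a = np.array([18, 25, 22, 35])
counts_b = np.array([30, 20, 26, 24])
stat2, p2, dof, _ = chi2_contingency(np.vstack([counts_a, counts_b]))
print(p1, p2)
```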
2.3.2 Kolmogorov-Smirnov Test
given
goal
to check whether the distributions differ significantly
calculation of the empirical cumulative distribution
Alternative Statistics
derive the distribution of D under the assumption that H0 is true
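Sketch of both variants with SciPy (the reference distribution and the samples are illustrative):
```python
import numpy as np
from scipy.stats import kstest, ks_2samp, norm

rng = np.random.default_rng(5)
x = rng.normal(size=200)
y = rng.normal(loc=0.3, size=200)

D1, p1 = kstest(x, norm.cdf)           # sample x vs. a given N(0,1) reference distribution
D2, p2 = ks_2samp(x, y)                # two empirical distributions against each other
print(D1, p1, D2, p2)
```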
2.4. Detection of Dependencies between variables
Null hypothesis
the variables A and B are independent
Question:
if the null hypothesis can be rejected, how strong is the dependency?
2.4.1 Cramer's V
V = 1 <-- perfect association
V = 0 <-- no association at all
2.4.2 Contingency coefficient
C = 1 will never be reached
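Sketch: both association measures computed from the χ² statistic of an (invented) contingency table:
```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 10, 5],
                  [12, 28, 9]])        # counts of categories of A (rows) vs. B (columns)
chi2, p, dof, _ = chi2_contingency(table)

n = table.sum()
k = min(table.shape)                   # smaller number of categories
V = np.sqrt(chi2 / (n * (k - 1)))      # Cramer's V in [0, 1]
C = np.sqrt(chi2 / (chi2 + n))         # contingency coefficient (never reaches 1)
print(V, C, p)
```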
Identification of Outliers 2.7
2.7.1. Outlier Detection
Question: when do we have to regard a data point as an outlier?
z-Test
arbitrary threshold C
problem
many outliers falsify the estimation of mean_i, variance_i
remedy: replace mean and standard deviation by more robust estimators
More robust Outlier Detection:
arithmetic mean and standard deviation are not very outlier robust ways to identify the center and spread of given data
Rosner Test: iterate as long as the z-test delivers at least one outlier
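A small sketch of the z-test flagging and a median/MAD-based robust variant (the threshold C and the data are illustrative):
```python
import numpy as np

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(size=100), [8.0, -9.5]])   # data with two planted outliers

C = 3.0                                                   # arbitrary threshold
z = (x - x.mean()) / x.std()                              # classic z-scores
mad = np.median(np.abs(x - np.median(x)))
robust_z = 0.6745 * (x - np.median(x)) / mad              # robust z-scores via median and MAD

print("z-test outliers:      ", np.where(np.abs(z) > C)[0])
print("robust (MAD) outliers:", np.where(np.abs(robust_z) > C)[0])
```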
Additional Outlier Complications in multivariate data
problem
data points can be outliers despite the fact that they look perfectly fine with respect to each individual feature
2.7.2. Outlier Management
Question: what do we do if we detect an outlier?
Procedure
Marking
Correction
Removal of the component
Removal of the whole data vector
Outlier = Deviations from a model (ideal situation), measurement error
Nonparametric Correlation 2.6
2.6.2. Spearman's Rank Correlation Coefficient
Statistical Test
Null hypothesis H0: x and y are linearly uncorrelated
Alternative Correlation Measures
2.6.3. Kendall's τ (tau)
Statistical Test
Null hypothesis H0: x and y are uncorrelated
concordant
discordant
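Sketch: both nonparametric measures and their tests with SciPy (synthetic, monotonic but nonlinear data):
```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

rng = np.random.default_rng(7)
x = rng.normal(size=80)
y = x**3 + rng.normal(scale=0.5, size=80)

rho, p_rho = spearmanr(x, y)           # Spearman's rank correlation
tau, p_tau = kendalltau(x, y)          # Kendall's tau (concordant vs. discordant pairs)
print(rho, p_rho, tau, p_tau)
```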
2.2 Excursion: χ²- and Student-t-distribution
χ²-distribution
distribution of a sum of squares of independent normally distributed random variables
Student-t-distribution
with ν degrees of freedom
Clustering
Optimization methods for Clustering
general
Idea: derive clustering from an optimization (minimization) of a suitable cost function for clustering into K clusters
Success depends on the definition of a global optimality criterion
The usually applied criteria are derived from the covariance matrix of the data (resp. of the cluster or cluster centers)
Global data covariance matrix
Local covariance matrix (of any cluster i)
Mean covariance matrix of all clusters
Covariance matrix of the cluster centers
Quality criteria (resp. cost functions)
Minimization of trace(W)
Minimization of the determinant det(W)
Maximization of trace(B W^-1)
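Sketch of the three criteria for a labelled data set (here W is taken as the mean within-cluster covariance and B as the covariance of the cluster centers; these naming assumptions and the data are illustrative):
```python
import numpy as np

def scatter_criteria(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    W = np.mean([np.cov(c, rowvar=False) for c in clusters], axis=0)   # mean within-cluster covariance
    centers = np.array([c.mean(axis=0) for c in clusters])
    B = np.cov(centers, rowvar=False)                                  # covariance of the cluster centers
    return np.trace(W), np.linalg.det(W), np.trace(B @ np.linalg.inv(W))

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in ([0, 0], [3, 0], [0, 3])])
labels = np.repeat([0, 1, 2], 50)
print(scatter_criteria(X, labels))
```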
4.4.1 Simulated Annealing for Clustering
Basic Principle
Initialization: random partitioning of the data set into initial clusters (e. g. heuristically)
compute the cost change ΔE for each possible relabeling of an individual data element xi into a randomly selected cluster Cj
if ΔE < 0, or otherwise with probability exp(-ΔE/T), execute the according relabeling
decrease T slightly, e. g. by setting T = a * T with a close to, but smaller than, 1
Termination Condition: after a fixed number of steps (the more the better)
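A minimal sketch of this loop, assuming a trace(W)-style cost (sum of squared distances to the cluster means) and the usual Metropolis acceptance rule; all parameter values are illustrative:
```python
import numpy as np

def sa_clustering(X, K, T=1.0, alpha=0.999, steps=4000, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))                # random initial partitioning

    def cost(lab):                                       # sum of squared distances to cluster means
        return sum(((X[lab == k] - X[lab == k].mean(axis=0))**2).sum()
                   for k in range(K) if np.any(lab == k))

    E = cost(labels)
    for _ in range(steps):
        i = rng.integers(len(X))                         # pick a data element x_i
        new = labels.copy()
        new[i] = rng.integers(K)                         # relabel into a randomly selected cluster C_j
        dE = cost(new) - E
        if dE < 0 or rng.random() < np.exp(-dE / T):     # accept improvements, sometimes worsenings
            labels, E = new, E + dE
        T = max(T * alpha, 1e-6)                         # slowly decrease the temperature
    return labels

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in ([0, 0], [2, 2])])
print(sa_clustering(X, K=2))
```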
Basic Idea
4.2 Distance Functions
starting point: distance matrix
requirements for a distance measure
symmetry
positive definite
triangle inequality
4.2.1 Distance measures for data points
Distance measures for real-valued data vectors are
Euclidean Distance
Pearson or χ²-Distance
Mahalanobis Distance
'City Block' Distance
Supremum Distance
Minkowski Distance
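Sketch of these measures with scipy.spatial.distance (the Pearson/χ² distance is written out by hand, normalizing by per-feature variances estimated from an illustrative sample):
```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])
X = np.random.default_rng(10).normal(size=(100, 3))       # sample for variances / covariance

print(distance.euclidean(x, y))
print(np.sqrt(np.sum((x - y)**2 / X.var(axis=0))))        # Pearson / chi^2 distance
print(distance.mahalanobis(x, y, np.linalg.inv(np.cov(X, rowvar=False))))
print(distance.cityblock(x, y))                           # 'city block' / L1 distance
print(distance.chebyshev(x, y))                           # supremum distance
print(distance.minkowski(x, y, p=3))
```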
4.2.2 Distance measures between clusters
minimal distance
maximal distance
average distance
centroid distance
4.5 Prototype-based Clustering
Goal: Partitioning by assignment of N data points to K < N typical prototypes (representing the clusters)
two assignments
Deterministic Assignment
each data point is exactly assigned to a single prototype (hard clustering)
Probabilistic Assignment
each data point xi-> is assigned with probability hi,j to the j-th prototype vector (soft clustering)
4.5.1 Hard Clustering by Vector Quantization
idea: each data point xi-> is assigned to one of the K prototypes {wj->}j=1..K via a label in {1,...,K}
goal: minimization of the average quantization error E by varying the wj->
Suitable Error function
[to be continued]
the following algorithm allows optimization of both assignment variables and prototype locations
Initialization of prototypes
While keeping prototypes fixed, choose assignments hij so that E is minimized (Voronoi cells, winner-takes-all rule)
While keeping assignments fixed, choose prototypes so that E is minimized
If the error decrease |ΔE| < ε: stop, else goto 2
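A minimal NumPy sketch of this alternating optimization (i. e. the standard k-means / Lloyd iteration); K, epsilon and the data are illustrative:
```python
import numpy as np

def vector_quantization(X, K, eps=1e-6, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), K, replace=False)]            # 1. initialize prototypes
    E_old = np.inf
    for _ in range(max_iter):
        d = ((X[:, None, :] - W[None, :, :])**2).sum(-1)   # squared distances to all prototypes
        labels = d.argmin(axis=1)                          # 2. winner-takes-all assignment
        E = d[np.arange(len(X)), labels].mean()            # average quantization error
        if abs(E_old - E) < eps:                           # 4. stop once the decrease is tiny
            break
        E_old = E
        for j in range(K):                                 # 3. prototypes = means of their cells
            if np.any(labels == j):
                W[j] = X[labels == j].mean(axis=0)
    return W, labels

rng = np.random.default_rng(11)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in ([0, 0], [2, 0], [1, 2])])
W, labels = vector_quantization(X, K=3)
```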
4.5.2 Soft clustering with Mixture Models
goal:
implement the possibility to represent uncertainty in the assignment of data to clusters (uncertainty will be expressed in form of continuous assignment probabilities)
approach
description of the data distribution by a probability density p(x->)
instead of a disjoint partitioning into 'hard' clusters -> mixture model
interpretation
each summand describes a soft cluster Cj with centroid μj-> and a spatial extension which is determined by the covariance matrix Σj
[to be continued]
question
what parameters are the most probable ones according to the information given in the data set D?
EM algorithm (Expectation-Maximization algorithm)
the trick is the application of Jensen's inequality
Initialization: starting values (t = 0)
E-Step: Computation of new membership probabilities
M-Step: Computation of new estimates for the cluster centers as well as for the cluster covariances
updated probability mass in cluster k
updated centroid of cluster k
updated covariance of cluster k
Loop Condition: if the maximal number of iterations has not yet been reached: goto 2
local maximization of L can be achieved by the following heuristically motivated iterative method
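A compact sketch of the E- and M-steps for a Gaussian mixture (initialization, the small ridge term and the data are illustrative; log-likelihood monitoring is omitted for brevity):
```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)                              # probability mass per cluster
    mu = X[rng.choice(N, K, replace=False)]               # initial centroids
    Sigma = np.array([np.cov(X, rowvar=False) for _ in range(K)])

    for _ in range(n_iter):
        # E-step: membership probabilities h_ik
        p = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                             for k in range(K)])
        h = p / p.sum(axis=1, keepdims=True)

        # M-step: updated mass, centroid and covariance of each cluster
        Nk = h.sum(axis=0)
        pi = Nk / N
        mu = (h.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (h[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, h

rng = np.random.default_rng(12)
X = np.vstack([rng.normal([0, 0], 0.5, size=(80, 2)),
               rng.normal([3, 1], 0.7, size=(80, 2))])
pi, mu, Sigma, h = em_gmm(X, K=2)
```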
4.3 Hierarchical Clustering
4.3.1 Agglomerative Clustering
procedure
[to be continued]
termination conditions
number of clusters
error
distance
1. Single Linkage Clustering (SLC)
SLC is inclined to form 'strings of clusters'
2. Complete Linkage Clustering (CLC)
CLC has the tendency to result in compact sphere-shaped clusters
3. Average linkage clustering
well balanced, between CLC and SLC
Centroid Linkage Clustering
each cluster is represented by the mean of its feature vectors (its centroid)
the computation of means requires real-valued variables
attention: when merging two clusters, the center-of-mass of the resulting cluster is dominated by the bigger cluster
Ward's Linkage Clustering
with each step, a pair of clusters Ci, Cj is merged that increases the mean standard deviation of the data around the cluster centers the least
this favors the formation of spherical clusters
leads to a treelike division of clusters. There are two opposite procedures
divisive clustering: start with a single cluster that contains all data, stepwise division into smaller clusters (top-down)
agglomerative clustering: initialize all data points as 1-element clusters, stepwise merging of clusters (bottom-up)
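Sketch: agglomerative clustering with the different linkage variants via SciPy; the resulting merge tree can be cut e. g. by a target number of clusters (data invented):
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(13)
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in ([0, 0], [2, 2], [0, 3])])

for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)                      # bottom-up merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")    # termination: number of clusters
    print(method, np.bincount(labels)[1:])             # resulting cluster sizes
```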
4.3.2 Recursive Distance Measures
single linkage
complete linkage
centroid linkage clustering
average linkage clustering
Ward's linkage Clustering
4.6. Evaluation of Clustering results
questions
how many clusters exist?
how unique is the found clustering?
4.6.1 The number of clusters
basic idea
compare the cluster quality for a given number of clusters c
choose the smallest number of clusters with acceptable quality
-> this shifts the problem towards the definition of a suitable cluster quality measure
Gap-Statistic
compare the logarithmized intra-cluster variance for c clusters with its mean value over a set of M enforced clusterings of uniformly distributed random data sets into likewise c clusters
further, simpler proposals
approach by Calinski & Harabasz
alternative approach by Hartigan (1975)
4.6.2. Uniqueness of Clustering results
basic idea: compare results of repeated clustering runs under slight variations
measure for comparison for two clusterings C1 and C2 (i. e. complete partitionings of data into clusters)
Rand-Index
(to be continued)
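A small pair-counting sketch of the Rand index for two complete partitionings C1, C2 (the label vectors are illustrative; a value of 1 means both clusterings group every pair of points identically):
```python
import numpy as np
from itertools import combinations

def rand_index(c1, c2):
    # count pairs on whose grouping (same cluster vs. different clusters) both clusterings agree
    agree = sum((c1[i] == c1[j]) == (c2[i] == c2[j])
                for i, j in combinations(range(len(c1)), 2))
    return agree / (len(c1) * (len(c1) - 1) / 2)

c1 = np.array([0, 0, 0, 1, 1, 2, 2])
c2 = np.array([1, 1, 0, 0, 0, 2, 2])
print(rand_index(c1, c2))
```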
Clustering Methods
4.1 Relevance for Data Mining
Clusters are the simplest form of structure
prevalence of clusters suggests relatedness / affinity and presence of correlations
multidimensional clusters can be interpreted as a simple form of rule
cluster centers offer an economic description of the data (data reduction)
Data visualization
5.2 Elementary Visualization techniques
glyphs
visualization of several data dimensions with the help of geometric attributes or color
histograms
are nothing but special glyphs, with a standard mapping
usually: bin lengths = frequency of occurrence of points within the discretization bin
alternative use: representing the value xk of a single data point's feature k as the k-th bin
Chernoff-Faces
mapping of Data features to Object properties
historically introduced with faces as parameterized objects that have dedicated properties
parallel coordinates
representation of a data point as a polyline between parallel arranged coordinate axes
a data set becomes a bunch of lines (this helps to quickly recognize common trails, corresponding to clusters)
the color of the lines can be used to mark certain classes, so this can also be used for clustering
color coding
at most 3 dimensions can be mapped onto color axes
depending on the color space, the dimensions can be R(ed), G(reen), B(lue), H(ue), S(aturation), I(ntensity) etc.
can be combined with other methods, e. g. scatter plots or glyphs
box plots
summary of probability distributions via structured glyphs
often presented: minimum, 10%-quantile, σ-interval, median, mean, 90%-quantile, maximum; the depiction style may vary, i. e. plots require a legend
dimensional stacking
definition of a dimension order (slow to fast) (e. g. x1, x2, x3, x4)
discretize in nk intervals along axes k, k = 1,...,d-2
graphical representation is recursively defined
brushing
geometrically defined data subsets are highlighted
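A bare-bones matplotlib sketch of the parallel-coordinates idea above (each data point becomes a polyline over min-max normalized axes, line color marks the class; the data is synthetic):
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(14)
X = np.vstack([rng.normal(0, 1, size=(30, 4)), rng.normal(2, 1, size=(30, 4))])
classes = np.repeat([0, 1], 30)

Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))   # normalize each axis to [0, 1]
for row, c in zip(Xn, classes):
    plt.plot(range(X.shape[1]), row, color=("tab:blue", "tab:orange")[c], alpha=0.5)
plt.xticks(range(X.shape[1]), [f"x{k+1}" for k in range(X.shape[1])])
plt.show()
```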
5.3 Multidimensional Scaling
goal
construction of a low-dimensional 'projection' RD -> Rd which tries to maintain distances as well as possible ('abstandstreu', i. e. distance-preserving)
to maximally preserve distances
given
given data points
D = dimension of the data space
wanted: low-dimensional image
d = dimension of the projection space
5.3.1 Sammon-Mapping
results from the minimization of a cost function E for distance distortions
distance in the high-dimensional space
distance in the low-dimensional space
approach for a cost function E
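A sketch that minimizes Sammon's cost function directly with a generic optimizer instead of a hand-derived gradient iteration (data, target dimension and optimizer choice are illustrative):
```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist

rng = np.random.default_rng(15)
X = rng.normal(size=(60, 10))                 # high-dimensional data in R^D
D = pdist(X)                                  # distances D_ij in the data space
d_target = 2

def stress(flat):
    d = pdist(flat.reshape(-1, d_target))     # distances d_ij in the projection space
    return np.sum((D - d)**2 / D) / D.sum()   # Sammon's cost function E

y0 = rng.normal(scale=1e-2, size=len(X) * d_target)   # random initial image points
res = minimize(stress, y0, method="L-BFGS-B")
Y = res.x.reshape(-1, d_target)               # low-dimensional image of the data
```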
5.3.2 Sammon-Mapping with a target table of coordinates
the target points are here simply provided via a look-up table
cost function [to be continued]
5.3.2 Variant: Non-metric Multidimensional Scaling
idea
instead of a mapping that preserves the distances themselves, we aim at the best possible preservation of the distance rank order
advantage: higher robustness w.r.t. the choice of the distance metric D
5.1 Motivation
goal
Transformation of high-dimensional data spaces (which are difficult to depict) into our familiar perceptual spaces
visual: visualization
auditory: sonification
tactile: haptualization/ tactile interfaces
reasons/ motivation
a comprehensive automation of structure discovery is unrealistic in the foreseeable future
the pattern recognition capabilities of the human eye / visual system (ditto: ear / auditory system) are so far not reached by any technical system
Data Mining can roughly be divided into the steps
Exploratory Data Analysis
Confirmatory Data Analysis
before we can confirm a structure we first need to know what we are looking for
Visualization (sonification etc.) supports the hypothesis generation within (1.)
the visual system is not only a tool, but also a model: it somehow 'compresses' or condenses subsymbolic light patterns into robustly emerging objects that are recognized and remembered and allows their characterization in terms of material, position, grouping etc.
5.4 Further Projection Methods: Projection Pursuit
5.4.1 Projection Pursuit
a simple way to find a low-dimensional projection is PCA
a key disadvantage is that PCA is only variance based: no other structural property is respected
Generalization of the principle
'to maximize a criterion for well-structuredness'
yields other (maybe) more meaningful projections
the central element is the structure quality measure ('Strukturbewertungsmaß')
many structures manifest in the occurrence of local clustering
the seemingly 'more structured' plot is characterized by
a larger number of short data point pair distances
while having the same global variance as the other plot
a method that is only sensitive to variance can't assess any difference between these plots
generalization to 2D-projections
method 1: sequential application of the 1D method
method 2: direct minimization of an analogously defined structure quality measure for two directions k, l
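A naive 1D sketch of the search idea: draw random unit directions and keep the one with the best structure score; the score used here (share of short pairwise distances at fixed projection variance) is only a toy stand-in for a proper structure quality measure:
```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(16)
X = rng.normal(size=(200, 5))
X[:100, 0] += 3.0                             # hidden two-cluster structure along one axis
Xc = X - X.mean(axis=0)

def structure_score(p):
    p = (p - p.mean()) / p.std()              # fix the variance of the 1D projection
    return np.mean(pdist(p[:, None]) < 0.5)   # share of short data point pair distances

best = max((rng.normal(size=X.shape[1]) for _ in range(500)),
           key=lambda w: structure_score(Xc @ (w / np.linalg.norm(w))))
print("best direction:", best / np.linalg.norm(best))   # close to the first coordinate axis here
```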