Dimensionality Reduction & PCA
Principal Component Analysis (PCA) 3.4
eigenvalues / eigenvectors
in general
a simple standard method for dimensionality reduction
Principal idea:
determine a low-dimensional linear subspace that contains the largest share of the variance
Approach: maximize the variance of the data after projection onto a vector v->
Method: determine the zero crossings of all partial derivatives of F(w->)
Total variance of the data distribution
iterative application to the perpendicular subspace
Procedure
Compute the estimated covariance matrix C^ of the data set X
Compute the eigenvalues lambdaj and eigenvectors u->j of C^
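A minimal NumPy sketch of this procedure (the data matrix X and all variable names are illustrative):
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # placeholder data set X, shape (N, d)

Xc = X - X.mean(axis=0)                # center the data
C = np.cov(Xc, rowvar=False)           # estimated covariance matrix C^
lam, U = np.linalg.eigh(C)             # eigenvalues lambdaj and eigenvectors u->j
order = np.argsort(lam)[::-1]          # sort by decreasing variance
lam, U = lam[order], U[:, order]

q = 2
Y = Xc @ U[:, :q]                      # projection onto the q leading eigenvectors
```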
additional remarks
a. each data vector x->a can be decomposed via the eigenvector basis (eigenvalue decomposition), where the coefficients yaj are given by the projections onto the eigenvectors (projection indices)
b. the yaj are centered (i. e. mean 0) and pairwise uncorrelated, and the eigenvalues lambdaj are the variances of the components:
c. the matrix C^ can be represented by its spectral decomposition C^ = sum_j lambdaj u->j (u->j)^T
Interpretation
the eigenvector decomposition describes each data point (vector) by a new parameter vector y->a (to be continued)
the y->a are obtained by a linear transformation from the x->a. However, the features are now pairwise uncorrelated. (yet not independent!!!)
the eigenvalues lambdaj equal the variance of the respective component yj
relevance for dimensionality reduction: using the q largest eigenvalues (i. e. projection onto the span of the corresponding eigenvectors) then minimizes the mean squared error (MSE) among all linear projections onto a q-dimensional linear subspace (Karhunen-Loeve expansion)
Dimensionality Reduction - in general 3.
Geometry of high-dimensional spaces 3.1
3.1.1 High-dimensional volume
the volume ratio decreases exponentially with increasing epsilon
in high-dimensional spaces, almost the complete volume is concentrated in a thin film at the surface
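A tiny sketch of this effect: the volume share of a ball that lies outside a surface shell of relative thickness epsilon is (1 - epsilon)^d, which vanishes exponentially with d (numbers are illustrative):
```python
eps = 0.05
for d in (1, 2, 10, 100, 1000):
    inner = (1 - eps) ** d             # volume ratio of the inner, shrunken ball
    print(d, 1 - inner)                # share of the volume in the thin surface shell
```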
Example: high-dimensional normal distribution
according to the density function, the highest density is around the mean
with increasing dimensionality, P0 (the probability mass between one standard deviation below and above the mean) decreases
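One way to see this numerically, interpreting P0 as the probability mass of a standard d-dimensional Gaussian within one standard deviation of the mean (so that ||x||^2 follows a χ² distribution with d degrees of freedom):
```python
from scipy.stats import chi2

for d in (1, 2, 5, 10, 20):
    P0 = chi2.cdf(1.0, df=d)           # P(||x|| <= 1) for a standard normal in d dimensions
    print(d, P0)
```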
3.1.2. Obtainable sampling density
a dense sampling is at best possible for d <= 5 .. 8
in high-dimensional spaces, two samples along each dimension are already a lot: 2^d
High-dimensional spaces are extremely unsuitable for exploration and for finding regularities
this phenomenon is called the curse of dimensionality
Embedding vs. Intrinsic Dimensionality 3.2
q-dimensional manifold in a d-dimensional embedding space
most of the d-dimensional data do not fill a d-dimensional volume at all!
the structure is a distorted part of Rq embedded in Rd
distinguish between manifolds and subspaces
embedding space dimensionality d
number of parameters that are used to describe an item in the data set
depends upon representation of data
intrinsic data dimensionality q
the minimal number of independently variable parameters that determine the structure
problem specific
summary
prevalence of a hidden structure often manifests as
q << d
goal of dimensionality reduction is to reach a new representation with d' dimensions, d' < d, which comes closer to q
model
(to be continued)
Fractal Dimension 3.3
Correlation Dimension dc
one way to define a dimensionality measure for a set M is to start from the statistic of distances between pairs of data points in a random set of N data points
Hausdorff Dimension dH
A different way to define a dimension measure is by counting the minimal number N(epsilon) of pairwise disjoint hypercubes of edge length epsilon that are required to obtain a complete covering of M
dC and dH provide orientation for the choice of parameters in machine learning, e. g. topology of a self-organizing map, etc.
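A hypothetical sketch of estimating dC from the pair-distance statistics (the data set, radii and sample size are made up; the slope of log C(r) vs. log r approximates dC):
```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
M = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 10))    # ~3-dimensional structure in R^10

dists = pdist(M)                                            # all pairwise distances
radii = np.logspace(np.log10(np.percentile(dists, 1)),
                    np.log10(np.percentile(dists, 20)), 10)
C = np.array([(dists < r).mean() for r in radii])           # correlation sum C(r)
slope, _ = np.polyfit(np.log(radii), np.log(C), 1)
print("estimated correlation dimension:", slope)            # close to 3 for this data
```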
3.5. PCA via Singular Value Decomposition (SVD)
allows an efficient computation of the PCA, so to speak: PCA in a single execution step
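Sketch (assuming a centered data matrix of shape (N, d)): the right singular vectors are the principal axes, and the eigenvalues follow from the singular values via lambdaj = sj^2 / (N - 1):
```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigenvalues = s**2 / (len(X) - 1)      # variances of the principal components
components = Vt                        # rows = principal axes u->j
scores = Xc @ Vt.T                     # projected data (equals U * s)
```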
3.6. Generalization: Principal Axis as Optimization problem
3.6.1 The Mean as Principal Point
3.6.2 Principal Axes
given
centered data (with mean vector (x->) = 0->)
wanted
axis that minimizes the squared distance to the data
Step 1: Minimize
Step 2: Minimize E with respect to w->
3.6.3. Principal Curve and Principal Surface
the generalization to nonlinear manifolds is considerably more difficult
wanted: curve f->: R -> Rd, f->(y), which minimizes the Mean Squared Error (MSE)
Problem 1: Parametrization of f->(y)
Problem 2: Limitation of the curve flexibility to avoid trivial interpolations
Problem 3: shape of the optimization landscape: Minimization necessary by all parameters
Principal Curve by Hastie and Stuetzle (1989)
was introduced as a curve which fulfills the so-called self-consistency condition
algorithm
1.
2.
3.
4.
3.6.4. Principal Surfaces in general
are the result of a generalization of the above principal curve optimization
Computational Procedure
Stochastic approximation
Statistics
2.1 Detection of systematic deviations
empirical means
Basic question: to evaluate whether two probability densities PA, PB are equal or different (on the basis of limited samples)
variance of a linear combination
standard deviation
Student-t-distribution
critical tc
one-sided test
two-sided test
cumulative distribution function
level of significance α
null hypothesis H0 and alternative hypothesis
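An illustrative two-sample Student-t test with SciPy (synthetic samples; the significance level is a free choice):
```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, size=50)      # sample A
b = rng.normal(0.3, 1.0, size=60)      # sample B with a shifted mean

t, p = ttest_ind(a, b)                 # two-sided test of equal means
alpha = 0.05                           # level of significance
print(t, p, "reject H0:", p < alpha)
```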
Linear Correlation 2.5
Geometric interpretation as cos (angle between vectors)
Simplest measure: Linear Correlation Coefficient r
given
2.5.1. Statistical Interpretation of r
Assumption A1: let x, y be normally distributed and N large
Assumption A2: under the (often valid) assumption of a two-dimensional Gaussian distribution
2.3. Tests to discriminate distributions
methods
for discrete distributions -> χ²-test
for continuous distributions -> Kolmogorov-Smirnov test
there are two variants with respect to the reference: a given distribution (expectation, norm, reference) or the distribution of another dataset (e. g. data under a different condition)
2.3.1 χ²-test
given
Case 1: Comparison of a dataset with a given distribution
Case 2: Comparison of two datasets
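Sketch of both cases with SciPy (the bin counts are made up; Case 2 is realized here by stacking the two count vectors into a contingency table):
```python
import numpy as np
from scipy.stats import chisquare, chi2_contingency

# Case 1: observed bin counts vs. a given reference distribution
observed = np.array([18, 25, 22, 35])
expected = np.array([25, 25, 25, 25])
stat1, p1 = chisquare(observed, f_exp=expected)

# Case 2: two datasets, compared via their binned counts
counts_a = np.array([18, 25, 22, 35])
counts_b = np.array([30, 20, 26, 24])
stat2, p2, dof, _ = chi2_contingency(np.vstack([counts_a, counts_b]))
print(p1, p2)
```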
2.3.2 Kolmogorov-Smirnov Test
given
goal
to check whether the distributions differ significantly
calculation of the empirical cumulative distribution
Alternative Statistics
derive the distribution of D under the assumption that H0 is true
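Sketch of both variants with SciPy (the reference distribution and the samples are illustrative):
```python
import numpy as np
from scipy.stats import kstest, ks_2samp, norm

rng = np.random.default_rng(5)
x = rng.normal(size=200)
y = rng.normal(loc=0.3, size=200)

D1, p1 = kstest(x, norm.cdf)           # sample x vs. a given N(0,1) reference distribution
D2, p2 = ks_2samp(x, y)                # two empirical distributions against each other
print(D1, p1, D2, p2)
```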
2.4. Detection of Dependencies between variables
Null hypothesis
the variables A and B are independent
Question:
if the null hypothesis can be rejected, how strong is the dependency?
2.4.1 Cramer's V
V = 1 <-- perfect association
V = 0 <-- no association at all
2.4.2 Contingency coefficient
C = 1 will never be reached
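Sketch: both association measures computed from the χ² statistic of an (invented) contingency table:
```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 10, 5],
                  [12, 28, 9]])        # counts of categories of A (rows) vs. B (columns)
chi2, p, dof, _ = chi2_contingency(table)

n = table.sum()
k = min(table.shape)                   # smaller number of categories
V = np.sqrt(chi2 / (n * (k - 1)))      # Cramer's V in [0, 1]
C = np.sqrt(chi2 / (chi2 + n))         # contingency coefficient (never reaches 1)
print(V, C, p)
```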
Identification of Outliers 2.7
2.7.1. Outlier Detection
Question: when do we have to regard a data point as an outlier?
z-Test
arbitrary threshold C
problem
many outliers falsify the estimation of mean_i, variance_i
remedy: replace mean and standard deviation by more robust estimators
More robust Outlier Detection:
arithmetic mean and standard deviation are not very outlier robust ways to identify the center and spread of given data
Rosner Test: iterate as long as the z-test delivers at least one outlier
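A small sketch of the z-test flagging and a median/MAD-based robust variant (the threshold C and the data are illustrative):
```python
import numpy as np

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(size=100), [8.0, -9.5]])   # data with two planted outliers

C = 3.0                                                   # arbitrary threshold
z = (x - x.mean()) / x.std()                              # classic z-scores
mad = np.median(np.abs(x - np.median(x)))
robust_z = 0.6745 * (x - np.median(x)) / mad              # robust z-scores via median and MAD

print("z-test outliers:      ", np.where(np.abs(z) > C)[0])
print("robust (MAD) outliers:", np.where(np.abs(robust_z) > C)[0])
```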
Additional Outlier Complications in multivariate data
problem
data points can be outliers despite the fact that they look perfectly fine with respect to each individual feature
2.7.2. Outlier Management
Question: what do we do if we detect an outlier?
Procedure
Marking
Correction
Removal of the component
Removal of the whole data vector
Outlier = Deviations from a model (ideal situation), measurement error
Nonparametric Correlation 2.6
2.6.2. Spearman's Rank Correlation Coefficient
Statistical Test
Null hypothesis H0: x and y are linearly uncorrelated
Alternative Correlation Measures
2.6.3. Kendall's τ (tau)
Statistical Test
Null hypothesis H0: x and y are uncorrelated
concordant
discordant
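Sketch: both nonparametric measures and their tests with SciPy (synthetic, monotonic but nonlinear data):
```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

rng = np.random.default_rng(7)
x = rng.normal(size=80)
y = x**3 + rng.normal(scale=0.5, size=80)

rho, p_rho = spearmanr(x, y)           # Spearman's rank correlation
tau, p_tau = kendalltau(x, y)          # Kendall's tau (concordant vs. discordant pairs)
print(rho, p_rho, tau, p_tau)
```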
2.2 Excursion: χ²- and Student-t-distribution
χ²-distribution
distribution of a sum of squares of independent normally distributed random variables
Student-t-distribution
with ν degrees of freedom
Clustering
Optimization methods for Clustering
general
Idea: derive clustering from an optimization (minimization) of a suitable cost function for clustering into K clusters
Success depends on the definition of a global optimality criterion
The usually applied criteria are derived from the covariance matrix of the data (resp. of the cluster or cluster centers)
Global data covariance matrix
Local covariance matrix (of any cluster i)
Mean covariance matrix of all clusters
Covariance matrix of the cluster centers
Quality criteria (resp. cost functions)
Minimization of trace(W)
Minimization of the determinant det(W)
Maximization of trace(B W^-1)
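Sketch of the three criteria for a labelled data set (here W is taken as the mean within-cluster covariance and B as the covariance of the cluster centers; these naming assumptions and the data are illustrative):
```python
import numpy as np

def scatter_criteria(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    W = np.mean([np.cov(c, rowvar=False) for c in clusters], axis=0)   # mean within-cluster covariance
    centers = np.array([c.mean(axis=0) for c in clusters])
    B = np.cov(centers, rowvar=False)                                  # covariance of the cluster centers
    return np.trace(W), np.linalg.det(W), np.trace(B @ np.linalg.inv(W))

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in ([0, 0], [3, 0], [0, 3])])
labels = np.repeat([0, 1, 2], 50)
print(scatter_criteria(X, labels))
```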
4.4.1 Simulated Annealing for Clustering
Basic Principle
Initialization: random partitioning of the data set into initial clusters (e. g. heuristically)
compute the cost change ΔE for each possible relabeling of an individual data element xi into a randomly selected cluster Cj
if ΔE < 0, or otherwise with probability exp(-ΔE/T), execute the according relabeling
decrease T slightly, e. g. by setting T = a * T with a close to, but smaller than, 1
Termination Condition: after a fixed number of steps (the more the better)
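A minimal sketch of this loop, assuming a trace(W)-style cost (sum of squared distances to the cluster means) and the usual Metropolis acceptance rule; all parameter values are illustrative:
```python
import numpy as np

def sa_clustering(X, K, T=1.0, alpha=0.999, steps=4000, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))                # random initial partitioning

    def cost(lab):                                       # sum of squared distances to cluster means
        return sum(((X[lab == k] - X[lab == k].mean(axis=0))**2).sum()
                   for k in range(K) if np.any(lab == k))

    E = cost(labels)
    for _ in range(steps):
        i = rng.integers(len(X))                         # pick a data element x_i
        new = labels.copy()
        new[i] = rng.integers(K)                         # relabel into a randomly selected cluster C_j
        dE = cost(new) - E
        if dE < 0 or rng.random() < np.exp(-dE / T):     # accept improvements, sometimes worsenings
            labels, E = new, E + dE
        T = max(T * alpha, 1e-6)                         # slowly decrease the temperature
    return labels

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in ([0, 0], [2, 2])])
print(sa_clustering(X, K=2))
```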
Basic Idea
4.2 Distance Functions
starting point: distance matrix
requirements for a distance measure
symmetry
positive definite
triangle inequality
4.2.1 Distance measures for data points
Distance measures for real-valued data vectors are
Euclidean Distance
Pearson or χ²-Distance
Mahalanobis Distance
'City Block' Distance
Supremum Distance
Minkowski Distance
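Sketch of these measures with scipy.spatial.distance (the Pearson/χ² distance is written out by hand, normalizing by per-feature variances estimated from an illustrative sample):
```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])
X = np.random.default_rng(10).normal(size=(100, 3))       # sample for variances / covariance

print(distance.euclidean(x, y))
print(np.sqrt(np.sum((x - y)**2 / X.var(axis=0))))        # Pearson / chi^2 distance
print(distance.mahalanobis(x, y, np.linalg.inv(np.cov(X, rowvar=False))))
print(distance.cityblock(x, y))                           # 'city block' / L1 distance
print(distance.chebyshev(x, y))                           # supremum distance
print(distance.minkowski(x, y, p=3))
```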
4.2.2 Distance measures between clusters
minimal distance
maximal distance
average distance
centroid distance
4.5 Prototype-based Clustering
Goal: Partitioning by assignment of N data points to K < N typical prototypes (representing the clusters)
two assignments
Deterministic Assignment
each data point is exactly assigned to a single prototype (hard clustering)
Probabilistic Assignment
each data point xi-> is assigned with probability hi,j to the j-th prototype vector (soft clustering)
4.5.1 Hard Clustering by Vector Quantization
idea: each data point xi-> is assigned to one of the K prototypes {wj->}j=1..K via a label in {1,...,K}
goal: minimization of the average quantization error E by varying the wj->
Suitable Error function
[to be continued]
the following algorithm allows optimization of both assignment variables and prototype locations
Initialization of prototypes
While keeping prototypes fixed, choose assignments hij so that E is minimized (Voronoi cells, winner-takes-all rule)
While keeping assignments fixed, choose prototypes so that E is minimized
If the error decrease |ΔE| < ε: stop, else goto 2
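A minimal NumPy sketch of this alternating optimization (i. e. the standard k-means / Lloyd iteration); K, epsilon and the data are illustrative:
```python
import numpy as np

def vector_quantization(X, K, eps=1e-6, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), K, replace=False)]            # 1. initialize prototypes
    E_old = np.inf
    for _ in range(max_iter):
        d = ((X[:, None, :] - W[None, :, :])**2).sum(-1)   # squared distances to all prototypes
        labels = d.argmin(axis=1)                          # 2. winner-takes-all assignment
        E = d[np.arange(len(X)), labels].mean()            # average quantization error
        if abs(E_old - E) < eps:                           # 4. stop once the decrease is tiny
            break
        E_old = E
        for j in range(K):                                 # 3. prototypes = means of their cells
            if np.any(labels == j):
                W[j] = X[labels == j].mean(axis=0)
    return W, labels

rng = np.random.default_rng(11)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in ([0, 0], [2, 0], [1, 2])])
W, labels = vector_quantization(X, K=3)
```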
4.5.2 Soft clustering with Mixture Models
goal:
implement the possibility to represent uncertainty in the assignment of data to clusters (uncertainty will be expressed in form of continuous assignment probabilities)
approach
description of the data distribution by a probability density p(x->)
instead of a disjoint partitioning into 'hard' clusters -> mixture model
interpretation
each summand describes a soft cluster Cj with centroid μj-> and a spatial extension which is determined by the covariance matrix Σj
[to be continued]
question
what parameters are the most probable ones according to the information given in the data set D?
EM algorithm (Expectation-Maximization algorithm)
the trick is the application of Jensen's inequality
Initialization: starting values (t = 0)
E-Step: Computation of new membership probabilities
M-Step: Computation of new estimates for the cluster centers as well as for the cluster covariances
updated probability mass in cluster k
updated centroid of cluster k
updated covariance of cluster k
Loop Condition: if the maximal number of iterations has not yet been reached: goto 2
local maximization of L can be achieved by the following heuristically motivated iterative method
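A compact sketch of the E- and M-steps for a Gaussian mixture (initialization, the small ridge term and the data are illustrative; log-likelihood monitoring is omitted for brevity):
```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)                              # probability mass per cluster
    mu = X[rng.choice(N, K, replace=False)]               # initial centroids
    Sigma = np.array([np.cov(X, rowvar=False) for _ in range(K)])

    for _ in range(n_iter):
        # E-step: membership probabilities h_ik
        p = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                             for k in range(K)])
        h = p / p.sum(axis=1, keepdims=True)

        # M-step: updated mass, centroid and covariance of each cluster
        Nk = h.sum(axis=0)
        pi = Nk / N
        mu = (h.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (h[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, h

rng = np.random.default_rng(12)
X = np.vstack([rng.normal([0, 0], 0.5, size=(80, 2)),
               rng.normal([3, 1], 0.7, size=(80, 2))])
pi, mu, Sigma, h = em_gmm(X, K=2)
```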
4.3 Hierarchical Clustering
4.3.1 Agglomerative Clustering
procedure
[to be continued]
termination conditions
number of clusters
error
distance
1. Single Linkage Clustering (SLC)
SLC is inclined to form 'strings of clusters'
2. Complete Linkage Clustering (CLC)
CLC has the tendency to result in compact sphere-shaped clusters
3. Average linkage clustering
well balanced, between CLC and SLC
Centroid Linkage Clustering
each cluster is represented by the mean of its feature vectors (its centroid)
the computation of means requires real-valued variables
attention: when merging two clusters, the center-of-mass of the resulting cluster is dominated by the bigger cluster
Ward's Linkage Clustering
with each step, a pair of clusters Ci, Cj is merged that increases the mean standard deviation of the data around the cluster centers the least
this favors the formation of spherical clusters
leads to a treelike division of clusters. There are two opposite procedures
divisive clustering: start with a single cluster that contains all data, stepwise division into smaller clusters (top-down)
agglomerative clustering: initialize all data points as 1-element clusters, stepwise merging of clusters (bottom-up)
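Sketch: agglomerative clustering with the different linkage variants via SciPy; the resulting merge tree can be cut e. g. by a target number of clusters (data invented):
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(13)
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in ([0, 0], [2, 2], [0, 3])])

for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)                      # bottom-up merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")    # termination: number of clusters
    print(method, np.bincount(labels)[1:])             # resulting cluster sizes
```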
4.3.2 Recursive Distance Measures
single linkage
complete linkage
centroid linkage clustering
average linkage clustering
Ward's linkage Clustering
4.6. Evaluation of Clustering results
questions
how many clusters exist?
how unique is the found clustering?
4.6.1 The number of clusters
basic idea
compare the cluster quality for a given number of clusters c
choose the smallest number of clusters with acceptable quality
-> this shifts the problem towards the definition of a suitable cluster quality measure
Gap-Statistic
compare the logarithmized intra-cluster variance for c clusters with its mean value over a set of M enforced clusterings of uniformly distributed random data sets into likewise c clusters
further, simpler proposals
approach by Calinski & Harabasz
alternative approach by Hartigan (1975)
4.6.2. Uniqueness of Clustering results
basic idea: compare results of repeated clustering runs under slight variations
measure for comparison for two clusterings C1 and C2 (i. e. complete partitionings of data into clusters)
Rand-Index
(to be continued)
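A small pair-counting sketch of the Rand index for two complete partitionings C1, C2 (the label vectors are illustrative; a value of 1 means both clusterings group every pair of points identically):
```python
import numpy as np
from itertools import combinations

def rand_index(c1, c2):
    # count pairs on whose grouping (same cluster vs. different clusters) both clusterings agree
    agree = sum((c1[i] == c1[j]) == (c2[i] == c2[j])
                for i, j in combinations(range(len(c1)), 2))
    return agree / (len(c1) * (len(c1) - 1) / 2)

c1 = np.array([0, 0, 0, 1, 1, 2, 2])
c2 = np.array([1, 1, 0, 0, 0, 2, 2])
print(rand_index(c1, c2))
```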
Clustering Methods
4.1 Relevance for Data Mining
Clusters are the simplest form of structure
prevalence of clusters suggests relatedness / affinity and presence of correlations
multidimensional clusters can be interpreted as a simple form of rule
cluster centers offer an economic description of the data (data reduction)
Data visualization
5.2 Elementary Visualization techniques
glyphs
visualization of several data dimensions with the help of geometric attributes or color
histograms
are nothing but special glyphs, with a standard mapping
usually: bin lengths = frequency of occurrence of points within the discretization bin
alternative use: representing the value xk of a single data point's feature k as the k-th bin
Chernoff-Faces
mapping of Data features to Object properties
historically introduced with faces as parameterized objects that have dedicated properties
parallel coordinates
representation of a data point as a polyline between parallel arranged coordinate axes
a data set becomes a bunch of lines (this helps to quickly recognize common trails, corresponding to clusters)
the color of the lines can be used to mark certain classes, so this can also be used for clustering
color coding
at most 3 dimensions can be mapped onto color axes
depending on the color space, the dimensions can be R(ed), G(reen), B(lue), H(ue), S(aturation), I(ntensity) etc.
can be combined with other methods, e. g. scatter plots or glyphs
box plots
summary of probability distributions via structured glyphs
often presented: minimum, 10%-quantile, σ-interval, median, mean, 90%-quantile, maximum; the depiction style may vary, i. e. plots require a legend
dimensional stacking
definition of a dimension order (slow to fast) (e. g. x1, x2, x3, x4)
discretize in nk intervals along axes k, k = 1,...,d-2
graphical representation is recursively defined
brushing
geometrically defined data subsets are highlighted
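A bare-bones matplotlib sketch of the parallel-coordinates idea above (each data point becomes a polyline over min-max normalized axes, line color marks the class; the data is synthetic):
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(14)
X = np.vstack([rng.normal(0, 1, size=(30, 4)), rng.normal(2, 1, size=(30, 4))])
classes = np.repeat([0, 1], 30)

Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))   # normalize each axis to [0, 1]
for row, c in zip(Xn, classes):
    plt.plot(range(X.shape[1]), row, color=("tab:blue", "tab:orange")[c], alpha=0.5)
plt.xticks(range(X.shape[1]), [f"x{k+1}" for k in range(X.shape[1])])
plt.show()
```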
5.3 Multidimensional Scaling
goal
construction of a low-dimensional 'projection' RD -> Rd which tries to maintain distances as well as possible ('abstandstreu', i. e. distance-preserving)
to maximally preserve distances
given
given data points
D = dimension of the data space
wanted: low-dimensional image
d = dimension of the projection space
5.3.1 Sammon-Mapping
results from the minimization of a cost function E for distance distortions
distance in the high-dimensional space
distance in the low-dimensional space
approach for a cost function E
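A sketch that minimizes Sammon's cost function directly with a generic optimizer instead of a hand-derived gradient iteration (data, target dimension and optimizer choice are illustrative):
```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist

rng = np.random.default_rng(15)
X = rng.normal(size=(60, 10))                 # high-dimensional data in R^D
D = pdist(X)                                  # distances D_ij in the data space
d_target = 2

def stress(flat):
    d = pdist(flat.reshape(-1, d_target))     # distances d_ij in the projection space
    return np.sum((D - d)**2 / D) / D.sum()   # Sammon's cost function E

y0 = rng.normal(scale=1e-2, size=len(X) * d_target)   # random initial image points
res = minimize(stress, y0, method="L-BFGS-B")
Y = res.x.reshape(-1, d_target)               # low-dimensional image of the data
```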
5.3.2 Sammon-Mapping with a target table of coordinates
the target points are here simply provided via a look-up table
cost function [to be continued]
5.3.2 Variant: Non-metric Multidimensional Scaling
idea
instead of a mapping that preserves the distances themselves, we aim at the best possible preservation of the distance rank order
advantage: higher robustness w.r.t. the choice of the distance metric D
5.1 Motivation
goal
Transformation of high-dimensional data spaces (which are difficult to depict) into our familiar perceptual spaces
visual: visualization
auditory: sonification
tactile: haptualization/ tactile interfaces
reasons/ motivation
a comprehensive automation of structure discovery is unrealistic in the foreseeable future
the pattern recognition capabilities of the human eye / visual system (ditto: ear / auditory system) are so far not reached by any technical system
Data Mining can roughly be divided into the steps
Exploratory Data Analysis
Confirmatory Data Analysis
before we can confirm a structure we first need to know what we are looking for
Visualization (sonification etc.) supports the hypothesis generation within (1.)
the visual system is not only a tool, but also a model: it somehow 'compresses' or condenses subsymbolic light patterns into robustly emerging objects that are recognized and remembered and allows their characterization in terms of material, position, grouping etc.
5.4 Further Projection Methods: Projection Pursuit
5.4.1 Projection Pursuit
a simple way to find a low-dimensional projection is PCA
a key disadvantage is that PCA is only variance based: no other structural property is respected
Generalization of the principle
'to maximize a criterion for well-structuredness'
yields other (maybe) more meaningful projections
the central element is the structure quality measure ('Strukturbewertungsmaß')
many structures manifest in the occurrence of local clustering
the seemingly 'more structured' plot is characterized by
a larger number of short data point pair distances
while having the same global variance as the other plot
a method that is only sensitive to variance can't assess any difference between these plots
generalization to 2D-projections
method 1: sequential application of the 1D method
method 2: direct minimization of an analogously defined structure quality measure for two directions k, l
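A naive 1D sketch of the search idea: draw random unit directions and keep the one with the best structure score; the score used here (share of short pairwise distances at fixed projection variance) is only a toy stand-in for a proper structure quality measure:
```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(16)
X = rng.normal(size=(200, 5))
X[:100, 0] += 3.0                             # hidden two-cluster structure along one axis
Xc = X - X.mean(axis=0)

def structure_score(p):
    p = (p - p.mean()) / p.std()              # fix the variance of the 1D projection
    return np.mean(pdist(p[:, None]) < 0.5)   # share of short data point pair distances

best = max((rng.normal(size=X.shape[1]) for _ in range(500)),
           key=lambda w: structure_score(Xc @ (w / np.linalg.norm(w))))
print("best direction:", best / np.linalg.norm(best))   # close to the first coordinate axis here
```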