Chapter 9 - Dimensionality Reduction Using Feature Extraction
If we take a color image of 256 x 256 pixels, we have 196,608 features (256 x 256 pixels x 3 color channels); since each pixel can take one of 256 values, we end up with 256^196608 different configurations our observations can take.
This means that we would not be able to collect enough data to cover even a small fraction of these configurations, so our learning algorithms do not have enough data to operate correctly.
The goal of feature extraction for dimensionality reduction
is to reduce the number of features with only a small loss in our data's ability to generate high-quality predictions.
One downside of the feature extraction techniques we discuss is that the new features we generate will not be interpretable by humans; they will look like random numbers to the human eye.
If we wanted to maintain our ability to interpret our models, dimensionality reduction through feature selection is a better option.
1. Reducing Features Using Principal Components
: Given a set of features, you want to reduce the number of features while retaining the variance in the data.
AN Note: Linear Dimensionality reduction technique.
PCA projects observations onto the (hopefully fewer) principal components of the feature matrix that retain the most variance.
For a mathematical description of how PCA works, check external resources, AN chose the below:
PCA is an unsupervised technique: it only takes the feature matrix, no target vector.
PCA is a popular linear dimensionality reduction technique
If we wanted to reduce our features, one strategy would be to project all observations in our 2D space onto the 1D first principal component. We would lose the information captured in the second principal component, but in some situations that would be an acceptable trade-off. This is PCA.
Example: if we have 2 features x1 and x2 and they spread out like a cigar with a lot of length and very little height, we can say that the variance along the "length" is significantly greater than along the "height".
Instead of length and height, we refer to the "direction" with the most variance as the 1st principal component, the "direction" with the second-most variance as the 2nd principal component, and so on.
whiten=True: transforms the values of each principal component so that they have zero mean and unit variance.
svd_solver="randomized": implements a stochastic algorithm to find the first principal components in often significantly less time.
n_components: has two (2) operations, depending on the argument provided.
If the argument is greater than 1, n_components will return that many features.
Fortunately for us, if the argument is between 0 and 1, PCA returns the minimum number of features that retain that much variance (e.g. 0.99 for 99%).
AN: See below the interpretation of the PCA output:
The output of the solution in the book shows that PCA let us reduce our dimensionality by 10 features while still retaining 99% of the information (variance) in the feature matrix.
Use: decomposition.PCA(n_components=0.99, whiten=True)
Creating a PCA that will retain 99% of the variance.
features_pca = pca.fit_transform(features_std)
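AN: a runnable sketch of the snippet above, assuming the scikit-learn digits data that the chapter uses elsewhere, with the standardization step the book's solution applies first:

# Load the digits data, standardize it, then keep enough principal
# components to retain 99% of the variance (whitened output).
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()
features_std = StandardScaler().fit_transform(digits.data)

pca = PCA(n_components=0.99, whiten=True)
features_pca = pca.fit_transform(features_std)

print("Original number of features:", features_std.shape[1])  # 64
print("Reduced number of features:", features_pca.shape[1])   # 54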
2. Reducing Features When Data Is Linearly Inseparable
: You suspect you have linearly inseparable data and want to reduce the dimensions.
More on the kernel trick
Kernel options: rbf, poly, sigmoid, linear, maybe more options.
We want a transformation that would both reduce the dimensions and also make the data linearly separable. Kernel PCA can do both.
Kernels allow us to project the linearly inseparable data into a higher dimension where it is linearly separable, this is called the kernel trick.
In the example in the book, if we use linear PCA to project the data onto the first principal component, the two classes will overlap.
Standard PCA uses linear projection to reduce the features.
If our data is not linearly separable (we can separate the classes only with a curved decision boundary), we use KernelPCA with a non-linear kernel.
If the data is linearly separable (i.e. you can draw a straight line or hyperplane between the different classes), then standard PCA works well.
One (1) downside of kernel PCA is that there are a number of parameters we need to specify; we cannot pass 0.99 as n_components to get back a set of features retaining 99% of the variance, as we could with standard PCA.
Each kernel comes with its own hyperparameters; for example, the radial basis function (rbf) requires a gamma value.
How do we know which values to use? Through trial and error: train our machine learning model multiple times, each time with a different kernel or a different value of the parameter. Once we find the combination of values that produces the highest-quality predicted values, we are done. We learn this in depth in Chapter 12.
Use: kpca = decomposition.KernelPCA(kernel="rbf", gamma=15, n_components=1)
features_kpca = kpca.fit_transform(features)
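AN: a runnable version of the snippet above, using simulated concentric circles as a stand-in for the book's linearly inseparable data:

# Simulate linearly inseparable data (one class is a ring around the other),
# then use kernel PCA with an RBF kernel to project it onto one component.
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

features, _ = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

kpca = KernelPCA(kernel="rbf", gamma=15, n_components=1)
features_kpca = kpca.fit_transform(features)

print("Original number of features:", features.shape[1])     # 2
print("Reduced number of features:", features_kpca.shape[1])  # 1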
3. Reducing Features by Maximizing Class Separability
: You want to reduce the features to be used by a classifier.
Try Linear discriminant analysis (LDA) to project the features onto component axes that maximize the separation of classes
We can use lda.explained_variance_ratio_ to view the amount of variance explained by each component.
Use: lda = discriminant_analysis.LinearDiscriminantAnalysis(n_components=1)
features_lda = lda.fit(features,target).transform(features)
n_components: how many features to return.
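AN: a runnable version of the snippet above; the Iris dataset here is an assumption standing in for the book's example data:

# Project the four Iris features onto the single axis that best separates the classes.
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
features, target = iris.data, iris.target

lda = LinearDiscriminantAnalysis(n_components=1)
features_lda = lda.fit(features, target).transform(features)

print("Original number of features:", features.shape[1])     # 4
print("Reduced number of features:", features_lda.shape[1])   # 1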
In PCA we were only interested in the component axes that maximize the variance in the data, while in LDA we have the additional goal of maximizing the differences between classes.
In our example the x axis is the axis of maximum discrimination between the classes; we can project our feature space onto it and hence reduce our dimensionality by one.
Similar to PCA it projects our feature space onto a lower-dimensional space. The difference is in the next point.
We can take advantage of the fact that explained_variance_ratio_ tells us the variance explained by each output feature and is a sorted array. For example, lda.explained_variance_ratio_ returns: array([0.99147248])
LDA is a classification technique that is also popular for dimensionality reduction.
We can run LinearDiscriminantAnalysis with n_components=None to return the ratio of variance explained by every component feature, then calculate how many components are required to get above some threshold of variance (often 0.95 or 0.99). Check the code in the book to see how; a sketch follows below.
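AN: a sketch of that thresholding idea; the helper function and the Iris data are illustrative, not the book's exact code:

# Fit LDA with n_components=None so every component's variance ratio is available,
# then count how many components are needed to pass the variance threshold.
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
lda = LinearDiscriminantAnalysis(n_components=None)
lda.fit(iris.data, iris.target)
lda_var_ratios = lda.explained_variance_ratio_

def select_n_components(var_ratio, goal_var):
    # Accumulate explained variance until the goal is reached
    total_variance, n_components = 0.0, 0
    for explained_variance in var_ratio:
        total_variance += explained_variance
        n_components += 1
        if total_variance >= goal_var:
            break
    return n_components

print(select_n_components(lda_var_ratios, 0.95))  # 1 for Iris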
4. Reducing Features Using Matrix Factorization
: You have a feature matrix of nonnegative values and want to reduce the dimensionality.
Formally, given the desired number of returned features r, NMF factorizes our feature matrix such that V ≈ WH, where V is our d x n feature matrix (i.e., d features, n observations), W is a d x r matrix, and H is an r x n matrix. By adjusting the value of r we can set the amount of dimensionality reduction desired.
NMF is an unsupervised technique for linear dimensionality reduction
that factorizes (i.e. breaks up into multiple matrices whose product approximates the original matrix) the feature matrix into matrices representing the latent relationship between observations and their features.
Unlike PCA, NMF does not provide us with the explained variance of the output features.
Use: decomposition.NMF(n_components=10, random_state=1)
Use non-negative matrix factorization (NMF) to reduce the dimensionality of the feature matrix.
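AN: a runnable version of the snippet above, again assuming the digits data (its pixel intensities are nonnegative, which NMF requires):

# Factorize the nonnegative digits feature matrix into 10 latent features.
from sklearn import datasets
from sklearn.decomposition import NMF

digits = datasets.load_digits()
features = digits.data  # pixel intensities, all nonnegative

nmf = NMF(n_components=10, random_state=1)
features_nmf = nmf.fit_transform(features)  # W in scikit-learn's convention
# nmf.components_ is H; together features ≈ features_nmf @ nmf.components_

print("Original number of features:", features.shape[1])     # 64
print("Reduced number of features:", features_nmf.shape[1])   # 10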
5. Reducing Features on Sparse Data
: You have a sparse feature matrix and want to reduce the dimensionality.
TSVD is similar to PCA; in fact, PCA often uses non-truncated SVD in one of its steps.
As with linear discriminant analysis, we have to specify the number of features (components) we want outputted.
One issue with TSVD is that, because it uses a random number generator, the signs of the output can flip between fittings. A workaround is to use fit only once (1) per preprocessing pipeline, then use transform multiple times.
In regular SVD, given d features, SVD will create factor matrices that are d x d,
whereas TSVD will return factors that are n x n, where n is specified beforehand by a parameter.
The practical advantage of TSVD is that unlike PCA, it works on sparse feature matrices.
What is the optimum number of components? Include n_components as a hyperparameter to optimize during model selection (i.e. choose the value for n_components that produces the best trained model).
Use: tsvd_var_ratios = tsvd.explained_variance_ratio_
tsvd = TruncatedSVD(n_components=features_sparse.shape[1] - 1)
Alternatively, because TSVD provides us with the ratio of the original feature matrix's variance explained by each component, we can select the number of components that explain a desired amount of variance. For example, in our solution the first three output components explain approximately 30% of the original data's variance. (Check the automation of this process on page 167 of the book.)
Use: tsvd = decomposition.TruncatedSVD(n_components=10)
features = scaler.fit_transform(digits.data)
features_sparse = sparse.csr_matrix(features)
features_sparse_tsvd = tsvd.fit(features_sparse).transform(features_sparse)
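AN: a runnable version of the snippets above, plus a sketch of the variance-threshold alternative (the 0.95 threshold is illustrative):

# Standardize the digits data, convert it to a sparse matrix, and apply TSVD.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()
features = StandardScaler().fit_transform(digits.data)
features_sparse = csr_matrix(features)

tsvd = TruncatedSVD(n_components=10)
features_sparse_tsvd = tsvd.fit(features_sparse).transform(features_sparse)

print("Original number of features:", features_sparse.shape[1])       # 64
print("Reduced number of features:", features_sparse_tsvd.shape[1])   # 10

# Alternative: fit with one fewer component than there are features, then pick
# the smallest number of components whose cumulative explained variance
# crosses the chosen threshold.
tsvd_full = TruncatedSVD(n_components=features_sparse.shape[1] - 1)
tsvd_full.fit(features_sparse)
cumulative = np.cumsum(tsvd_full.explained_variance_ratio_)
print("Components needed for 95% variance:", np.argmax(cumulative >= 0.95) + 1)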