Chapter 10 - Dimensionality Reduction Using Feature Selection
In this alternative approach, we keep high-quality, informative features and drop less useful ones. This is called Feature Selection.
Three (3) types of feature selection methods:
Wrapper methods: use trial and error to find the subset of features that produces models with the highest quality predictions.
Embedded methods: select the best feature subset as part of, or as an extension of, a learning algorithm's training process.
Filter methods: select the best features by examining their statistical properties.
In Chapter 9 we discussed how to reduce the dimensionality of our feature matrix by creating new features with (ideally) similar ability to train quality models but with significantly fewer dimensions.
This is called Feature Extraction.
Because embedded methods are closely intertwined with specific learning algorithms, they are difficult to explain prior to a deeper dive into the algorithms themselves. Therefore, we will cover only filter and wrapper feature selection methods here; embedded methods are covered in the chapters on the ML algorithms themselves.
1. Thresholding Numerical Feature Variance
You have a set of numerical features and want to remove those with low variance (i.e. likely containing little information).
Use: thresholder = feature_selection.VarianceThreshold(threshold=0.5)
features_high_variance = thresholder.fit_transform(features)
This is one of the most basic approaches to feature selection.
It is motivated by the idea that features with low variance are likely less interesting than features with high variance.
VarianceThreshold (VT) calculates the variance of each feature, then drops all features whose variance does not meet the given threshold.
Equation: Var(x) = (1/n) * Σ_{i=1..n} (x_i - μ)²
(n: number of observations, x_i: value of the i-th observation, μ: mean value of the feature)
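A quick numeric check of this formula (the values below are made up purely for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
mu = x.mean()

# Population variance per the equation above: (1/n) * sum((x_i - mu)^2)
variance = ((x - mu) ** 2).sum() / len(x)
print(variance)    # 1.25
print(np.var(x))   # same value; np.var uses the population variance by default
```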
Two (2) things to keep in mind:
1. The variance is not centered; that is, it is in the squared units of the feature itself. Therefore, variance thresholding will not work when the feature set contains features in different units (e.g. one feature in years while a different feature is in dollars).
2. The variance threshold is selected manually, so we have to use our own judgment to choose a good value (or use a model selection technique described in Chapter 12).
Also, do not scale (standardize) the features before doing this: standardized features all have unit variance, so variance thresholding would not work correctly.
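A minimal end-to-end sketch of this recipe; the iris dataset is just an illustrative choice, not something the notes specify:

```python
from sklearn import datasets, feature_selection

# Example numerical feature matrix (iris is only an illustrative choice)
iris = datasets.load_iris()
features = iris.data

# Create a thresholder that drops features with variance below 0.5
thresholder = feature_selection.VarianceThreshold(threshold=0.5)

# Keep only the high-variance features
features_high_variance = thresholder.fit_transform(features)

print(thresholder.variances_)        # variance of each original feature
print(features_high_variance[:3])    # first rows of the reduced matrix
```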
2. Thresholding Binary Feature Variance
: You have a set of binary categorical features and want to remove those with low variance (i.e. likely containing little information)
Use: thresholder = feature_selection.VarianceThreshold(threshold=(.75 * (1 - .75)))
In Bernoulli random variables, variance is calculated as:
Var(x) = p(1 - p)
p: the proportion of observations belonging to class 1.
Therefore, by setting p we can remove features where the vast majority of observations are one class.
Select a subset of features with a Bernoulli random variable variance above a given threshold
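A small sketch with made-up binary data, keeping features whose variance exceeds that of a feature where 75% of observations fall in one class:

```python
from sklearn.feature_selection import VarianceThreshold

# Toy binary feature matrix (illustrative values):
# feature 0 is mostly 0, feature 1 is mostly 1, feature 2 is more balanced
features = [[0, 1, 0],
            [0, 1, 1],
            [0, 1, 0],
            [0, 1, 1],
            [1, 0, 0]]

# Threshold: variance of a Bernoulli variable with p = 0.75
thresholder = VarianceThreshold(threshold=(.75 * (1 - .75)))

# Only the more balanced (higher-variance) feature survives
print(thresholder.fit_transform(features))
```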
3. Handling Highly Correlated Features
: You have a feature matrix and suspect some features are highly correlated
Use: corr_matrix = dataframe.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
upper keeps only the values above the diagonal of the correlation matrix, so each pair of features is checked only once.
Then drop the columns that have a correlation greater than 0.95 with another feature.
One problem we often run into is highly correlated features. If two features are highly correlated, the information they contain is very similar, and it is likely redundant to include both features. The solution to highly correlated features is SIMPLE: remove one of them from the feature set.
: Use a correlation matrix to check for highly correlated features; if highly correlated features exist, consider dropping one of the correlated features.
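A sketch of the full recipe on made-up data in which the first two columns are nearly identical; the data and any names beyond those in the notes are assumptions:

```python
import numpy as np
import pandas as pd

# Toy feature matrix: columns 0 and 1 are highly correlated (illustrative data)
features = np.array([[1, 1, 1],
                     [2, 2, 0],
                     [3, 3, 1],
                     [4, 4, 0],
                     [5, 5, 1],
                     [6, 6, 0],
                     [7, 7, 1],
                     [8, 7, 0],
                     [9, 7, 1]])
dataframe = pd.DataFrame(features)

# Absolute correlation matrix
corr_matrix = dataframe.corr().abs()

# Upper triangle (above the diagonal), so each feature pair is checked once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Columns with a correlation above 0.95 with an earlier column
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop one feature from each highly correlated pair
print(dataframe.drop(to_drop, axis=1).head(3))
```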
4. Removing Irrelevant Features For Classification
: You have a categorical target vector and want to remove uninformative features.
Use: chi2_selector = feature_selection.SelectKBest(feature_selection.chi2, k=2)
features_kbest = chi2_selector.fit_transform(features, target)
Use: fvalue_selector = SelectKBest(f_classif, k=2)
features_kbest = fvalue_selector.fit_transform(features, target)
If the features are categorical, calculate a chi-squared (χ²) statistic between each feature and the target vector.
If the features are quantitative, calculate the ANOVA F-value between each feature and the target vector.
Instead of selecting a specific number of features, we can select a percentile (e.g. the top 75% of features with the highest F-values).
Use: fvalue_selector = SelectPercentile(f_classif, percentile=75)
features_kbest = fvalue_selector.fit_transform(features, target)
features = features.astype(int) (convert the numerical features to integers so they can be treated as categorical before applying chi-squared)
If we have a quantitative feature, we can instead calculate the ANOVA F-value statistic, which is computed between each feature and the target vector.
F-value scores examine whether, when we group the numerical feature by the target vector's classes, the means for each group are significantly different.
If the means are not significantly different, the feature does not help predict the target vector, and therefore the feature is irrelevant.
Chi-squared statistics examine the independence of two (2) categorical vectors. That is, the statistic is the difference between the observed number of observations in each class of a categorical feature and what we would expect if that feature were independent of (i.e. had no relationship with) the target vector.
Equation: χ² = Σ_{i=1..n} (O_i - E_i)² / E_i
O_i: the number of observations in class i.
E_i: the number of observations in class i we would expect if there were no relationship between the feature and the target vector.
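A tiny worked example of this equation with made-up counts (purely illustrative):

```python
import numpy as np

# Hypothetical observed counts in each class of a categorical feature
observed = np.array([10, 30])

# Counts we would expect in each class if the feature were independent
# of the target vector
expected = np.array([20, 20])

# Chi-squared statistic: sum((O_i - E_i)^2 / E_i)
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(chi2_stat)  # (100/20) + (100/20) = 10.0
```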
A chi-squared statistic is a single number that tells you how much difference exists between your observed counts and the counts you would expect if there were no relationship at all in the population.
By calculating the chi-squared statistic between a feature and the target vector, we obtain a measurement of the independence between the two.
If the target is independent of the feature variable, then the feature is irrelevant for our purposes because it contains no information we can use for classification.
On the other hand, if the feature and the target are highly dependent, the feature is likely very informative for training our model.
Important: chi-squared can only be calculated between two (2) categorical vectors, i.e. both the target vector and the features must be categorical. If we have a numerical feature, we can still use the chi-squared technique by first transforming the quantitative feature into a categorical feature.
Finally, to use our chi-squared approach, all values need to be non-negative.
We select the features with the best chi-squared statistics. Use SelectKBest.
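Putting the recipe together; the iris dataset and the integer conversion are just an illustrative setup, not prescribed by the notes:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2, f_classif

# Example data (iris is only an illustrative choice)
iris = load_iris()
features, target = iris.data, iris.target

# Convert the features to integers so they can be treated as categorical
# (chi-squared requires categorical, non-negative features)
features_categorical = features.astype(int)

# Keep the two features with the highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
features_kbest = chi2_selector.fit_transform(features_categorical, target)

# For quantitative features, keep the two with the highest ANOVA F-values
fvalue_selector = SelectKBest(f_classif, k=2)
features_kbest = fvalue_selector.fit_transform(features, target)

# Or keep the top 75% of features by F-value
percentile_selector = SelectPercentile(f_classif, percentile=75)
features_top75 = percentile_selector.fit_transform(features, target)
print(features_top75.shape)
```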
5. Recursively Eliminating Features
: You want to automatically select the best features to keep.
Use: rfecv = feature_selection.RFECV(estimator=model, step=1, scoring="neg_mean_squared_error")
To see the number of features we should keep: rfecv.n_features_
To see which of those features we should keep: rfecv.support_
To see the rankings of the features: rfecv.ranking_ (rank 1 is best).
: Use sklearn's RFECV to conduct recursive feature elimination (RFE) using cross-validation (CV). That is, repeatedly train a model, each time removing a feature, until model performance (e.g. accuracy) becomes worse. The remaining features are the best.
The first time we train the model, we include all the features. Then we find the feature with the smallest parameter (note that this assumes the features are either rescaled or standardized), meaning it is less important, and remove that feature from the feature set.
How many features should we keep? We could (hypothetically) repeat this loop until only one feature remains, but a better approach requires cross-validation.
In RFE with CV, after every iteration we use cross-validation to evaluate our model. If the model performed better after we took a specific feature out, we continue the loop; if not, we return that feature to the feature set and select those features as the best.
RFE with CV is implemented in sklearn using RFECV()
step: sets the number or proportion of features to drop during each loop.
scoring: sets the metric of quality we use to evaluate our model during cross-validation.
estimator: determines the type of model we want to train (e.g. linear regression).
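A runnable sketch of the whole recipe; the synthetic regression data and the linear regression estimator are assumptions used for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic regression data: 100 features, only 3 of them informative (illustrative)
features, target = make_regression(n_samples=1000,
                                    n_features=100,
                                    n_informative=3,
                                    random_state=1)

# The estimator whose coefficients are used to rank features
model = LinearRegression()

# Recursive feature elimination with cross-validation:
# drop one feature per loop, score folds with negative MSE
rfecv = RFECV(estimator=model, step=1, scoring="neg_mean_squared_error")
rfecv.fit(features, target)

print(rfecv.n_features_)  # number of features to keep
print(rfecv.support_)     # boolean mask of the selected features
print(rfecv.ranking_)     # feature rankings (1 = best)
```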
There is also a YouTube playlist with a more extensive list of feature selection techniques.