Chapter 4 Building Training Sets + Preprocessing
Missing Values Problem
L1 and L2 regularization
Estimators: classifiers with an API similar to the transformer class
Feature Scaling
Handling Categorical Data
scikit-learn Transformer classes, used for data transformation
Important for scale-dependent algorithms, like gradient descent (GD).
Normalization - rescale features to the range [0, 1] (min-max scaling)
2 types: Nominal (categories with no intrinsic order, e.g. color)
Ordinal (categories that can be ordered, e.g. t-shirt size)
Drop missing values with pandas DataFrame methods
Imputer, a transformer
key methods
estimators have a fit and a predict method
transformers have a fit and a transform method
Generalization Error Problem
L2 Regularization introduces a penalty for large individual weights
Topics
Remove/impute missing values
Get categorical data into shape via one-hot encoding
Select relevant features
.isnull().sum() counts missing values per column
dropna(axis=0) drops rows with missing data; dropna(axis=1) drops columns; dropna(how='all') drops rows where all values are NaN; dropna(thresh=x) drops rows with fewer than x non-NaN values
dropna(subset=['Name']) drop rows with missing data in 'Name' feature
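A minimal sketch of these pandas calls on a hypothetical DataFrame (values made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with missing values
df = pd.DataFrame({'A': [1.0, 2.0, np.nan],
                   'B': [4.0, np.nan, np.nan],
                   'Name': ['x', None, 'z']})

print(df.isnull().sum())      # missing values per column
df.dropna(axis=0)             # drop rows with any NaN
df.dropna(axis=1)             # drop columns with any NaN
df.dropna(how='all')          # drop rows where all values are NaN
df.dropna(thresh=2)           # keep only rows with at least 2 non-NaN values
df.dropna(subset=['Name'])    # drop rows where 'Name' is missing
```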
Mean imputation: replace the missing value with the mean of its feature column
sklearn.preprocessing.Imputer; imr = Imputer(missing_values='NaN', strategy='mean', axis=0); imr.fit(df.values)
fit- learn parameters from training data
transform - apply the learned parameters; the array being transformed must have the same number of features as the array used for fitting
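A mean-imputation sketch on made-up numeric data. Note that the book-era Imputer class was later renamed; in current scikit-learn the equivalent is sklearn.impute.SimpleImputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer  # newer name for the older sklearn.preprocessing.Imputer

X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr.fit(X)                     # learn the per-column means from the training data
X_imputed = imr.transform(X)   # replace NaNs; input must have the same number of features as the fit data
```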
Use one-hot encoding but drop one column per categorical feature, to avoid linear dependence between columns (non-invertible matrices / multicollinearity)
E.g., clothing size is an ordinal feature
Create a size mapping from size label to number
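A sketch of an ordinal size mapping with pandas (the mapping values are an assumed ordering, chosen for illustration):

```python
import pandas as pd

# Hypothetical DataFrame with an ordinal 'size' feature
df = pd.DataFrame({'size': ['S', 'M', 'L', 'M']})

size_mapping = {'S': 1, 'M': 2, 'L': 3}    # assumed ordering of the categories
df['size'] = df['size'].map(size_mapping)

# Inverse mapping to recover the original labels if needed
inv_size_mapping = {v: k for k, v in size_mapping.items()}
```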
sklearn.preprocessing.LabelEncoder; cle=LabelEncoder(); y=cle.fit_transform(df['label'].values)
LabelEncoder class in sklearn.preprocessing for encoding labels
sklearn.preprocessing.OneHotEncoder(categorical_features=[column_num])
pd.get_dummies in pandas creates one-hot columns; use drop_first=True to avoid multicollinearity
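A get_dummies sketch on a hypothetical DataFrame. (Note: the categorical_features argument of OneHotEncoder was removed in newer scikit-learn versions, so get_dummies or a ColumnTransformer is often the simpler route today.)

```python
import pandas as pd

# Hypothetical DataFrame with a nominal 'color' feature and an already-encoded 'size'
df = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'size': [1, 2, 3]})

# One-hot encode the nominal column; drop_first=True removes one dummy column
# per feature to avoid multicollinearity
pd.get_dummies(df[['color', 'size']], drop_first=True)
```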
Standardization - subtract the mean and divide by the standard deviation (zero mean, unit variance)
sklearn.preprocessing.MinMaxScaler
sklearn.preprocessing.StandardScaler
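A scaling sketch on made-up data; both scalers are fit on the training data only and then reused for the test set:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[1.5, 15.0]])

# Normalization: rescale each feature to [0, 1]
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)        # reuse parameters learned on the training set

# Standardization: zero mean, unit variance per feature
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
```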
Collect more data
Introduce a complexity penalty
Choose a simpler model with fewer params
Reduce data dimensionality
L1 regularization differs from L2 by replacing the sum of squared weights with the sum of their absolute values
L1 Reg yields sparse feature vectors, with most weights zero. Useful when many features are irrelevant.
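In symbols, with weight vector w and regularization strength λ (standard notation, not from the notes): the L2 penalty is λ‖w‖² = λ·Σ_j w_j², while the L1 penalty is λ‖w‖₁ = λ·Σ_j |w_j|.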
sklearn.linear_model.LogisticRegression(penalty='l1')
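A sketch of sparse (L1) logistic regression on the Wine data; note that newer scikit-learn versions require a solver that supports L1, e.g. solver='liblinear' (an addition beyond the call above):

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# L1-penalized logistic regression; 'liblinear' supports the L1 penalty
lr = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
lr.fit(X_std, y)
print(lr.coef_)   # many weights are exactly zero -> sparse feature vectors
```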
Dimensionality Reduction
Feature Selection-select a subset of features
Feature Extraction-derive information to create a new feature subspace
Sequential feature selection algorithms are greedy: they reduce a d-dimensional feature space to a k-dimensional subspace by selecting the most relevant features
Sequential Backward Selection (SBS) minimizes the dimensionality of the feature subspace with a minimum decay in classifier performance
Steps: Initialize with k = d (the full feature set) and choose the desired number of features
- Determine the feature x⁻ whose removal maximizes the criterion J(X_k − x⁻)
- Remove x⁻: X_{k−1} = X_k − x⁻, k = k − 1
Terminate if k equals the desired number of features; otherwise repeat
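The book implements SBS from scratch; as a shorter alternative sketch, scikit-learn (0.24+) ships SequentialFeatureSelector, which performs the same greedy backward search but scores subsets by cross-validation rather than a held-out criterion J:

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=5)
# Greedy backward selection: repeatedly drop the feature whose removal
# hurts cross-validated accuracy the least, until 5 features remain
sbs = SequentialFeatureSelector(knn, n_features_to_select=5, direction='backward')
sbs.fit(X_std, y)
print(sbs.get_support())   # boolean mask of the selected features
```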
Assess feature importance with Random Forests
Measure feature importance from averaged impurity decrease computed from all decision trees
Access via the feature_importances_ attribute after fitting a RandomForestClassifier (sketch below).
❗ note that if 2 features are correlated, the more relevant one will be ranked highly, while the other will be undervalued
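A sketch of ranking features with a random forest on the Wine data (hyperparameters are illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)
feat_names = load_wine().feature_names

forest = RandomForestClassifier(n_estimators=500, random_state=1)
forest.fit(X, y)

# Importance = mean impurity decrease over all trees, normalized to sum to 1
importances = forest.feature_importances_
for name, imp in sorted(zip(feat_names, importances), key=lambda t: t[1], reverse=True):
    print(f'{name:30s} {imp:.4f}')
```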
Chapter 5 Compressing Data via Dimensionality Reduction
Unsupervised dimensionality reduction via principal component analysis
Main steps to PCA
Extracting PC step by step
Total and explained Variance
Feature Transformation
PCA in scikit
Supervised Data Compression via linear Discriminant analysis
PCA vs LDA
Inner workings of LDA: steps
Compute Scatter Matrices
Select Linear Discriminants for new feature subspace
Projecting samples onto new feature space
LDA via scikit
Using Kernel principal component analysis for nonlinear mappings
Kernel functions and kernel trick
Implementing KPCA in Python
Projecting new datapoints
KPCA in scikit learn
Summarize Data by transforming to lower dimensionality
Applications include exploratory data analysis and de-noising of signals
PCA identifies patterns based on correlations between features: it finds the directions of maximum variance in high-dimensional data and projects the data onto new axes (a subspace of equal or lower dimensionality)
Highly sensitive to Data Scaling
Steps
- Standardize the d-dimensional dataset
- Construct the covariance matrix
- Decompose the covariance matrix into its eigenvectors and eigenvalues
- Sort the eigenvalues in decreasing order
- Select the k eigenvectors corresponding to the k largest eigenvalues, where k is the dimensionality of the new subspace
- Construct the projection matrix W from the top k eigenvectors
- Transform the d-dimensional input using the projection matrix W
The explained variance ratio of an eigenvalue λ_j is λ_j divided by the sum of all eigenvalues (sketch below)
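A step-by-step NumPy sketch of the procedure above on the Wine data (standardize, covariance, eigendecomposition, explained variance ratio, projection):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# 1. Standardize the d-dimensional dataset
X, _ = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# 2. Covariance matrix (features x features)
cov_mat = np.cov(X_std.T)

# 3. Eigendecomposition (eigh, since covariance matrices are symmetric)
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)

# 4. Sort eigenvalues (and their eigenvectors) in decreasing order
order = np.argsort(eigen_vals)[::-1]
eigen_vals, eigen_vecs = eigen_vals[order], eigen_vecs[:, order]

# Explained variance ratio: lambda_j / sum of all eigenvalues
var_exp = eigen_vals / eigen_vals.sum()

# 5-6. Build the d x k projection matrix W from the top k = 2 eigenvectors
W = eigen_vecs[:, :2]

# 7. Project the d-dimensional data onto the new 2-dimensional subspace
X_pca = X_std.dot(W)
```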
PCA in scikit-learn is a transformer class: fit the model with the training data, then transform both the training data and the test dataset with the same model parameters.
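A scikit-learn PCA sketch showing fit on the training set and reuse of the same components for the test set (the train/test split parameters are illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)   # fit on training data only
X_test_pca = pca.transform(X_test_std)         # reuse the same components for the test set
print(pca.explained_variance_ratio_)
```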
LDA increases computational efficiency and reduces the degree of overfitting due to the curse of dimensionality in non-regularized models
LDA finds feature subspace that maximizes class separability
LDA is supervised, PCA is unsupervised
- Standardize d-dimensional data
- Compute d-dimensional mean vector for each class
- Construct the between-class scatter matrix S_B and the within-class scatter matrix S_W
- Compute the eigenvectors and corresponding eigenvalues of S_W⁻¹ S_B
- Sort the eigenvalues in decreasing order
- Choose the k eigenvectors corresponding to the k largest eigenvalues to construct the d×k transformation matrix W
- Project the samples onto the new subspace using the transformation matrix W
sklearn.discriminant_analysis.LinearDiscriminantAnalysis
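A scikit-learn LDA sketch; unlike PCA, fit_transform also takes the class labels (split and scaling shown for completeness, with illustrative parameters):

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

# LDA is supervised: it uses the class labels to find the discriminants
lda = LDA(n_components=2)
X_train_lda = lda.fit_transform(X_train_std, y_train)
X_test_lda = lda.transform(X_test_std)
```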
Define a nonlinear mapping function φ: ℝ^d → ℝ^k, where ℝ^k is a higher-dimensional space (k > d)
Computing the mapping explicitly is expensive; the solution is the kernel trick
The kernel computes the similarity (dot product) between two samples in the high-dimensional space using only the original feature space
Most commonly used kernels
Polynomial kernel
Sigmoid kernel
Radial Basis Function/Gaussian Kernel
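A kernel PCA sketch with an RBF kernel on half-moon data, which is not linearly separable in its original 2-D space (gamma is an illustrative tuning value):

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

# Nonlinearly separable half-moon data
X, y = make_moons(n_samples=100, random_state=123)

# RBF (Gaussian) kernel PCA; gamma is a tuning parameter of the kernel
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)

# New points can be projected with the same fitted model:
# X_new_kpca = kpca.transform(X_new)
```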