Chapter 4 Building Training Sets + Preprocessing

Missing Values Problem

L1 and L2 regularization

Estimators - classifiers with an API similar to the transformer class

Feature Scaling

Handling Categorical Data

scikit-learn Transformer classes, used for data transformation

Important for scale-dependent algorithms, like gradient descent (GD).

Normalization - rescale each feature to the range [0, 1] (min-max scaling)

2 types: Nominal - categories with no order (e.g., color)

Ordinal - categories that can be ordered (e.g., t-shirt size)

drop missing values with pandas DataFrame methods

Imputer, a transformer

key methods

have predict method

have transform method

Generalization Error Problem

L2 Regularization introduces a penalty for large individual weights

Topics

Remove/impute missing values

Get categorical data into shape via one hot

Select relevant features

.isnull().sum() counts missing values per column

dropna(axis=0) drops rows with missing values; dropna(axis=1) drops columns with missing values
dropna(how='all') drops only rows where every value is NaN; dropna(thresh=x) drops rows with fewer than x non-NaN values
dropna(subset=['Name']) drops rows with missing values in the 'Name' column
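A minimal sketch (the toy DataFrame and its column names are made up) illustrating the counting and dropna variants above:

```python
import numpy as np
import pandas as pd

# Toy DataFrame with scattered missing values (illustrative only)
df = pd.DataFrame({'A': [1.0, 2.0, np.nan],
                   'B': [4.0, np.nan, np.nan],
                   'Name': ['x', None, 'z']})

print(df.isnull().sum())           # missing values per column
print(df.dropna(axis=0))           # drop rows containing any NaN
print(df.dropna(axis=1))           # drop columns containing any NaN
print(df.dropna(how='all'))        # drop rows where every value is NaN
print(df.dropna(thresh=2))         # drop rows with fewer than 2 non-NaN values
print(df.dropna(subset=['Name']))  # drop rows missing the 'Name' column
```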

Mean imputation: replace the missing value with the mean of the entire feature column

sklearn.preprocessing.Imputer; imr = Imputer(missing_values='NaN', strategy='mean', axis=0); imr.fit(df.values)

fit - learn parameters from the training data

transform - apply those parameters; any array being transformed must have the same number of features as the data array used for fitting
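A small sketch of mean imputation using the older sklearn.preprocessing.Imputer API from the note above (the data is made up; newer scikit-learn releases moved this to sklearn.impute.SimpleImputer):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Imputer  # SimpleImputer in scikit-learn >= 0.22

df = pd.DataFrame([[1.0, 2.0, np.nan],
                   [4.0, np.nan, 6.0],
                   [7.0, 8.0, 9.0]])

imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr = imr.fit(df.values)                 # fit: learn the per-column means
imputed_data = imr.transform(df.values)  # transform: fill NaNs with those means
```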

Use one-hot encoding but drop one column per categorical feature, to avoid multicollinearity (linearly dependent, non-invertible matrices)

E.g., clothing size (an ordinal feature)

create a size mapping from size to number
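A minimal sketch of such a mapping, assuming a hypothetical 'size' column with the ordering M < L < XL:

```python
import pandas as pd

df = pd.DataFrame({'size': ['M', 'L', 'XL', 'M']})
size_mapping = {'M': 1, 'L': 2, 'XL': 3}   # assumed ordinal encoding
df['size'] = df['size'].map(size_mapping)  # now an integer column
```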

sklearn.preprocessing.LabelEncoder; cle=LabelEncoder(); y=cle.fit_transform(df['label'].values)

LabelEncoder class in sklearn.preprocessing for encoding labels

.preprocessing.OneHotEncoder(categorical_features=[column_num])

pd.get_dummies creates one-hot columns; use drop_first=True to avoid multicollinearity
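A sketch combining LabelEncoder and get_dummies on made-up data (column names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'size': [1, 2, 3],
                   'classlabel': ['class1', 'class2', 'class1']})

# Encode class labels as integers
cle = LabelEncoder()
y = cle.fit_transform(df['classlabel'].values)

# One-hot encode the nominal 'color' column; drop_first avoids multicollinearity
X = pd.get_dummies(df[['color', 'size']], drop_first=True)
```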

Standardization - center each feature at mean 0 with unit variance (subtract the mean, divide by the standard deviation)

.preprocessing.MinMaxScaler

preprocessing.StandardScaler
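A short sketch contrasting both scalers on a toy feature matrix; fit only on the training data and reuse the learned parameters on the test data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0, -1.0], [2.0, 0.0], [3.0, 1.0]])  # illustrative data
X_test = np.array([[1.5, 0.5]])

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)   # rescale each feature to [0, 1]
X_test_norm = mms.transform(X_test)         # same min/max as the training data

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)  # zero mean, unit variance per feature
X_test_std = stdsc.transform(X_test)
```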

Collect more data

Introduce a complexity penalty

Choose a simpler model with fewer params

Reduce data dimensionality

L1 Regularization varies from L2 by replacing the square of the weights with their absolute values

L1 Reg yields sparse feature vectors, with most weights zero. Useful when many features are irrelevant.
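Written out under the standard formulation (lambda is the regularization strength, m the number of weights): the L2 penalty is lambda * ||w||_2^2 = lambda * sum_{j=1..m} w_j^2, while the L1 penalty is lambda * ||w||_1 = lambda * sum_{j=1..m} |w_j|.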

.linear_model.LogisticRegression(penalty='l1')
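A hedged sketch of L1-regularized logistic regression producing sparse weights; the wine dataset, C value, and solver are illustrative choices, not from the notes:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

lr = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
lr.fit(X_std, y)
print(lr.coef_)   # many coefficients are driven to exactly zero
```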

Dimensionality Reduction

Feature Selection-select a subset of features

Feature Extraction-derive information to create a new feature subspace

Sequential feature selection algorithms are greedy; they reduce an initial d-dimensional feature space to a smaller subspace by selecting the most relevant features

Sequential Backward Selection (SBS) reduces the dimensionality of the feature subspace with minimum decay in classifier performance

Steps (see the sketch after this list):

  1. Initialize the algorithm with k = d, where d is the dimensionality of the full feature space, and choose the target number of features
  2. Determine the feature x- whose removal maximizes the criterion J(X_k - x-)
  3. Remove x- from the feature set; k = k - 1
  4. Terminate if k equals the target number of features; otherwise repeat from step 2
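A compact sketch of the SBS loop, assuming accuracy as the criterion J and a held-out test set for scoring; this is a simplified stand-in, not the book's full SBS class:

```python
from itertools import combinations

import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def sbs(estimator, X_train, y_train, X_test, y_test, k_features):
    """Greedily remove one feature at a time until k_features remain."""
    dims = tuple(range(X_train.shape[1]))        # start with all feature indices
    while len(dims) > k_features:
        scores, subsets = [], []
        for subset in combinations(dims, len(dims) - 1):
            cols = list(subset)
            est = clone(estimator).fit(X_train[:, cols], y_train)
            scores.append(accuracy_score(y_test, est.predict(X_test[:, cols])))
            subsets.append(subset)
        dims = subsets[int(np.argmax(scores))]   # keep the best-scoring subset
    return dims

# Usage (illustrative): sbs(KNeighborsClassifier(), X_tr, y_tr, X_te, y_te, 3)
```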

Assess feature importance with Random Forests

Measure feature importance from averaged impurity decrease computed from all decision trees

access via the feature_importances_ attribute after fitting a RandomForestClassifier

❗ note that if 2 features are correlated, the more relevant one will be ranked highly, while the other will be undervalued
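A short sketch of ranking features by importance; the wine dataset and hyperparameters are illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
forest = RandomForestClassifier(n_estimators=500, random_state=1)
forest.fit(data.data, data.target)

importances = forest.feature_importances_   # mean impurity decrease per feature
for idx in np.argsort(importances)[::-1]:
    print(f"{data.feature_names[idx]:30s} {importances[idx]:.3f}")
```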

Ch 5: Compressing Data via Dimensionality Reduction

Unsupervised dimensionality reduction via principal component analysis

Main steps to PCA

Extracting PC step by step

Total and explained Variance

Feature Transformation

PCA in scikit

Supervised Data Compression via linear Discriminant analysis

PCA vs LDA

Inner workings of LDA: steps

Compute Scatter Matrices

Select Linear Discriminants for new feature subspace

Projecting samples onto new feature space

LDA via scikit

Using Kernel principal component analysis for nonlinear mappings

Kernel functions and kernel trick

Implementing KPCA analysis in python

Projecting new datapoints

KPCA in scikit learn

Summarize Data by transforming to lower dimensionality

Apps include exploratory data analysis and de-noising of signals

Identify patterns in data based on correlations between features: find the directions of maximum variance in the high-dimensional data and project it onto new axes

Highly sensitive to Data Scaling

Steps

  1. Standardize the d-dimensional dataset
  2. Construct the covariance matrix
  3. Decompose the covariance matrix into its eigenvectors and eigenvalues
  4. Sort the eigenvalues in decreasing order
  5. Select the k eigenvectors corresponding to the k largest eigenvalues, where k is the dimensionality of the new subspace
  6. Construct the projection matrix W from the top k eigenvectors
  7. Transform the d-dimensional input dataset using the projection matrix W

The explained variance ratio of an eigenvalue is that eigenvalue divided by the sum of all eigenvalues
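A step-by-step NumPy sketch of the list above on made-up data, including the explained variance ratios:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))                     # illustrative data, d = 3
X_std = (X - X.mean(axis=0)) / X.std(axis=0)      # 1. standardize

cov_mat = np.cov(X_std.T)                         # 2. covariance matrix
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)  # 3. eigendecomposition
order = np.argsort(eigen_vals)[::-1]              # 4. sort eigenvalues (descending)
eigen_vals, eigen_vecs = eigen_vals[order], eigen_vecs[:, order]

k = 2
W = eigen_vecs[:, :k]                             # 5-6. d x k projection matrix W
X_pca = X_std.dot(W)                              # 7. project onto the new subspace

var_exp = eigen_vals / eigen_vals.sum()           # explained variance ratios
```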

PCA in scikit-learn is a transformer class: fit the model with the training data, then transform both the training data and the test dataset with the same model parameters.
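A sketch of that fit/transform pattern with scikit-learn's PCA; the wine dataset and split are illustrative:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)       # fit the scaler on training data only
X_test_std = sc.transform(X_test)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)  # learn components from training data
X_test_pca = pca.transform(X_test_std)        # reuse the same components
print(pca.explained_variance_ratio_)
```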

LDA increases computational efficiency and reduces the degree of overfitting due to the curse of dimensionality in non-regularized models

LDA finds feature subspace that maximizes class separability

LDA is supervised, PCA is unsupervised

  1. Standardize the d-dimensional dataset
  2. Compute the d-dimensional mean vector for each class
  3. Construct the between-class scatter matrix S_b and the within-class scatter matrix S_w
  4. Compute the eigenvectors and corresponding eigenvalues of S_w^-1 * S_b
  5. Sort the eigenvalues by decreasing order
  6. Choose the k eigenvectors corresponding to the k largest eigenvalues to construct the d x k transformation matrix W
  7. Project the samples onto the new feature subspace using the transformation matrix W

.discriminant_analysis.LinearDiscriminantAnalysis
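A minimal sketch of supervised dimensionality reduction with that class; the wine dataset is an illustrative choice:

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

lda = LDA(n_components=2)            # at most (n_classes - 1) discriminants
X_lda = lda.fit_transform(X_std, y)  # supervised: fit uses the class labels
```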

We define a nonlinear mapping function phi from R^d to R^k, where R^k is a higher-dimensional space (k > d)

Mapping is expensive. Solution is Kernel Trick

Kernel trick: compute the similarity between two high-dimensional feature vectors directly in the original feature space, without explicitly performing the mapping

Most commonly used kernels

Polynomial kernel

Sigmoid kernel

Radial Basis Function/Gaussian Kernel
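A short sketch of kernel PCA with an RBF kernel on nonlinearly separable toy data; the dataset and gamma value are illustrative:

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

X, y = make_moons(n_samples=100, random_state=123)  # two interleaving half-moons
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)                      # nonlinear projection
```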
