Tree-based Learning
Tree-Pruning
Bootstrap Aggregating (Bagging)
Random Forest
sklearn.ensemble 🏁
RandomForestClassifier()/Regressor()
Parameters
n_estimators
criterion: {"gini", "entropy"}
max_depth=None
min_samples_split=2
min_samples_leaf=1
max_features='auto'
max_leaf_nodes=None
bootstrap=True
oob_score=False
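A minimal sklearn sketch of these parameters; the toy dataset and the chosen values are only illustrative (note that 'auto' for max_features is deprecated in recent sklearn versions, so 'sqrt' is used here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stand-in data just to exercise the API
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,        # number of trees
    criterion="gini",        # or "entropy"
    max_depth=None,          # grow trees fully by default
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",     # replaces the deprecated 'auto' for classifiers
    bootstrap=True,
    oob_score=True,          # enables the out-of-bag estimate used below
    random_state=0,
)
rf.fit(X_train, y_train)
print(rf.oob_score_, rf.score(X_test, y_test))
```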
Boosting pt.I
Gradient Boosting Machine
Boosting pt.II
XGB Features
Out-Of-Core Computing
Distributed Computing
Parallelization
Regression
Regularization
Bootstrap algorithm
Sample training data with replacement
Properties
Classification Trees
Algorithm
1) Divide predictor space X1,..., Xp into J distinct and non-overlapping regions R1,...RJ
2) For every observation that falls into region Rj, prediction = the most commonly occurring class among the training observations in Rj (for regression trees: the mean of their response values)
Splitting Metrics
Classification
Entropy
H = −Σ(j=1..c) p_j · log(p_j)
Gini Impurity
Probability that a randomly chosen sample in the node would be incorrectly labeled if it were labeled according to the distribution of classes in the node
Less Impurity = More Homogeneity
G = Σ(i=1..C) p_i(1 − p_i), where C = number of classes
(binary case: G = 1 − p² − q²; node purity = p² + q²)
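A small sketch of the two metrics above computed from class proportions (plain Python; function names are illustrative):

```python
import math

def entropy(p):
    """H = -sum(p_j * log2(p_j)) over classes with p_j > 0."""
    return -sum(pj * math.log2(pj) for pj in p if pj > 0)

def gini_impurity(p):
    """G = sum(p_i * (1 - p_i)) = 1 - sum(p_i^2)."""
    return 1 - sum(pi ** 2 for pi in p)

# binary node with p = 0.7, q = 0.3
print(entropy([0.7, 0.3]))        # ~0.881
print(gini_impurity([0.7, 0.3]))  # 1 - 0.49 - 0.09 = 0.42
```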
Pros and Cons
Pros ✅
- Easy to explain to people (interpretability)
- Can be displayed graphically
- Handle qualitative predictors without needing dummy variables
Cons 🚩
- Small change in data => large change in estimation
- Prone to overfitting
Algorithm
Out-Of-Bag Error
Estimates test error without needing cross-validation or a separate validation dataset
OOB samples = Observations not in bootstrapped data
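A tiny numpy sketch of how OOB samples arise from a single bootstrap draw (names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                           # original dataset size
boot_idx = rng.integers(0, n, size=n)            # sample with replacement, same size as original
oob_idx = np.setdiff1d(np.arange(n), boot_idx)   # rows never drawn = out-of-bag

print("bootstrap rows:", boot_idx)
print("OOB rows:", oob_idx)                      # roughly a third of the rows on average
```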
Less Entropy = More Homogeneity
Regression
Variance
Residual Sum of Squares RSS
RSS = Σ(j=1..J) Σ(i∈Rj) (y_i − ŷ_Rj)²
Variance = Σ(X − X̄)² / n, where X̄ = mean, n = number of values
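A short sketch of scoring one candidate regression split with RSS (numpy; the sample values are made up):

```python
import numpy as np

def rss(y):
    """Residual sum of squares of a node when predicting its mean."""
    return np.sum((y - y.mean()) ** 2)

y_left = np.array([3.0, 4.0, 5.0])
y_right = np.array([10.0, 12.0])
split_rss = rss(y_left) + rss(y_right)   # total RSS of the candidate split
print(split_rss)                          # 2.0 + 2.0 = 4.0
```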
Information gain = Entropy(parent) − weighted average entropy of the children after the split
- Calculate Gini impurity (depending on the type of variable)
Binary Categorical
1) Calculate Gini impurity for each variable; 1 variable = 1 split
2) Calculate weighted average of the leaf node impurities of the split
Continuous Numeric
1) Sort the values of the variable in ascending order
2) Calculate the average of each pair of adjacent values
3) Calculate the Gini impurity for each average value; < avg value = 1 split
4) Calculate weighted average of the leaf node impurities of each split
5) Take the "average" with minimum Gini impurity and split the data on it (see the split-search sketch below)
Discrete Numeric
1) Calculate Gini impurity for all the discrete values except the largest value in the variable; < d1 = 1 split
2) Calculate weighted average of the leaf node impurities of each split
Multi-value Categorical
1) Calculate Gini impurity for all values and all possible combinations except (V1 or V2 … or Vn) where n = number of values in the variable; V1 (or V2 …) = 1 split
2) Calculate weighted average of the leaf node impurities of each split
- If there is no parent node, take the split with the lowest Gini impurity as the root node
If the parent node itself has the lowest score, do not split => the parent node becomes a leaf node
If not, take the split with the lowest Gini impurity
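A small sketch of the Continuous Numeric procedure above (pure numpy; function names and data are illustrative):

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def best_continuous_split(x, y):
    """Try the midpoint of each pair of adjacent sorted values; return the
    threshold whose split has the lowest weighted-average Gini impurity."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.inf)
    for i in range(len(x) - 1):
        thr = (x[i] + x[i + 1]) / 2              # average of adjacent values
        left, right = y[x < thr], y[x >= thr]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if w < best[1]:
            best = (thr, w)
    return best

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0])
y = np.array([0, 0, 0, 1, 1])
print(best_continuous_split(x, y))   # threshold 6.5, weighted impurity 0.0
```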
rf.feature_importances_
Individual tree => High variance; Forest of trees => Low variance without High bias;
Bagging
Boosting
XGBoost (Extreme Gradient Boosting)
Set a threshold so that a split is kept only if it reduces impurity by a large enough amount
- Repeat steps 1-2 until there are only leaves left
Bootstrapped dataset size = Original dataset size
- Create a decision tree using the bootstrapped dataset but only use a random subset of variables at each step of building the tree
- Repeat step 1 with new bootstrapped data for the predefined number of trees
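A hand-rolled sketch of this loop (sklearn's RandomForestClassifier does all of this internally; the helper names here are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, seed=0):
    """Bagging loop: bootstrap rows + random feature subset per split
    (max_features='sqrt' delegates the per-split feature sampling to the tree)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)           # bootstrap sample, same size as original
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(10**9)))
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def predict_forest(forest, X):
    """Majority vote across all trees (assumes integer class labels)."""
    votes = np.stack([t.predict(X) for t in forest])
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```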
Prediction
Regression
Average of the predictions of all the trees
Classification
Class with the most votes across all the trees
Proportion of OOB samples that were incorrectly predicted/ classified
Choose the number of variables to use at each split via cross-validation.
Missing Values
Filling Methods
Most common value
Average (mean, median, mode) [numerical]
Find another column with highest correlation
Deduce based on it (categorical)
Do linear regression on two columns and use least squares line to predict the unknown value (numerical)
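A quick pandas sketch of the simpler fill methods in this list (the DataFrame is made up; the regression-based fill would use the most correlated column instead):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", None, "red"],   # categorical
    "weight": [1.2, np.nan, 0.9, 1.1],       # numerical
})

# most common value for the categorical column
df["color"] = df["color"].fillna(df["color"].mode()[0])

# mean (could also be median) for the numerical column
df["weight"] = df["weight"].fillna(df["weight"].mean())
print(df)
```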
RF for Clustering
Regression Trees
Algorithm
1 predictor
>1 predictors
Cost Complexity Pruning
Weakest Link Pruning
- Build the full-sized regression tree using ALL the data (training + testing);
- Calculate the Tree Score and increase alpha until pruning leaves gives a lower Tree Score; initial alpha = 0;
- Drop the lowest leaves and repeat 1-2 until only 1 root node is left as sub-tree
- Divide the data into testing and training
- Using only the training data, build the full tree and the sub-trees that minimize the Tree Score for each of the alpha values found above
- Calculate RSS for each tree using only Testing data
- Take note of the tree and alpha with minimum RSS
- Create new Training & Testing data
- Repeat steps 5-8 for the defined number of folds (n-fold cross-validation)
Final alpha = alpha that gives LOWEST AVERAGE RSS with test data
Final tree = the tree/sub-tree built in steps 1-3 (on all the data) whose alpha equals the final alpha
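sklearn exposes this idea through ccp_alpha; a sketch of picking alpha by cross-validation on toy data (the candidate alphas come from the full tree's pruning path):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=4, noise=10, random_state=0)

# candidate alphas from the full tree's cost-complexity pruning path
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# cross-validate each alpha and keep the one with the best mean CV score (R^2 by default)
scores = [
    cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
final_tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```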
Algorithm: 1 predictor
1) Take the (nth, n+1th) smallest observations of the predictor variable (n = 1 initially)
2) Take their average and use that "average" as the node's split: the average response of the observations on the left (observations < "average") becomes the left child's prediction and the average response of the observations on the right (observations >= "average") becomes the right child's prediction
3) Calculate the residual sum of squares RSS = Σ(y_i − ŷ_i)² over all observations
4) Plot RSS against the threshold values of the predictor variable ("average")
5) Set n = n + 1
6) Repeat steps 1-5 until n == N (N = number of observations)
7) Take the threshold with minimum RSS and split the tree (see the split-search sketch after this section)
8) Repeat steps 1-7 until the number of observations in each leaf < the minimum number of observations required to split a node
>1 predictors
- For each variable, do steps 1-7 of the single-variable algorithm and take the node with minimum RSS as the candidate for node(i)
- Compare the RSS of the candidates and take the node with the lowest RSS as node(i) of the tree
- Grow the tree doing steps 1-2 until num of observations in each node(i)/(leaf) <= min num of observations required to split the node
- Calculate RSS for each tree with different depths.
- Take the tree with minimum RSS.
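A compact numpy sketch of the RSS split search described above, for one predictor and for several (function names and data are illustrative):

```python
import numpy as np

def best_rss_split(x, y):
    """Single-predictor scan: midpoints of adjacent sorted values, pick minimum RSS."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (None, np.inf)
    for i in range(len(xs) - 1):
        thr = (xs[i] + xs[i + 1]) / 2
        left, right = ys[xs < thr], ys[xs >= thr]
        rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if rss < best[1]:
            best = (thr, rss)
    return best

def best_split_multi(X, y):
    """>1 predictors: run the single-variable scan per column, keep the lowest RSS."""
    candidates = [(j, *best_rss_split(X[:, j], y)) for j in range(X.shape[1])]
    return min(candidates, key=lambda c: c[2])   # (column, threshold, rss)

X = np.array([[1.0, 7.0], [2.0, 3.0], [3.0, 9.0], [8.0, 4.0], [9.0, 8.0]])
y = np.array([5.0, 6.0, 5.0, 20.0, 22.0])
print(best_split_multi(X, y))    # column 0, threshold 5.5
```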
1) When alpha = 0, we build the full-sized tree, since it will have the lowest Tree Score.
2) When alpha = a(i+1) (e.g. 10,000), we get a lower Tree Score if we prune the leaves and use the next sub-tree instead (i = 1 for the first step).
3) Repeat step 2 until i == M (M = number of trees/sub-trees)
Tree Score = RSS + alpha × T (tree complexity penalty)
Alpha = Tuning parameter (cross validation)
T = number of leaves
Regression
- Take average of all response values as the prediction of the first tree.
- Calculate Pseudo Residual= Observation – Prediction
- Build a tree with defined number of leaves to predict the previous residuals from predictor variables.
Prediction/output of each leaf = (sum of the residuals of all samples in the leaf) / (number of residuals in the leaf)
- Prediction = 1st tree prediction + Σj=2->(i-1) (lr x jth tree prediction) + (lr x current tree prediction);
where ith tree = current tree; lr=learning rate
- Repeat steps 2-4 until max number of trees or adding additional trees does not significantly reduce the residuals
Prediction = 1st tree prediction + Σj=2->M (lr x jth tree prediction)
M=total number of trees; lr=learning rate
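A from-scratch sketch of this loop (sklearn trees fit to the residuals; all names and hyperparameter values are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_regressor(X, y, n_trees=50, lr=0.1, max_leaves=8):
    """Gradient boosting for regression as described above:
    start from the mean, then repeatedly fit small trees to the residuals."""
    base = y.mean()                          # 1st "tree": the average response
    pred = np.full_like(y, base, dtype=float)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                 # pseudo-residuals
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves).fit(X, residuals)
        pred = pred + lr * tree.predict(X)
        trees.append(tree)
    return base, trees

def predict_gbm(base, trees, X, lr=0.1):
    return base + lr * sum(t.predict(X) for t in trees)
```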
Classification
- Initial Prediction = ln(odds) for all the response values; odds = p/(1-p);
Probability of event for each observation = e^prediction/(1+e^prediction);
- Calculate Residual = (Observation-Probability);
If the event happened, observation = 1; if not, observation = 0.
- Build a tree with defined number of leaves to predict the residuals from the predictor variables.
- Calculate probability of event for each observation;
- Repeat steps 2-5 until maximum number of trees or residuals get super small.
Prediction for each sample = 1st tree prediction + Σj=2->(i-1) (lr x jth tree prediction) + (lr x current tree prediction) ;
where ith tree=current tree; lr = learning rate;
Prediction/output of each leaf = Σ residuals in the leaf / Σ [previous probability × (1 − previous probability)] over the samples in the leaf
Prediction = 1st tree prediction + Σj=2->M (lr x jth tree prediction),
where M = total number of trees; lr = learning rate
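A worked sketch of the log-odds bookkeeping above; the leaf-output formula is the one the example at the end of these notes uses (-0.5 / (0.5 * 0.5) = -2):

```python
import numpy as np

def log_odds(p):
    return np.log(p / (1 - p))

def prob(pred):
    return np.exp(pred) / (1 + np.exp(pred))

def leaf_output(residuals, prev_probs):
    """Leaf output = sum(residuals) / sum(prev_prob * (1 - prev_prob)) over the leaf."""
    return np.sum(residuals) / np.sum(prev_probs * (1 - prev_probs))

# one-sample leaf with residual -0.5 and previous probability 0.5:
print(leaf_output(np.array([-0.5]), np.array([0.5])))   # -0.5 / 0.25 = -2.0
```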
Regression
- Initial prediction = 0.5. Calculate Residual= Actual-Prediction for each observation.
- XGBoost Tree for Regression
1) Each tree starts out as single leaf w. all the residuals
2) Calculate Quality/Similarity Score for parent root node;
3) Take (nth,n+1th) smallest observations of the predictor variable (n=1 initially) of the parent node to be split
4) Take the average and use that "average" as the condition (<"average") for the node to split, with the residuals on the left (observations < "average") as the left child and the residuals on the right (observations >= "average") as the right child
5) Calculate SS for each leaf
6) Calculate the Gain of the tree (with split <"average")
7) n=n+1
8) Repeat 3)-7) until n==N where N=total number of observations in the parent node to be split
9) Use threshold (<"average") that gives the largest GAIN as the branch in the tree
10) Repeat 3)-9) until it reaches max depth of tree (default=6 levels) OR min num of samples to be split [Cover]
- Calculate prediction for each sample.
Output/prediction of each leaf = Sum of residuals of all samples in leaf/(num of residuals+λ)
Prediction for each sample= 1st tree prediction + Σj=2->(i-1) (lr x jth tree prediction) + (lr x current tree prediction)
where ith tree=current tree; lr = learning rate (default=0.3)
- Repeat #1-#3 until max num of trees OR residuals are super small.
Faster compared to other GBMs ⭐
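A hedged usage sketch with the xgboost library (assumes the xgboost package is installed; the notes' λ, γ, lr and Cover map to reg_lambda, gamma, learning_rate and min_child_weight):

```python
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBRegressor(
    n_estimators=200,
    learning_rate=0.3,      # lr in the notes (default 0.3)
    max_depth=6,            # default tree depth
    reg_lambda=1.0,         # λ in the Similarity Score (library default is 1)
    gamma=0.0,              # γ pruning / tree-complexity threshold
    min_child_weight=1,     # minimum Cover per leaf
)
model.fit(X_tr, y_tr)
print(r2_score(y_te, model.predict(X_te)))
```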
AdaBoost
LightGBM
What
The Nth model's errors are reduced by the (N+1)th model
Fits models sequentially until no further improvement can be made
Similarity Score = (sum of all residuals in the leaf)² / (number of residuals + λ)
where λ = regularization parameter, default = 0;
Gain=SS left child + SS right child – SS root
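A tiny sketch of these two formulas for the regression case (function names and residual values are illustrative):

```python
import numpy as np

def similarity_score(residuals, lam=0.0):
    """XGBoost regression similarity: (sum of residuals)^2 / (count + lambda)."""
    return np.sum(residuals) ** 2 / (len(residuals) + lam)

def gain(left, right, lam=0.0):
    root = np.concatenate([left, right])
    return (similarity_score(left, lam) + similarity_score(right, lam)
            - similarity_score(root, lam))

left, right = np.array([-10.5, -7.5]), np.array([6.5, 7.5])
print(gain(left, right))          # large gain => good split
print(gain(left, right, lam=2.0)) # bigger lambda shrinks the scores and the gain
```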
Tree Pruning
Tree Complexity
- Pick a number for γ (Tree complexity parameter)
⏫ λ => ⏬ SS => ⏬ Gain => more pruning (Gain more likely to fall below γ)
If an ancestor node's Gain < γ but a descendant node's Gain > γ, keep the ancestor node (pruning starts from the lowest branch and moves up)
- If the Gain of a node > γ, keep that node/split; if not, remove/prune that node/split
Cover
Often set minimum Cover (min_child_weight) = 0 for classification so that small leaves are not rejected
What?
Cover = the denominator of the Similarity Score minus λ
2nd derivative of loss function
min_child_weight
default=1(✅ Regression)
Min num of Residuals in each leaf to split
Classification
- Initial probability = 0.5; initial prediction = ln(p/(1-p)) = 0;
Calculate Residual = Actual - Probability; Event happening = Actual = 1;
- XGBoost Tree for Classification
1) Each tree starts out as single leaf w. all the residuals
2) Calculate Quality/Similarity Score for root node
Where λ=regularization parameter, default=0;
3) Take (nth,n+1th) largest observations of the predictor variable (n=1 initially) of the parent node to be split
4) Take the average of them and use that "average" as the condition (<"average") for the node to split, with the residuals on the left (observations < "average") as the left child and the residuals on the right (observations >= "average") as the right child
5) Calculate SS for each leaf
6) Calculate Gain of tree (w. split "<average")
7) n=n+1
8) Repeat 3)-7) until n==N where N=total number of observations in the parent node to be split
9) Use the threshold (<"average") that gives the largest GAIN as the branch in the tree
10) Repeat 3)-9) until it reaches max depth of tree (default=6 levels) OR min num of samples to be split [Cover]
Gain=SS left child + SS right child – SS root
- Calculate prediction for each sample
1) Calculate output / prediction of each leaf in tree
2)Prediction for each sample= 1st tree prediction + Σj=2->(i-1) (lr x jth tree prediction) + (lr x current tree prediction);
where ith tree=current tree; lr = learning rate (default=0.3);
- Repeat #1-#3 until max num of trees OR residuals are super small.
3) Probability = e^(prediction)/(1+e^(prediction))
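A sketch of the classification-side scores, assuming the denominator follows the Cover definition given above (Σ of previous probability × (1 − previous probability)); names and values are illustrative:

```python
import numpy as np

def similarity_score_cls(residuals, prev_probs, lam=0.0):
    """XGBoost classification similarity:
    (sum of residuals)^2 / (sum of p*(1-p) + lambda)."""
    return np.sum(residuals) ** 2 / (np.sum(prev_probs * (1 - prev_probs)) + lam)

def cover_cls(prev_probs):
    """Cover = the similarity-score denominator minus lambda."""
    return np.sum(prev_probs * (1 - prev_probs))

residuals = np.array([0.5, 0.5, -0.5])
prev_probs = np.array([0.5, 0.5, 0.5])
print(similarity_score_cls(residuals, prev_probs))  # 0.25 / 0.75 = 0.333...
print(cover_cls(prev_probs))                        # 0.75, below the default min_child_weight of 1
```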
Gini impurity = 1 − Gini = 1 − Σ p_i²
n = number of classes
Since the Gini impurity of the split on Class < the split on Performance => split on Class ✅
Probability = e^(pred)/(1+e^(pred))
Information gain of decision split
e.g. for leftmost leaf, output = -0.5 / (0.5 * 0.5) = -2