Tree-based Learning

Tree-Pruning

Bootstrap Aggregating (Bagging)

Random Forest

sklearn.ensemble 🏁

RandomForestClassifier()/Regressor()

Parameters

n_estimators

criterion={"gini", "entropy"}

max_depth=None

min_samples_split=2

min_samples_leaf=1

max_features='auto'

max_leaf_nodes=None

bootstrap=True

oob_score=False
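
A minimal usage sketch with the parameters above (toy data from make_classification is only a placeholder; note that 'auto' for max_features is deprecated in newer sklearn):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

rf = RandomForestClassifier(
    n_estimators=100,       # number of trees in the forest
    criterion="gini",       # or "entropy"
    max_depth=None,         # grow each tree until leaves are pure (or min_samples_split is hit)
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",    # 'auto' is deprecated in newer sklearn; 'sqrt' is the classifier default
    max_leaf_nodes=None,
    bootstrap=True,         # sample the training data with replacement
    oob_score=False,        # set True for an out-of-bag error estimate (see OOB section)
    random_state=42,
)
rf.fit(X, y)
print(rf.predict(X[:5]))
```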

Boosting pt.I

Gradient Boosting Machine

Boosting pt.II

XGB Features

Out-Of-Core Computing

Distributed Computing

Parallelization

Regression

Regularization

Bootstrap algorithm

Sample training data with replacement

Properties

Classification Trees

Algorithm

1) Divide predictor space X1,..., Xp into J distinct and non-overlapping regions R1,...RJ

2) For every observation that falls into region Rj, prediction = mean of response values for the training observations in Rj

Splitting Metrics

Classification

Entropy

H(E) = −∑(j=1->c) p_j log(p_j)

Gini Impurity

Probability that a randomly chosen sample in a node would be incorrectly labeled if it were labeled according to the distribution of samples in the node

Less Impurity = More Homogeneity

G = ∑(i=1->C) p_i(1 − p_i) = 1 − ∑(i=1->C) p_i², where C = number of classes
(binary case: purity = p² + q², impurity = 1 − p² − q²)
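
A small sketch (not from the source) of both splitting metrics computed from class proportions:

```python
import numpy as np

def entropy(p):
    """H = -sum(p_j * log2(p_j)); zero-probability classes contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def gini_impurity(p):
    """G = sum(p_i * (1 - p_i)) = 1 - sum(p_i^2)."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

# A node with class proportions 0.5 / 0.5 is maximally impure (2 classes)
print(entropy([0.5, 0.5]))        # 1.0
print(gini_impurity([0.5, 0.5]))  # 0.5
```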

Pros and Cons

Pros ✅

Cons 🚩

Easy to explain to people (Interpretability)

Small change in data => Large change in estimation

Overfitting

Graphically displayed

Handle Qualitative predictors without ❌ dummy variables

Algorithm

Out-Of-Bag Error

Without the ❌ need for cross-validation or validation dataset

OOB samples = Observations not in bootstrapped data

Less Entropy = More Homogeneity

Regression

Variance

Residual Sum of Squares RSS

RSS = ∑(j=1->J) ∑(i∈Rj) (y_i − ŷ_Rj)², where ŷ_Rj = mean response of the training observations in Rj

Variance = ∑(X − X̄)² / n, where X̄ = mean, n = number of values
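
A quick sketch of both regression metrics; the region values are made-up numbers:

```python
import numpy as np

def rss(y):
    """RSS of one region: squared deviations from the region's mean prediction."""
    y = np.asarray(y, dtype=float)
    return float(np.sum((y - y.mean()) ** 2))

def variance(y):
    """sum((X - mean)^2) / n."""
    y = np.asarray(y, dtype=float)
    return float(np.mean((y - y.mean()) ** 2))

y_region = [3.0, 5.0, 7.0]                # made-up response values in one region
print(rss(y_region), variance(y_region))  # 8.0  2.666...
```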

Information gain = Entropy(parent) − weighted average entropy of the children after the split

  1. Calculate Gini impurity (depending on the type of variable)

Binary Categorical

  1. If there is no parent node, take the split with the lowest Gini impurity as the root node.
    If the parent node itself has the lowest impurity, do not split => the parent node becomes a leaf node.
    Otherwise, take the split with the lowest Gini impurity.

Continuous Numeric

1) Sort the values in the variable in ascending order​

2) Calculate average value for all adjacent values​

3) Calculate the Gini impurity for each average value (each '< average value' condition = 1 candidate split)

5) Take the 'average' with minimum Gini impurity and split the data on it
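
A rough sketch of this threshold search (helper names are mine, not from the source):

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def best_numeric_split(x, y):
    """Sort, average adjacent values, score each '< average' split by weighted Gini."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    best_thr, best_score = None, np.inf
    for i in range(len(x) - 1):
        thr = (x[i] + x[i + 1]) / 2.0                   # average of adjacent values
        left, right = y[x < thr], y[x >= thr]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_thr, best_score = thr, score
    return best_thr, best_score

print(best_numeric_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))  # (6.5, 0.0)
```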

Discrete Numeric

1) Calculate Gini impurity for all the discrete values except the largest value in the variable; each '< value' condition (e.g. < d1) = 1 candidate split

Multi-value Categorical

1) Calculate Gini impurity for all values and all possible combinations except (V1 or V2 … or Vn), where n = number of values in the variable; each condition 'V1 (or V2 ...)' = 1 candidate split

rf.feature_importances_ : impurity-based importance of each predictor
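
e.g. a sketch of reading the attribute after fitting a forest on placeholder data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; higher = used more for good splits
for idx in np.argsort(rf.feature_importances_)[::-1]:
    print(f"feature {idx}: {rf.feature_importances_[idx]:.3f}")
```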

Individual tree => High variance; Forest of trees => Low variance without High bias;

Bagging

Boosting

XGBoost (Extreme)

Create a threshold such that impurity reduction has to be large enough to make a big difference

2) Calculate Weighted average of leaf node impurities of the split​

  1. Repeat the steps 1-2 until there are only leaves left.​

1) Calculate Gini impurity for each variable; 1 variable=1 split

2) Calculate Weighted average of leaf node impurities of the split​

2) Calculate Weighted average of leaf node impurities of the split​


4) Calculate Weighted average of leaf node impurities of the split​


Bootstrapped dataset size = Original dataset size

  1. Create a decision tree using the bootstrapped dataset but only use a random subset of variables at each step of building the tree
  1. Repeat the step 1 with new bootstrapped data for the predefined number of trees
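
A sketch of the bootstrap-sampling step only (tree fitting elided; array shapes are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_trees = 100, 10
X = rng.normal(size=(n_samples, 5))          # placeholder training data

for _ in range(n_trees):
    # Bootstrapped dataset size = original dataset size, drawn WITH replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    X_boot = X[idx]
    # Out-of-bag rows for this tree = rows never drawn into its bootstrap sample
    oob_mask = ~np.isin(np.arange(n_samples), idx)
    # ...fit one tree on X_boot here, using a random subset of variables at each split
    print(X_boot.shape, int(oob_mask.sum()))
```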

Prediction

Regression

Average of the predictions of all the trees

Classification

Class of most votes of all the trees

Proportion of OOB samples that were incorrectly predicted/ classified
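
In sklearn this is exposed as oob_score_; a minimal sketch on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=1)
rf = RandomForestClassifier(n_estimators=200, bootstrap=True, oob_score=True,
                            random_state=1).fit(X, y)
print("OOB error:", 1 - rf.oob_score_)   # no separate validation set required
```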

Choose the number of variables to use for building the tree with cross validation.

Missing Values

Filling Methods

Most common value

Average (mean, median, mode) [numerical]

Find another column with highest correlation

Deduce based on it (categorical)

Do linear regression on two columns and use least squares line to predict the unknown value (numerical)​
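
A toy pandas sketch of the simpler filling methods above (mode / mean); the correlation- and regression-based methods are omitted:

```python
import pandas as pd

df = pd.DataFrame({
    "color":  ["blue", "green", None, "blue"],   # categorical
    "weight": [1.1, 2.3, 1.8, None],             # numerical
})

# Most common value (mode) for the categorical column
df["color"] = df["color"].fillna(df["color"].mode()[0])
# Mean (median/mode work the same way) for the numerical column
df["weight"] = df["weight"].fillna(df["weight"].mean())
print(df)
```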


RF for Clustering

Regression Trees

Algorithm

1 predictor

>1 predictors

Cost Complexity Pruning

Weakest Link Pruning

  1. Build the full sized regression tree using ALL data (test+training); ​
  1. Calculate Tree Score & Increase alpha until pruning leaves ​will give a lower Tree Score​; Initial alpha=0;
  1. Drop the lowest leaves and repeat 1-2 until ​only 1 root node is left as sub-tree​
  1. Divide the data into testing and training​
  1. Using only the training data, build the full tree and the sub-trees that minimize the Tree Score for the alpha values from steps 1-3
  1. Calculate RSS for each tree using only Testing data​
  1. Take note of the tree and alpha with minimum RSS​
  1. Create new Training & Testing data
  1. Repeat steps 5-8 for defined num of n-fold CV

Final alpha = alpha that gives LOWEST AVERAGE RSS with test data​

Final tree = tree whose alpha is final alpha among OG trees/sub-trees built in steps 1-3
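
sklearn wraps the same idea in ccp_alpha / cost_complexity_pruning_path; a sketch (cross-validation over candidate alphas, toy data), not the exact fold-by-fold procedure above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Candidate alphas come from the full tree's pruning path
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Keep the alpha whose pruned tree does best under cross-validation
scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                          X, y, cv=5, scoring="neg_mean_squared_error").mean()
          for a in path.ccp_alphas]
print("final alpha:", path.ccp_alphas[int(np.argmax(scores))])
```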


  1. Take (nth,n+1th) smallest observations of the predictor variable (n=1 initially)​
  1. Take their average & use that "average" as the node/threshold to split on, with the average response of observations < "average" as the left child and the average response of observations > "average" as the right child
  1. Calculate the residual sum of squares RSS = ∑(y_i − ŷ_i)² over all observations
  1. Plot RSS against the threshold values of the predictor variable ("average") ​
  1. Set n=n+1 ​
  1. Repeat Steps 1-5 until n==N (N=number of observations)​
  1. Take the threshold with minimum RSS and split the tree.
  1. Repeat from 1-7 until num of observations in each leaf < min number of observations required to split the node. ​
  1. For each variable, do the steps of 1-7 of single variable algorithm and take the node with min RSS as candidate for node(i)
  1. Compare RSS for each of the candidate and take the node with lowest RSS as the node(i) of tree.​
  1. Grow the tree doing steps 1-2 until num of observations​ in each node(i)/(leaf) <= min num of observations required to split​ the node
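
A sketch of the single-predictor threshold search by RSS described above (helper names are mine):

```python
import numpy as np

def rss(y):
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_rss_split(x, y):
    """Scan '< average of adjacent sorted observations' thresholds; keep the minimum-RSS one."""
    order = np.argsort(x)
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]
    best_thr, best_val = None, np.inf
    for n in range(len(x) - 1):
        thr = (x[n] + x[n + 1]) / 2.0
        total = rss(y[x < thr]) + rss(y[x >= thr])
        if total < best_val:
            best_thr, best_val = thr, total
    return best_thr, best_val

print(best_rss_split([1, 2, 3, 10, 11, 12], [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]))
```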


  1. Calculate RSS for each tree with different depths.
  1. Take the tree with minimum RSS.


2) When alpha = a(i+1) (e.g. 10,000), we will get a lower Tree Score if we prune the leaves and use the next sub-tree instead (i=1 for the first step)


1) When alpha=0, we build a full sized tree, since it will have the lowest Tree Score.​

3) Repeat step 2 until i==M (M=number of trees/sub-trees)​


Tree Score = RSS + alpha × T [Tree Complexity Penalty]

Alpha = Tuning parameter (cross validation)​

T = number of leaves


Classification

  1. Take average of all response values as the prediction of the first tree.​
  1. Calculate Pseudo Residual= Observation – Prediction
  1. Build a tree with defined number of leaves to predict the previous residuals from predictor variables.​
    Prediction/Output of each leaf = sum of residuals of all samples in the leaf/num of residuals in the leaf
  1. Prediction = 1st tree prediction + Σj=2->(i-1) (lr x jth tree prediction) + (lr x current tree prediction);
    where ith tree = current tree; lr=learning rate ​
  1. Repeat steps 2-4 until max number of trees or adding additional trees does not significantly reduce the residuals

Prediction = 1st tree prediction + Σj=2->M (lr x jth tree prediction)​
M=total number of trees; lr=learning rate
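
One way to run this procedure is sklearn's GradientBoostingRegressor; a sketch on placeholder data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(
    n_estimators=200,     # M = total number of trees
    learning_rate=0.1,    # lr in the prediction formula above
    max_leaf_nodes=8,     # "defined number of leaves" per tree
    random_state=0,
).fit(X_tr, y_tr)
print(gbr.score(X_te, y_te))   # R^2 on held-out data
```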

  1. Initial Prediction = ln(odds) for all the response values; odds=p/(1-p);
    Probability of event for each observation = e^prediction/(1+e^prediction); ​
  1. Calculate Residual = (Observation-Probability);
    If event happening, observation=1; if not, observation = 0. ​
  1. Build a tree with defined number of leaves to predict the residuals from the predictor variables.​
  1. Calculate probability of event for each observation;
  1. Repeat steps 2-5 until maximum number of trees or residuals get ​super small.

Prediction for each sample = 1st tree prediction + Σj=2->(i-1) (lr x jth tree prediction) + (lr x current tree prediction) ​;
where ith tree=current tree; lr = learning rate;​

Prediction of each leaf = Sum of residuals in the leaf / Sum of [previous probability × (1 − previous probability)] for the samples in the leaf

Prediction = 1st tree prediction + Σj=2->M (lr x jth tree prediction);
where M=total number of trees; lr=learning rate
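
Likewise for classification, a GradientBoostingClassifier sketch on placeholder data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbc = GradientBoostingClassifier(n_estimators=150, learning_rate=0.1,
                                 max_depth=3, random_state=0).fit(X_tr, y_tr)
# predict_proba applies e^prediction / (1 + e^prediction) to the summed log-odds
print(gbc.predict_proba(X_te[:3]))
```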


Regression

Classification


  1. Initial prediction = 0.5. Calculate Residual= Actual-Prediction for each observation.​
  1. XGBoost Tree for Regression

1) Each tree starts out as single leaf w. all the residuals​

2) Calculate Quality/Similarity Score for parent root node​;

3) Take (nth,n+1th) smallest observations of the predictor variable (n=1 initially) of the parent node to be split

4) Take the average & use that "average" as a condition (<"average") for the node to split, with the residuals of observations < "average" as the left child and the residuals of observations > "average" as the right child

5) Calculate SS for each leaf ​

6) Calculate Gain of tree.(w. split <"average")

7) n=n+1​

8) Repeat 3)-7) until n==N where N=total number of observations in the parent node to be split​

9) Use threshold (<"average") that gives the largest GAIN as the branch in the tree​

10) Repeat 3)-9) until it reaches max depth of tree (default=6 levels) OR min num of samples to be split [Cover]

  1. Calculate prediction for each sample.

Output/prediction of each leaf = Sum of residuals of all samples in leaf/(num of residuals+λ)

Prediction for each sample= 1st tree prediction + Σj=2->(i-1)​ (lr x jth tree prediction) + (lr x current tree prediction)​
where ith tree=current tree; lr = learning rate (default=0.3)

  1. Repeat #1-#3 until max num of trees OR residuals are super small.
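
A hedged sketch of the same knobs through the xgboost library (values shown are the library defaults, which differ slightly from these notes, e.g. reg_lambda defaults to 1):

```python
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

xgbr = XGBRegressor(
    n_estimators=100,
    learning_rate=0.3,     # lr (eta), default 0.3
    max_depth=6,           # default max depth of each tree
    reg_lambda=1.0,        # λ in the Similarity Score denominator (library default 1)
    gamma=0.0,             # γ: minimum Gain required to keep a split
    min_child_weight=1.0,  # minimum Cover required in a leaf
    base_score=0.5,        # initial prediction = 0.5
)
xgbr.fit(X, y)
print(xgbr.predict(X[:3]))
```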

Faster compared to other GBMs ⭐

AdaBoost


LightGBM

What

Nth model errors REDUCED by (N+1)th model

Fits model sequentially until no further improvements can be made


Similarity Score = (Sum of all residuals in the leaf)² / (Number of residuals + λ),
where λ = regularization parameter, default = 0

Gain=SS left child + SS right child – SS root
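
A small sketch of these two formulas with made-up residuals:

```python
import numpy as np

def similarity_score(residuals, lam=0.0):
    """(sum of residuals)^2 / (number of residuals + λ)."""
    r = np.asarray(residuals, dtype=float)
    return (r.sum() ** 2) / (len(r) + lam)

def gain(left, right, lam=0.0):
    """SS(left child) + SS(right child) - SS(parent)."""
    parent = np.concatenate([np.asarray(left, float), np.asarray(right, float)])
    return (similarity_score(left, lam) + similarity_score(right, lam)
            - similarity_score(parent, lam))

print(gain([-10.5, -7.5], [6.5, 7.5]))   # 256.0: the split separates the residuals well
```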


Tree Pruning

Tree Complexity

  1. Pick a number for γ (Tree complexity parameter)​

⏫ λ => ⏬ SS => ⏬ Gain => Gain more likely to fall below γ => ⏫ Pruning


If Gain of an ancestor node < γ but Gain of a descendant node > γ, keep the ancestor node (do not prune it)


  1. If gain of the parent node > γ, keep that parent node. If not, remove/ prune that parent node/ that split

Cover

Set min_child_weight (the Cover threshold) = 0 for Classification

What?

Cover = the denominator of the Similarity Score minus λ

2nd derivative of loss function

min_child_weight​

default=1(✅ Regression)

Min num of Residuals in each leaf to split


  1. Initial probability = 0.5; Initial prediction = ln(p/(1-p)) = 0;
    Calculate Residual = Actual - Probability; if the event happens, Actual = 1;
  1. XGBoost Tree for Classification

1) Each tree starts out as single leaf w. all the residuals​

2) Calculate Quality/Similarity Score for the root node: SS = (Sum of residuals in the node)² / (Number of residuals + λ), where λ = regularization parameter, default = 0

3) Take (nth,n+1th) largest observations of the predictor variable (n=1 initially) of the parent node to be split ​

4) Take the average of them & use that "average" as a condition (<"average") for the node to split, with the residuals of observations < "average" as the left child and the residuals of observations > "average" as the right child

5) Calculate SS for each leaf ​​

6) Calculate Gain of tree (w. split "<average")

7) n=n+1​​

8) Repeat 3)-7) until n==N where N=total number of observations in the parent node to be split​​

​9) Use the threshold (<"average") that gives the largest GAIN as the branch in the tree​

10) Repeat 3)-9) until it reaches max depth of tree (default=6 levels) OR min num of samples to be split [Cover]

Gain=SS left child + SS right child – SS root​​

  1. Calculate prediction for each sample

1) Calculate output/prediction of each leaf in the tree: leaf output = Sum of residuals in the leaf / (Sum of [previous probability × (1 − previous probability)] for the samples in the leaf + λ)


2)Prediction for each sample= 1st tree prediction + Σj=2->(i-1) (lr x jth tree prediction) + (lr x current tree prediction);
where ith tree=current tree; lr = learning rate (default=0.3);

  1. Repeat #1-#3 until max num of trees OR residuals are super small.

3) Probability = e^(prediction)/(1+e^(prediction))
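
A corresponding XGBClassifier sketch (toy data; min_child_weight lowered as noted under Cover):

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, random_state=0)

xgbc = XGBClassifier(
    n_estimators=100,
    learning_rate=0.3,
    max_depth=6,
    reg_lambda=1.0,      # λ (library default 1)
    gamma=0.0,           # γ pruning threshold
    min_child_weight=0,  # Cover threshold, often lowered for classification (see Cover above)
)
xgbc.fit(X, y)
# predict_proba converts the summed log-odds with e^prediction / (1 + e^prediction)
print(xgbc.predict_proba(X[:3]))
```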


Gini impurity = 1 − Gini = 1 − ∑(i=1->n) p_i²,
where n = number of classes


Because Gini impurity of the split on Class < the split on Performance => split on Class ✅


Probability = e^(pred)/(1+e^(pred))

Information gain of decision split



e.g. information gain = 0.97095 − (12/20 × 0.65002) − (8/20 × 1) ≈ 0.181
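
The same arithmetic as a tiny helper (numbers plugged in from the example above):

```python
def information_gain(h_parent, splits):
    """Entropy(parent) minus the size-weighted entropies of the children.
    splits = [(n_samples_in_child, entropy_of_child), ...]"""
    n_total = sum(n for n, _ in splits)
    return h_parent - sum(n / n_total * h for n, h in splits)

print(information_gain(0.97095, [(12, 0.65002), (8, 1.0)]))   # ≈ 0.181
```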

e.g. for leftmost leaf, output = -0.5 / (0.5 * 0.5) = -2