Tree-based Learning, Weakest Link Pruning, Boosting pt.II, Bootstrap…
Tree-based Learning
Tree-Pruning
Cost Complexity Pruning
- Calculate RSS for trees grown to different depths.
- Take the tree with the minimum RSS.
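The comparison above can be sketched in a few lines. This is a hypothetical illustration, not fitted trees: the depth-to-predictions mapping stands in for trees that were already grown at different depths.

```python
# Hypothetical illustration: pick the tree depth whose predictions give the
# lowest residual sum of squares (RSS). The predictions below are made-up
# stand-ins for trees already fit at depths 1, 2, and 3.
y = [3.0, 2.5, 4.0, 5.5, 5.0]

def rss(y_true, y_pred):
    """Residual sum of squares: sum((y_i - yhat_i)^2)."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred))

candidates = {
    1: [3.0, 3.0, 3.0, 5.25, 5.25],    # depth-1 stump
    2: [2.75, 2.75, 4.0, 5.25, 5.25],  # depth-2 tree
    3: [3.0, 2.5, 4.0, 5.5, 5.0],      # depth-3 tree (fits exactly)
}

best_depth = min(candidates, key=lambda d: rss(y, candidates[d]))
print(best_depth, rss(y, candidates[best_depth]))
```

Note that minimum training RSS always favours the deepest tree; the Weakest Link Pruning section below is what penalizes that.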
Classification Trees
Algorithm
1) Divide the predictor space X1,..., Xp into J distinct and non-overlapping regions R1,..., RJ
2) For every observation that falls into region Rj, prediction = the most commonly occurring class among the training observations in Rj
- Calculate Gini impurity for each candidate split (how depends on the type of variable)
- If there is no parent node yet, take the split with the lowest Gini impurity as the root node
- If the parent node itself has the lowest impurity, do not split => the parent node becomes a leaf node
- If not, take the split with the lowest Gini impurity
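A minimal sketch of that comparison, with hypothetical labels (not Coggle's or StatQuest's exact code): compute the size-weighted Gini impurity of each candidate split and compare it against the unsplit parent.

```python
def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class fractions p_i."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(left, right):
    """Impurity of a split = size-weighted average of the child impurities."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = ["yes", "yes", "no", "no", "no"]
# Two hypothetical candidate splits of the same five samples:
split_a = (["yes", "yes"], ["no", "no", "no"])   # separates perfectly
split_b = (["yes", "no"], ["yes", "no", "no"])   # barely helps

print(gini(parent))             # ~0.48
print(weighted_gini(*split_a))  # 0.0 -> chosen as the split
print(weighted_gini(*split_b))  # ~0.467, worse than split_a
```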
Discrete Numeric
1) Calculate Gini impurity for all the discrete values except the largest value of the variable; each threshold (e.g. < d1) defines 1 candidate split
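Enumerating those candidate thresholds can be sketched as follows (the data is hypothetical; here "x <= d" sends a sample left):

```python
# Candidate thresholds for a discrete numeric feature: every sorted unique
# value except the largest; each one defines a single "x <= d" split.
values = [1, 2, 2, 4, 7]
thresholds = sorted(set(values))[:-1]   # drop the largest -> [1, 2, 4]

for d in thresholds:
    left  = [x for x in values if x <= d]
    right = [x for x in values if x > d]
    print(d, left, right)
```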
Multi-value Categorical
1) Calculate Gini impurity for all single values and all possible combinations of values except the full set (V1 or V2 ... or Vn), where n = number of values of the variable; each value or combination (V1, or V1 or V2, ...) defines 1 candidate split
- Repeat the steps above until only leaf nodes are left.
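The subset enumeration for a categorical variable can be sketched like this (hypothetical values; samples whose value is in the subset go left, the rest go right):

```python
from itertools import combinations

# Candidate splits for a multi-value categorical variable: every non-empty
# subset of its values except the full set (which would send everything to
# one side).
values = ["red", "green", "blue"]

candidates = [
    set(combo)
    for r in range(1, len(values))      # subset sizes 1 .. n-1, never the full set
    for combo in combinations(values, r)
]
for subset in candidates:
    print(subset)
print(len(candidates))                   # 2^3 - 2 = 6 candidate subsets
```

Note that a subset and its complement describe the same split (e.g. {red} vs {green, blue}), so half of these are duplicates in practice.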
Splitting Metrics
Classification
Gini Impurity
the probability that a randomly chosen sample in a node would be incorrectly labeled if it were labeled randomly according to the distribution of class labels in the node; Gini = 1 - ∑(pi^2), where pi = fraction of samples of class i
Regression
Variance
∑((X-X¯)^2)/n where X¯=mean, n=number of values
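The variance formula above, as a one-function sketch:

```python
# Variance of a node's response values, matching the formula above:
# sum((x - mean)^2) / n.
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(variance([2.0, 4.0, 6.0]))  # (4 + 0 + 4) / 3
```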
Regression Trees
Algorithm
1 predictor
- Take the (nth, n+1th) smallest observations of the predictor variable (n=1 initially)
- Take their average & use that "average" as a candidate node to split the tree: the mean response of the observations on the left (observations < "average") becomes the left child, and the mean response of the observations on the right (observations >= "average") becomes the right child
- Calculate the residual sum of squares RSS = ∑(yi - ŷi)^2 over all observations
- Plot RSS against the threshold values of the predictor variable ("average")
- Repeat the steps above until every adjacent pair has been used, i.e. n == N-1 (N = number of observations)
- Take the threshold with minimum RSS and split the tree there.
- Repeat from the start until the number of observations in each leaf < the minimum number of observations required to split a node.
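The single-predictor threshold search above can be sketched as follows, with hypothetical data:

```python
# Try the average of every adjacent pair of sorted x values as a threshold
# and keep the one with the lowest RSS; each side predicts its mean response.
xs = [1.0, 2.0, 3.0, 7.0, 8.0]
ys = [1.1, 0.9, 1.0, 5.0, 5.2]

def rss_for_threshold(t):
    left  = [y for x, y in zip(xs, ys) if x < t]
    right = [y for x, y in zip(xs, ys) if x >= t]
    total = 0.0
    for side in (left, right):
        if side:
            mean = sum(side) / len(side)   # leaf prediction = mean response
            total += sum((y - mean) ** 2 for y in side)
    return total

srt = sorted(xs)
thresholds = [(a + b) / 2 for a, b in zip(srt, srt[1:])]  # [1.5, 2.5, 5.0, 7.5]

best = min(thresholds, key=rss_for_threshold)
print(best, rss_for_threshold(best))  # 5.0 separates the two clusters
```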
>1 predictors
- For each variable, do the steps of the single-variable algorithm and take its best threshold (min RSS) as a candidate for node(i)
- Compare the RSS of each candidate and take the one with the lowest RSS as node(i) of the tree.
- Grow the tree by repeating the two steps above until the number of observations in each node(i)/(leaf) <= the minimum number of observations required to split a node
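A sketch of the multi-predictor case, reusing the single-predictor search per feature (data and feature names are hypothetical):

```python
# For each feature, find its best threshold by RSS, then keep the
# (feature, threshold) pair with the lowest RSS overall.
rows = [
    {"x1": 1.0, "x2": 5.0, "y": 1.0},
    {"x1": 2.0, "x2": 9.0, "y": 1.2},
    {"x1": 7.0, "x2": 6.0, "y": 5.0},
    {"x1": 8.0, "x2": 2.0, "y": 5.3},
]

def best_split(feature):
    xs = sorted(r[feature] for r in rows)
    thresholds = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    def rss(t):
        total = 0.0
        for keep in (lambda r: r[feature] < t, lambda r: r[feature] >= t):
            side = [r["y"] for r in rows if keep(r)]
            if side:
                mean = sum(side) / len(side)
                total += sum((y - mean) ** 2 for y in side)
        return total
    t = min(thresholds, key=rss)
    return rss(t), t

candidates = {f: best_split(f) for f in ("x1", "x2")}
feature = min(candidates, key=lambda f: candidates[f][0])
print(feature, candidates[feature])  # x1 separates the responses best
```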
Weakest Link Pruning
- Build the full-sized regression tree using ALL the data (training + testing);
- Calculate the Tree Score = RSS + alpha * |T| (|T| = number of leaves) & increase alpha until pruning leaves gives a lower Tree Score; initial alpha = 0;
- Drop the lowest leaves and repeat the two steps above until only 1 root node is left as a sub-tree
- Divide the data into testing and training
- Using the training data, find the alpha values for which the full tree and each sub-tree minimize the Tree Score. (Example)
1) When alpha=0, we build a full-sized tree, since it will have the lowest Tree Score.
2) When alpha=a(i+1) [e.g. 10,000], we get a lower Tree Score if we prune the leaves and use the next sub-tree instead. (i=1 for the first step)
- Calculate RSS for each tree using only the testing data
- Take note of the tree and alpha with minimum RSS
- Create new training & testing data
- Repeat the steps above for the defined number of folds of n-fold CV; the final alpha is the one that, on average, gives the lowest RSS on the testing data
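The Tree Score comparison at the heart of weakest link pruning can be sketched as follows. Each candidate sub-tree is summarised by its RSS and its number of leaves |T|; the RSS and leaf counts below are hypothetical.

```python
# Tree Score = RSS + alpha * |T|. As alpha grows, smaller sub-trees win.
subtrees = {
    "full":   {"rss": 100.0,  "leaves": 8},
    "pruned": {"rss": 300.0,  "leaves": 4},
    "stump":  {"rss": 1200.0, "leaves": 2},
}

def tree_score(t, alpha):
    return t["rss"] + alpha * t["leaves"]

for alpha in (0, 100, 1000):
    best = min(subtrees, key=lambda name: tree_score(subtrees[name], alpha))
    print(alpha, best)  # alpha=0 -> full, alpha=100 -> pruned, alpha=1000 -> stump
```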
Missing Values
Filling Methods
Average (mean, median, mode) [numerical]
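A minimal sketch of mean imputation (hypothetical column, with missing values marked as None):

```python
# Replace each missing entry with the mean of the observed values.
column = [2.0, None, 4.0, None, 6.0]

observed = [x for x in column if x is not None]
mean = sum(observed) / len(observed)            # (2 + 4 + 6) / 3 = 4.0
filled = [mean if x is None else x for x in column]
print(filled)  # [2.0, 4.0, 4.0, 4.0, 6.0]
```

The same pattern works with the median or mode by swapping in the appropriate statistic.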