Decision Trees
About
Why use it
- highly intuitive and interpretable
- mimics the human decision-making process
- no need for normalisation, since each split only compares a feature value against a threshold
Why not to use it
- tends to overfit, i.e. has high variance
- unstable because of that high variance: small changes in the training data can produce a very different tree
Regression with DTs
- for suitable problems, DTs can also be used for regression, e.g. predicting weight from age
- here the DT first partitions the dataset by age, and then a regression model is fit on each subset to predict weight; i.e. instead of class labels, the leaves hold regression models (see the sketch below)
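A minimal sketch of that idea, assuming scikit-learn and synthetic age/weight data (all names and numbers here are illustrative): a shallow `DecisionTreeRegressor` finds the age-based partition, then a separate `LinearRegression` is fit on each leaf's subset in place of a constant prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
age = rng.uniform(1, 60, 300).reshape(-1, 1)
weight = np.where(age < 18, 3 * age + 5, 0.2 * age + 55).ravel()
weight = weight + rng.normal(0, 2, 300)

# Step 1: a shallow tree partitions the data by age.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(age, weight)
leaves = tree.apply(age)

# Step 2: fit one linear model per leaf instead of storing a constant label.
leaf_models = {
    leaf: LinearRegression().fit(age[leaves == leaf], weight[leaves == leaf])
    for leaf in np.unique(leaves)
}

def predict(X):
    """Route each sample to its leaf, then apply that leaf's regression model."""
    return np.array([
        leaf_models[leaf].predict(X[i : i + 1])[0]
        for i, leaf in enumerate(tree.apply(X))
    ])

print(predict(np.array([[10.0], [40.0]])))
```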
What?
- Decision Trees are used to model non-linear data, but they do not attempt to fit a single functional relationship the way algorithms such as SVM or Logistic Regression do
- every internal node represents a test on an attribute; each leaf node represents a decision (class label)
- convention is to follow the left branch if the test passes, otherwise the right
- the path from the root to a leaf corresponds to a conjunction of the tests at each node along the path (see the sketch below)
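A minimal sketch of this structure using scikit-learn on the iris data: `export_text` prints each internal node's attribute test and each leaf's class label, so every printed root-to-leaf path is exactly the conjunction of tests described above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Internal nodes show attribute tests; leaves show the decided class.
print(export_text(clf, feature_names=list(iris.feature_names)))
```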
Homogeneity
- the idea is to split the data so that the resulting partitions are more homogeneous than the parent node
Gini Index
- uses class probabilities to find the attribute whose split most increases homogeneity
- the dataset is split on each candidate attribute, and the resulting Gini index of each split is compared to find the best attribute
- ranges from 0 to 1; Gini index = 1 means a completely homogeneous (pure) node
- formula: Gini = Σᵢ pᵢ², where pᵢ is the proportion of class i in the node; for a candidate split, the Gini index of each resulting partition is computed and combined as a size-weighted sum (see the sketch below)
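A small sketch of the Gini computation as defined above (sum of squared class proportions, 1 = completely homogeneous). Note that scikit-learn's `'gini'` criterion uses the complementary impurity 1 − Σ pᵢ², where 0 means pure; the ranking of splits is the same.

```python
import numpy as np

def gini_index(labels):
    """Sum of squared class proportions: 1.0 for a pure node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p ** 2)

def split_gini(left, right):
    """Size-weighted Gini index of the two partitions of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini_index(left) + len(right) / n * gini_index(right)

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(gini_index(parent))                 # 0.5 (maximally mixed, two classes)
print(split_gini(parent[:4], parent[4:])) # 1.0 (both partitions pure)
```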
Information Gain/Entropy
Entropy
- measures the disorder in a dataset; a good attribute splits the dataset so as to reduce this disorder
- formula: E = −Σᵢ pᵢ · log₂(pᵢ), where i ranges over the class labels and pᵢ is the proportion of class i
Information Gain
- in practice this measure is used; it is the reduction in the entropy of the dataset achieved by a split on attribute A (i.e. the gain in homogeneity)
- the idea is to find the A that minimises E(A), the size-weighted entropy after the split; therefore IG = E − E(A), and the gain is always non-negative (see the sketch below)
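A minimal sketch of entropy and information gain under the definitions above: IG(A) = E − E(A), where E(A) is the size-weighted entropy of the partitions produced by splitting on A.

```python
import numpy as np

def entropy(labels):
    """E = -sum(p_i * log2(p_i)) over the class labels present."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, partitions):
    """Parent entropy minus the size-weighted entropy after the split."""
    n = len(parent)
    weighted = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# A perfect split: each partition is pure, so IG equals the parent entropy (1 bit).
print(information_gain(parent, [parent[:4], parent[4:]]))  # 1.0
```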
R-Square
- this measure can be used when the target variable is continuous (regression trees)
- it serves the same purpose: split on the attribute A that gives the largest gain in R-squared (higher R² means more homogeneous partitions)
- in other words, the fit of the model should be as 'good' as possible after splitting (see the sketch below)
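A hedged sketch of choosing a split threshold by post-split R² for a continuous target, under the assumption that each side of the split is predicted by its own mean (as a regression tree leaf would). The data and function names are illustrative, not from any particular library.

```python
import numpy as np

def r2_after_split(x, y, threshold):
    """R^2 when each side of the split is predicted by its own mean."""
    left, right = y[x <= threshold], y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return -np.inf
    pred = np.where(x <= threshold, left.mean(), right.mean())
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(0)
x = rng.uniform(0, 60, 200)
y = np.where(x < 18, 10.0, 70.0) + rng.normal(0, 3, 200)

# Pick the threshold whose post-split fit is best.
best = max(np.unique(x), key=lambda t: r2_after_split(x, y, t))
print(best, r2_after_split(x, y, best))  # threshold near 18, R^2 close to 1
```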
Reduce overfitting
Truncation
- performed while the DT is being grown
- the idea is to stop splitting before the tree grows fully
- common stopping criteria: maximum depth, minimum size of the partitions at a node, maximum number of leaves, and a minimum required change in the homogeneity measure (see the sketch below)
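A minimal sketch of truncation via scikit-learn's pre-pruning hyperparameters, which map directly onto the stopping criteria listed above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(
    max_depth=3,                 # stop at a fixed depth
    min_samples_split=20,        # minimum size of a partition to split further
    max_leaf_nodes=8,            # cap on the number of leaves
    min_impurity_decrease=0.01,  # required change in the homogeneity measure
    random_state=0,
).fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())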
Pruning
- in practice, pruning is used more than truncation
- pruning is performed after the tree is fully grown, using measures such as depth, size of the partitions at internal nodes and leaves, number of leaves, the maximum number of features considered when splitting, and even Gini/IG
- based on this measure, part of the tree is deleted; the resulting leaf is assigned the most probable class label, and in the regression case an average of the targets can be used (see the sketch below)
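A sketch of post-pruning via scikit-learn's cost-complexity pruning: grow the tree fully, compute the pruning path, then choose `ccp_alpha` by held-out accuracy. Collapsed subtrees automatically become leaves predicting the majority class. The dataset and split are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Candidate alphas come from the pruning path of the fully grown tree.
path = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)\
    .cost_complexity_pruning_path(X_tr, y_tr)

# Pick the alpha whose pruned tree scores best on held-out data.
best = max(
    path.ccp_alphas,
    key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=0)
    .fit(X_tr, y_tr)
    .score(X_val, y_val),
)
pruned = DecisionTreeClassifier(ccp_alpha=best, random_state=0).fit(X_tr, y_tr)
print(pruned.get_n_leaves(), pruned.score(X_val, y_val))
```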