Step 4 (Part1): Trees
Classification Tree
ADVANTAGES
- Quick to run and intuitive
- Easy to recognize break points in continuous variables and nonlinear interactions between variables
DISADVANTAGES
- Unstable
- Prone to overfitting
- Tend to have high variance
TREE CODE
- rpart(formula, data, method, control, parms)
formula = as.factor(target) ~ var1 + var2 + var3 ...
formula all vars = as.factor(target) ~ .
method = "class", "poisson", "exp", "anova"
control = rpart.control(minsplit, minbucket, cp, maxcompete, maxsurrogate, xval, surrogatestyle, maxdepth)
parms = list(split = "gini") or list(split = "information"); the default is "gini"
TREE CONTROL SUBCODE
- rpart.control(minsplit, minbucket, cp, maxcompete, maxsurrogate, usesurrogate, xval, surrogatestyle, maxdepth)
minsplit = minimum # of observations that must exist in a node for a split to be attempted
minbucket = minimum # of observations in a leaf node [around 5]
cp = complexity parameter; a split that does not decrease the overall lack of fit by a factor of cp is not attempted [around 0.001]
maxcompete = # of competitor splits retained in the output (checks alternative splits)
maxsurrogate = # of surrogate splits retained in the output (used when the split variable is NA)
xval = # of cross-validations
maxdepth = maximum depth of any node in the final tree, with the root node counted as depth 0; controls complexity [6 or less]
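- Minimal sketch putting these arguments together; the data frame data.train and binary target value_flag are assumed names, reused from the cross-validation notes below:
library(rpart)
# classification tree on all other columns, with explicit control settings
tree <- rpart(as.factor(value_flag) ~ .,
              data    = data.train,
              method  = "class",
              parms   = list(split = "gini"),
              control = rpart.control(minsplit = 20, minbucket = 5,
                                      cp = 0.001, xval = 10, maxdepth = 6))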
PRUNE CODE
cp.min <- tree$cptable[which.min(tree$cptable[, "xerror"]), "cp"] (cp value with the lowest cross-validated error)
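- Sketch of applying it to the tree fitted above:
tree.pruned <- prune(tree, cp = cp.min)   # prune back to the selected complexity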
PLOTS
- rpart.plot : plots the fitted tree (rpart.plot package)
- plotcp : plots cross-validated error against cp
- printcp : prints the cp table (cp, nsplit, rel error, xerror, xstd)
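- Usage sketch, assuming the tree object fitted above:
library(rpart.plot)
rpart.plot(tree)   # draw the fitted tree
plotcp(tree)       # cross-validated error vs cp (helps choose the pruning point)
printcp(tree)      # print the cp table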
Cross Validation
Code Cross-Validation
- train(x, y, method = "rpart", metric, trControl, tuneGrid, na.action, ...)
x = data.train[, varlist] where varlist = c("V1", "V2", "V3", ...) (data frame of predictor variables)
y = factor(data.train$value_flag) (vector containing the outcome for every sample)
method = "rpart" for a tree (the default is "rf", a random forest)
metric = "RMSE" or "Rsquared" for regression, "Accuracy" for classification
trControl = trainControl(...)
tuneGrid = expand.grid(cp = seq(0, 0.01, 0.0005)) (data frame with tuning values)
na.action = na.omit (omits observations with missing values)
parms = list(split = "gini") (passed on to rpart)
- trainControl(method = "cv", number, repeats, sampling)
method = "boot", "cv", or "repeatedcv" (repeated cross-validation)
number = # of folds (normally 6)
repeats = # of repetitions of the k-fold cross-validation (only used with "repeatedcv")
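- Sketch of the cp tuning described above, with varlist and value_flag as assumed names:
library(caret)
set.seed(1234)                                    # make the folds reproducible
ctrl <- trainControl(method = "cv", number = 6)   # 6-fold cross-validation
grid <- expand.grid(cp = seq(0, 0.01, 0.0005))    # candidate cp values
tree.trained <- train(x = data.train[, varlist],  # predictor columns
                      y = factor(data.train$value_flag),
                      method    = "rpart",
                      metric    = "Accuracy",
                      trControl = ctrl,
                      tuneGrid  = grid,
                      parms     = list(split = "gini"))  # passed on to rpart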
DEFINITION CV
- Divides data into folds
- Trains the model on all but one of the folds
- Measures performance on remaining fold
- Process repeated to develop distribution of performance values for given CP value
- CP value yielding best accuracy is selected
Output of TRAIN
- train$results : data frame with each tuned parameter value (cp = 0, 0.001, 0.002, etc.) and the accuracy of each model
- train$finalModel : fit object using the best parameters
- train$bestTune : cp value with the best (highest) accuracy
- plot(trained.tree) : accuracy for each cp value
- rpart.plot(trained.tree$finalModel) : tree plot for the best model, with its associated cp
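- Usage sketch, assuming the tree.trained object from the caret sketch above:
library(rpart.plot)
tree.trained$results                  # accuracy for every cp in the grid
tree.trained$bestTune                 # cp with the highest accuracy
plot(tree.trained)                    # accuracy vs cp
rpart.plot(tree.trained$finalModel)   # tree refit with the best cp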
Random Forest
ADVANTAGES
- Higher potential for predictive accuracy due to different starting points (many trees)
- Detects nonlinear interactions between predictors when finding variable importance
- Less prone to overfitting than GBMs
- Automatically incorporates variable interactions
CODE
- randomForest(formula, data, ntree, mtry, sampsize, nodesize, importance = FALSE)
ntree = # of trees; the more, the better (around 500)
mtry = # of variables randomly sampled as candidates at each split
sampsize = # of observations sampled for each tree
nodesize = minimum size of terminal nodes; controls complexity (around 5)
importance = should the importance of predictors be assessed?
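- Minimal sketch with the same assumed data.train / value_flag names as above:
library(randomForest)
set.seed(1234)
rf <- randomForest(as.factor(value_flag) ~ .,
                   data       = data.train,
                   ntree      = 500,   # number of trees
                   mtry       = 3,     # variables tried at each split
                   nodesize   = 5,     # minimum terminal-node size
                   importance = TRUE)  # compute variable importance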
Cross Validation
- train(x, y, method = "rf", trControl, tuneGrid, ntree, nodesize, importance)
x = data.train[, varlist] where varlist = c("V1", "V2", "V3", ...) (data frame of predictor variables)
y = factor(data.train$value_flag) (vector containing the outcome for every sample)
trControl = trainControl(method = "cv", number = 5)
tuneGrid = expand.grid(mtry = 1:3) (mtry is the only parameter caret tunes for "rf")
ntree = 100 (default = 500; passed on to randomForest)
nodesize = 5 (passed on to randomForest)
importance = TRUE
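- Sketch of this tuning run, with the same assumed names:
library(caret)
set.seed(1234)
rf.trained <- train(x = data.train[, varlist],
                    y = factor(data.train$value_flag),
                    method     = "rf",
                    trControl  = trainControl(method = "cv", number = 5),
                    tuneGrid   = expand.grid(mtry = 1:3),
                    ntree      = 100,   # passed on to randomForest
                    nodesize   = 5,     # passed on to randomForest
                    importance = TRUE)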
OUTPUT Trained Random Forest
- varImp(rf.trained) (raw matrix in rf.trained$finalModel$importance) : analyzes the structure of the model and ranks the contribution of each feature; features used for more frequent splits, and earlier in the trees, are deemed more important.
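- Usage sketch, assuming the rf.trained object from the caret sketch above:
varImp(rf.trained)         # scaled importance of each predictor
plot(varImp(rf.trained))   # dot plot of the ranking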
Definition Random Forest
- Uses bagging to produce many trees
- Only randomly selected subset of variables is considered when a split is being considered
- Each tree built from a bootstrap sample of the data
- Reduces overfitting
Boosted Tree
Definition Boosted Tree
- Boosted trees are built sequentially
- First tree is built the usual way; the next tree is fit to the residuals (errors) and added to the model
- Allows second tree to focus on observations on which first tree performed poorly
- Cross-validation used to determine when to stop adding additional trees (to prevent overfitting)
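- Minimal sketch of this sequential fitting, using the gbm package as an assumed implementation (same assumed data, with value_flag coded 0/1 for "bernoulli"):
library(gbm)
set.seed(1234)
boost <- gbm(value_flag ~ .,
             data              = data.train,
             distribution      = "bernoulli",
             n.trees           = 1000,   # trees added sequentially
             interaction.depth = 3,      # depth of each small tree
             shrinkage         = 0.01,   # learning rate
             cv.folds          = 5)      # cross-validation to pick the stopping point
best.iter <- gbm.perf(boost, method = "cv")   # number of trees with the lowest CV error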
ADVANTAGES
- Reduces model bias
- Detects nonlinear interactions between predictors
DISADVANTAGES
- More prone to overfitting versus random forests
- More sensitive to hyperparameters