Chapter 5
overfitting
Chance occurrences that look like patterns
Table model
Memorizes the training data; no generalization
In practice, defaults to predicting no churn for unseen cases
Generalization
model applies to data not used in model creation
Tailoring models too closely to the training data
If you look hard enough, you will find patterns, even chance ones
Overfitting examined
fitting graph
plots model accuracy as a function of model complexity
Shows accuracy on both training and holdout data as complexity changes
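The fitting graph can be reproduced numerically with the chapter's table model. A minimal sketch with hypothetical data: labels are pure noise, "complexity" is how many training rows the table memorizes, so training accuracy climbs toward 1 while holdout accuracy stays near the base rate.

```python
import random

rng = random.Random(0)
# Hypothetical data: (unique_customer_id, churn_label); labels are
# random noise, so there is no real pattern to generalize from.
data = [(i, rng.randint(0, 1)) for i in range(200)]
train, holdout = data[:100], data[100:]

def table_model_accuracy(k):
    """Table model of 'complexity' k: memorize the first k training
    rows; predict the majority class (0 = no churn) for everything else."""
    table = dict(train[:k])
    def predict(x):
        return table.get(x, 0)
    train_acc = sum(predict(x) == y for x, y in train) / len(train)
    holdout_acc = sum(predict(x) == y for x, y in holdout) / len(holdout)
    return train_acc, holdout_acc

for k in (0, 50, 100):
    print(k, table_model_accuracy(k))
```

As k grows, the gap between the two curves is exactly the overfitting the fitting graph is meant to expose.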
Base rate
Holdout data
Target variable values are known, but the data is not used in model creation
Acts as lab test
Also known as the test set
Holdout Evaluation
Holdout data gives an estimate of generalization performance
but it is a single estimate from just one split
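A minimal sketch of the holdout idea (function name and split fraction are hypothetical): shuffle, set a fraction aside, and never show those rows to the learner.

```python
import random

def holdout_split(rows, holdout_frac=0.3, seed=0):
    """Shuffle and set aside a fraction of the data. The held-out rows'
    target values are known, but the learner never sees them, so they
    act as the 'lab test' for the finished model."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]

train, holdout = holdout_split(list(range(10)))
```

Because the result depends on which rows happen to land in the holdout set, the accuracy it yields is only a single estimate, which motivates cross-validation below.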
Cross validation
multiple splits and sample swapping for testing
splits data into folds
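The fold mechanics can be sketched in a few lines (helper name is hypothetical): each of the k folds serves once as the test set while the remaining folds train the model, giving k accuracy estimates instead of one.

```python
def k_fold_splits(n, k):
    """Split n example indices into k folds and yield
    (train_indices, test_indices) pairs, one per fold."""
    folds = [list(range(n))[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Training indices are everything outside the current test fold.
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i
                     for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_splits(10, k=5))
```

Averaging the per-fold scores gives a more stable generalization estimate than a single holdout split.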
Avoiding overfitting
tree induction
Problem: the tree keeps growing until leaves are pure
Solution
limit tree size
require a minimum number of instances per leaf
Grow the tree large, then prune it
If replacing a branch with a leaf does not reduce accuracy, prune it
General
nothing is special about the first training/holdout split
split the training data again into training and testing portions
Testing
Validation set
Training
Subtraining set
Nested holdout testing
Sequential forward selection
nested holdout procedure
picks the best single feature
then the best pair, and so on
stops when adding a feature no longer improves accuracy
Sequential backward selection
works like SFS but in reverse: start with all features and remove one at a time
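A minimal SFS sketch, assuming a `score` callback that evaluates a feature subset on validation data (the toy scorer below is hypothetical: only "a" and "b" are informative, and each extra feature pays a small complexity penalty); SBS would be the same loop starting from the full set and removing features.

```python
def sequential_forward_selection(features, score):
    """Greedily add the feature that most improves the validation
    score; stop when no remaining feature helps."""
    selected, best = [], score([])
    remaining = list(features)
    while remaining:
        top_score, top_feat = max((score(selected + [f]), f)
                                  for f in remaining)
        if top_score <= best:  # no feature improves accuracy: stop
            break
        selected.append(top_feat)
        remaining.remove(top_feat)
        best = top_score
    return selected

def toy_score(feats):
    # Hypothetical scorer: reward informative features, penalize size.
    return len(set(feats) & {"a", "b"}) - 0.01 * len(feats)

chosen = sequential_forward_selection(["a", "b", "c", "d"], toy_score)
```

Because each candidate subset is scored on held-out data inside the loop, the selection itself is part of the nested holdout procedure rather than a use of the final test set.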
Overfitting in tree induction
Splitting until each node becomes an individual leaf per instance
Same as the table model example
Leads to overfitting
A less complex tree could generalize better
Why is overfitting bad?
more complexity = more spurious correlations picked up
these may be unique to the training data, not the population
Every model can be overfitted
training data is a portion of population
No way to tell from training performance alone whether a model is overfit
use holdout data
Modeling Laboratory
building a laboratory is costly and time-consuming
but aspects of the real system can be evaluated more quickly in the lab
some models don't work in real as in lab
training and deployment populations are different