Chapter 5: Overfitting and its avoidance
Overfitting
models tailored too closely to the training data do not generalize well
most modeling procedures have some tendency to overfit
avoiding overfitting involves trade-offs
Holdout data
fundamental evaluation tool
Not used in building the model
The model will hopefully predict these values
Used to estimate (test) the model's generalization performance
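A minimal sketch of holdout evaluation in plain Python, using a hypothetical toy dataset of (features, label) pairs; the split fraction and seed are illustrative choices, not from the notes:

```python
import random

def holdout_split(data, test_fraction=0.3, seed=0):
    """Shuffle the data, then hold out a fraction that is never used in building the model."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, holdout set)

# Hypothetical toy dataset of (features, label) pairs.
data = [((i, i % 3), i % 2) for i in range(10)]
train, holdout = holdout_split(data)
print(len(train), len(holdout))  # 7 3
```

The model is fit only on `train`; accuracy measured on `holdout` is the estimate of generalization performance.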
Fitting graph
shows the accuracy of a model as a function of complexity
Generally there will be more overfitting with more complex models
Overfitting in tree induction
a fully grown tree acts like a lookup table, but should still do slightly better on new data
Limit the size of the tree
All model types can be overfit
No specific technique will eliminate overfitting
Overfitting Mathematical functions
can become more complex by adding more variables
Prune attributes in order to avoid overfitting
Adding new attributes that are nonlinear in the originals (e.g., squared terms) makes the function no longer truly linear in the original attributes
Overfitting Linear functions
parabola
Adding variables
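A small sketch of the parabola point: the model stays linear in its parameters, but adding a squared attribute lets it trace a curve. The attribute names and weights below are illustrative, not from the notes:

```python
def predict(weights, attributes):
    """A linear model: weighted sum of attribute values."""
    return sum(w * a for w, a in zip(weights, attributes))

def original_attrs(x):
    return (1.0, x)          # intercept and x: can only produce straight lines

def expanded_attrs(x):
    return (1.0, x, x * x)   # add a new, nonlinear attribute x²

# With the extra attribute, the "linear" model can represent a parabola exactly:
weights = (0.0, 0.0, 1.0)
print([predict(weights, expanded_attrs(x)) for x in (-2, -1, 0, 1, 2)])
# [4.0, 1.0, 0.0, 1.0, 4.0]
```

Each added variable gives the model more freedom to fit the training data, which is exactly how such functions become complex enough to overfit.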
Holdout evaluation
estimate accuracy
Cross Validation
more sophisticated holdout training and testing procedure
Mean and variance
Better use of limited data set
unlike a single split into training and holdout sets
computes its estimate over all of the data
performs multiple splits, systematically swapping out the samples used for testing
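A sketch of k-fold cross-validation in plain Python. The labels and the trivial majority-class "model" are hypothetical stand-ins; the point is that every example is tested exactly once, and we report the mean and variance of the fold scores:

```python
import statistics

def kfold_splits(n, k):
    """Yield (train_indices, test_indices) for k folds; each example is tested exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

# Hypothetical labels; the "model" simply predicts the training majority class.
labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
scores = []
for train_idx, test_idx in kfold_splits(len(labels), k=5):
    majority = round(sum(labels[i] for i in train_idx) / len(train_idx))
    acc = sum(labels[i] == majority for i in test_idx) / len(test_idx)
    scores.append(acc)

print(statistics.mean(scores), statistics.pvariance(scores))
```

The variance across folds is what a single holdout split cannot give you: it indicates how sensitive the performance estimate is to the particular split.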
Learning curves
as the training-set size changes, expect different generalization performance
Plot of generalization performance (measured on testing data only) against the amount of training data used
A fitting graph shows
generalization performance and the performance on the training data
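A learning curve can be sketched by evaluating the same model on a fixed test set while growing the training set. The 1-D data and the 1-nearest-neighbour classifier below are hypothetical, chosen only because they fit in a few lines:

```python
def nn_accuracy(train, test):
    """Accuracy of a 1-nearest-neighbour classifier on a single numeric feature."""
    correct = 0
    for x, y in test:
        nearest = min(train, key=lambda ex: abs(ex[0] - x))
        correct += (nearest[1] == y)
    return correct / len(test)

# Hypothetical 1-D data: the label is 1 when the feature exceeds 5.
train = [(1, 0), (4, 0), (9, 1), (6, 1), (2, 0), (8, 1), (5, 0), (7, 1)]
test = [(0, 0), (3, 0), (4.5, 0), (5.5, 1), (6.5, 1), (10, 1)]

# Learning curve: test-set accuracy as the amount of training data grows.
curve = [(n, nn_accuracy(train[:n], test)) for n in (2, 4, 8)]
print(curve)  # [(2, 0.5), (4, 1.0), (8, 1.0)]
```

With only two training examples the classifier generalizes poorly; more data improves it, then the curve flattens, which is the typical learning-curve shape.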
Overfitting avoidance and complexity control
main problem with tree induction
keeps growing the tree to fit the training data until it creates pure leaf nodes
Two ways to fix
stop before the tree gets too complex
minimum-instance stopping: stop splitting when a node holds too few instances
grow the tree until it is too large then prune it
Determine the threshold
minimum number of instances allowed at a leaf
hypothesis test
assesses whether an observed difference could be due to chance; either stops the tree or lets it keep growing
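A toy sketch of minimum-instance stopping. The tiny 1-D tree grower below is illustrative (real tree induction chooses splits by purity measures, not the median); the point is only the stopping rule:

```python
def should_stop(labels, min_instances=4):
    """Minimum-instance stopping: don't split nodes that are pure or too small."""
    return len(labels) < min_instances or len(set(labels)) == 1

def grow_tree(points, labels, min_instances=4):
    """Illustrative 1-D tree grower: split at the median until the stopping rule fires."""
    if should_stop(labels, min_instances):
        majority = max(set(labels), key=labels.count)
        return ("leaf", majority)
    threshold = sorted(points)[len(points) // 2]
    left = [i for i, p in enumerate(points) if p < threshold]
    right = [i for i, p in enumerate(points) if p >= threshold]
    if not left or not right:  # cannot split any further
        return ("leaf", max(set(labels), key=labels.count))
    return ("node", threshold,
            grow_tree([points[i] for i in left], [labels[i] for i in left], min_instances),
            grow_tree([points[i] for i in right], [labels[i] for i in right], min_instances))

tree = grow_tree([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
print(tree)  # ('node', 10, ('leaf', 0), ('leaf', 1))
```

Without the minimum-instance rule the grower would keep splitting toward pure single-instance leaves, which is exactly the overfitting the stopping criterion controls.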
General method for overfitting
choose best by estimating the generalization performance
Training data subset
Split the training data
sub-training set and validation set
Nested cross validation
Sequential forward selection
nested holdout procedure to pick the best individual feature
Sequential backward elimination of features
starting with all the features and discarding features one at a time
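Sequential forward selection with a nested holdout can be sketched as below. The 1-nearest-neighbour scorer and the tiny two-feature dataset (feature 0 predictive, feature 1 noise) are hypothetical choices for illustration:

```python
def accuracy(features, train, validation):
    """Score a feature subset: 1-NN built on the sub-training set, judged on the validation set."""
    def dist(a, b):
        return sum((a[f] - b[f]) ** 2 for f in features)
    correct = 0
    for x, y in validation:
        nearest = min(train, key=lambda ex: dist(ex[0], x))
        correct += (nearest[1] == y)
    return correct / len(validation)

def forward_selection(n_features, train, validation):
    """Greedily add the single feature that most improves validation accuracy; stop when none helps."""
    chosen, best = [], 0.0
    while True:
        candidates = [f for f in range(n_features) if f not in chosen]
        scored = [(accuracy(chosen + [f], train, validation), f) for f in candidates]
        if not scored:
            break
        top_score, top_f = max(scored)
        if top_score <= best:
            break
        chosen.append(top_f)
        best = top_score
    return chosen, best

# Hypothetical data: feature 0 tracks the label, feature 1 is uninformative.
train = [((0.0, 5.0), 0), ((0.1, 1.0), 0), ((1.0, 4.9), 1), ((0.9, 1.1), 1)]
validation = [((0.05, 4.94), 0), ((0.95, 1.02), 1)]
chosen, best = forward_selection(2, train, validation)
print(chosen, best)  # [0] 1.0
```

Backward elimination is the mirror image: start with all features in `chosen` and greedily discard the one whose removal hurts validation accuracy least.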
Avoiding overfitting in parameter optimization
regularization
Arg Max
Maximize the fit over all possible arguments
Good tree size and feature set can be chosen via nested cross validation
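The arg-max idea with regularization can be made concrete for the simplest case, a one-parameter linear fit y ≈ w·x with an L2 (ridge-style) penalty. The data and penalty weight below are hypothetical; the closed form w* = Σxy / (Σx² + λ) follows from setting the derivative of the penalized squared error to zero:

```python
def ridge_slope(xs, ys, lam):
    """Arg max of fit minus penalty for y ≈ w*x:
    w* = argmax_w [ -Σ(y - w·x)² - lam·w² ]  =  Σxy / (Σx² + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]              # hypothetical toy data with true slope 2
print(ridge_slope(xs, ys, 0.0))   # 2.0: no penalty, pure fit maximization
print(ridge_slope(xs, ys, 14.0))  # 1.0: the penalty shrinks the slope toward 0
```

With lam = 0 this is plain fit maximization; a larger lam trades training fit for simplicity, which is the complexity control regularization provides. The value of lam itself is a complexity parameter, so it too can be chosen by nested cross-validation.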