Chapter 5: Overfitting and Its Avoidance Douglas Beighle
We want patterns that generalize, not just patterns that exist by chance
Overfitting: tailoring the model too strongly to the training data at the expense of generalization. The fact that the model works on your training data gives no indication of whether it will work on unseen data.
How to recognize overfitting:
Check generalization by comparing the model's predictions on holdout data against the actual holdout values
Accuracy of the model as a function of complexity
See the fitting graph on p. 115 of the text
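A toy sketch of the holdout comparison above (the data and model names are my own, not from the text): a complex "memorizer" model looks perfect on the training data but does noticeably worse on holdout data, while a simple model scores about the same on both. That gap is the signature of overfitting.

```python
import random

random.seed(0)
xs = [random.random() for _ in range(200)]
ys = [1 if x > 0.5 else 0 for x in xs]
for i in range(0, 200, 7):              # flip ~15% of labels to simulate noise
    ys[i] = 1 - ys[i]
train_x, train_y = xs[:100], ys[:100]   # first half for training
hold_x, hold_y = xs[100:], ys[100:]     # second half held out

def fit_majority(X, y):
    # simple model: always predict the most common training label
    pred = max(set(y), key=y.count)
    return lambda q: pred

def fit_memorizer(X, y):
    # complex model: 1-nearest-neighbor lookup, effectively memorizing the data
    def predict(q):
        i = min(range(len(X)), key=lambda i: abs(X[i] - q))
        return y[i]
    return predict

def accuracy(model, X, y):
    return sum(model(q) == t for q, t in zip(X, y)) / len(y)

results = {}
for name, fit in [("majority", fit_majority), ("memorizer", fit_memorizer)]:
    m = fit(train_x, train_y)
    results[name] = (accuracy(m, train_x, train_y), accuracy(m, hold_x, hold_y))
    print(name, results[name])
```

The memorizer scores 100% on training data, but the noisy labels it memorized drag its holdout accuracy down.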
A general Approach to avoiding overfitting
Nested cross-validation works but is complicated
The inner loop estimates the generalization performance of each candidate model
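A minimal sketch of nested cross-validation (all helper names and the toy threshold-model family are my own): the inner loop picks a complexity parameter using only the outer-training data, and the outer loop scores that whole selection procedure on data the inner loop never saw.

```python
def folds(n, k):
    # index lists: every k-th example goes to fold i
    return [list(range(i, n, k)) for i in range(k)]

def cv_score(X, y, fit, k):
    # mean accuracy of `fit` across k folds
    accs = []
    for te in folds(len(X), k):
        tr = [i for i in range(len(X)) if i not in te]
        model = fit([X[i] for i in tr], [y[i] for i in tr])
        accs.append(sum(model(X[i]) == y[i] for i in te) / len(te))
    return sum(accs) / len(accs)

def nested_cv(X, y, params, fitter_for, outer_k=5, inner_k=3):
    outer_scores = []
    for te in folds(len(X), outer_k):
        tr = [i for i in range(len(X)) if i not in te]
        Xtr, ytr = [X[i] for i in tr], [y[i] for i in tr]
        # inner CV chooses the best parameter using only outer-training data
        best_p = max(params, key=lambda p: cv_score(Xtr, ytr, fitter_for(p), inner_k))
        model = fitter_for(best_p)(Xtr, ytr)
        outer_scores.append(sum(model(X[i]) == y[i] for i in te) / len(te))
    return outer_scores

def threshold_fitter(p):
    # hypothetical model family: classify as 1 when x >= p (training data unused)
    return lambda X, y: (lambda x: int(x >= p))

X = [i / 40 for i in range(40)]
y = [int(x >= 0.5) for x in X]
print(nested_cv(X, y, [0.25, 0.5, 0.75], threshold_fitter))
```

Each outer score is an honest generalization estimate, because the parameter choice itself was made without touching the outer test fold.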
Sequential Forward Selection (SFS)
Sequential Backward Elimination of Features
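A hedged sketch of sequential forward selection; the scoring function below is a made-up stand-in for "evaluate this feature subset on holdout data", with "a" and "b" informative and every extra feature carrying a small cost.

```python
def sequential_forward_selection(features, score):
    # greedily add the feature that most improves the score; stop when none helps
    selected, best = [], score([])
    while True:
        candidates = [f for f in features if f not in selected]
        gains = [(score(selected + [f]), f) for f in candidates]
        if not gains:
            break
        top_score, top_f = max(gains)
        if top_score <= best:
            break                       # no remaining feature improves the score
        selected.append(top_f)
        best = top_score
    return selected, best

# Toy score: "a" and "b" are informative, every chosen feature costs 0.1
useful = {"a", "b"}
def score(subset):
    return len(set(subset) & useful) - 0.1 * len(subset)

print(sequential_forward_selection(["a", "b", "c", "d"], score))
```

Sequential backward elimination is the same loop run in reverse: start from all features and repeatedly drop the one whose removal hurts the score least.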
Overfitting in Tree Induction
Find the sweet spot on the graph of holdout versus training performance, where holdout accuracy peaks
Place a limit on the minimum number of instances that must be present at a leaf
There is no universally right number of instances; it depends on the problem
Hypothesis testing can give an idea of whether a split reflects a real pattern or chance
Stop growing the tree before it gets too complex
Or grow the tree and then slowly prune it back
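A toy sketch of the minimum-instances pre-pruning rule above (all names are mine): a tiny 1-D tree grower that refuses to split any node holding fewer than `min_samples` instances. A small limit lets the tree chase individual noisy points; a large one keeps it shallow.

```python
def majority(ys):
    return max(set(ys), key=ys.count)

def grow(xs, ys, min_samples):
    # stop early: node too small or already pure -> leaf with the majority label
    if len(xs) < min_samples or len(set(ys)) == 1:
        return ("leaf", majority(ys))
    best = None
    for t in sorted(set(xs))[:-1]:          # candidate thresholds x <= t
        L = [(x, y) for x, y in zip(xs, ys) if x <= t]
        R = [(x, y) for x, y in zip(xs, ys) if x > t]
        ly, ry = [y for _, y in L], [y for _, y in R]
        err = sum(y != majority(ly) for y in ly) + sum(y != majority(ry) for y in ry)
        if best is None or err < best[0]:
            best = (err, L, R)
    _, L, R = best
    return ("split",
            grow([x for x, _ in L], [y for _, y in L], min_samples),
            grow([x for x, _ in R], [y for _, y in R], min_samples))

def depth(node):
    return 0 if node[0] == "leaf" else 1 + max(depth(node[1]), depth(node[2]))

xs = list(range(40))
ys = [0] * 20 + [1] * 20
ys[5], ys[30] = 1, 0                        # two noisy labels

deep = grow(xs, ys, min_samples=2)          # keeps splitting to isolate the noise
shallow = grow(xs, ys, min_samples=25)      # stops after the one sensible split
print(depth(deep), depth(shallow))
```

The alternative mentioned above, growing the full tree and pruning it back, would instead build `deep` and then collapse subtrees whose splits don't improve holdout performance.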
Overfitting in Mathematical Functions
The more complex the function, the more likely it is to be overfit.
Avoid using too many variables.
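A toy illustration of the point above (the numbers are my own): with as many parameters as data points, a degree-5 polynomial threads every noisy training point exactly, yet its prediction at a new x lands far from the simple y ≈ x trend the data actually follows.

```python
def lagrange(xs, ys):
    # exact polynomial interpolation through the points (xs, ys)
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f

train_x = [0, 1, 2, 3, 4, 5]
train_y = [0.0, 1.1, 1.9, 3.2, 3.9, 5.1]    # roughly y = x plus noise
f = lagrange(train_x, train_y)

train_err = max(abs(f(x) - y) for x, y in zip(train_x, train_y))
holdout_err = abs(f(6) - 6.0)               # y = x would predict about 6 here
print(train_err, holdout_err)
```

Zero training error, large holdout error: exactly the failure mode of using too many variables.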
Cross-validation double-checks your model, revealing whether you are overfitting
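A minimal k-fold cross-validation sketch (the helper names are mine): every example is used for testing exactly once, so no score is computed on data the model was trained on.

```python
def k_fold_scores(X, y, fit, score, k=5):
    n = len(X)
    all_scores = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))   # every k-th example held out
        tr = [i for i in range(n) if i not in test_idx]
        te = sorted(test_idx)
        model = fit([X[i] for i in tr], [y[i] for i in tr])
        all_scores.append(score(model, [X[i] for i in te], [y[i] for i in te]))
    return all_scores

# usage with a trivial majority-class model and accuracy scoring
def fit_majority(X, y):
    pred = max(set(y), key=y.count)
    return lambda q: pred

def accuracy(model, X, y):
    return sum(model(q) == t for q, t in zip(X, y)) / len(y)

X = list(range(20))
y = [0] * 12 + [1] * 8
scores = k_fold_scores(X, y, fit_majority, accuracy, k=5)
print(scores)
```

The spread of the fold scores also hints at how stable the generalization estimate is, which a single holdout split cannot show.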
Learning Curve: A plot of the generalization performance against the amount of training data
Not the same as a fitting graph. Remember: fitting graphs plot performance against model complexity, while learning curves plot it against the amount of training data.
Can Show that adding more training data is not worthwhile
Additional training data is only useful up to a point
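A learning-curve sketch (toy data and names are my own): train on growing slices of the data and score each model on the same fixed holdout set; holdout accuracy typically climbs quickly, then flattens.

```python
import random

random.seed(1)
pool_x = [random.random() for _ in range(80)]           # training pool
pool_y = [1 if x >= 0.5 else 0 for x in pool_x]
hold_x = [random.random() for _ in range(100)]          # fixed holdout set
hold_y = [1 if x >= 0.5 else 0 for x in hold_x]

def fit_1nn(X, y):
    # 1-nearest-neighbor classifier over 1-D points
    def predict(q):
        i = min(range(len(X)), key=lambda i: abs(X[i] - q))
        return y[i]
    return predict

def accuracy(model, X, y):
    return sum(model(q) == t for q, t in zip(X, y)) / len(y)

sizes = [5, 20, 80]
curve = [accuracy(fit_1nn(pool_x[:n], pool_y[:n]), hold_x, hold_y) for n in sizes]
print(list(zip(sizes, curve)))
```

Plotting `curve` against `sizes` gives the learning curve; where it goes flat, collecting more data stops paying off.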
Essentially, overfitting stems from the problem of multiple comparisons
Always a tradeoff between complexity of the model and overfitting.