Chapter 5: Overfitting and Its Avoidance
Generalization
Table Model
Memorizes the training data and performs no generalization
Generalization
Property of a model or modeling process, whereby the model applies to data that were not used to build the model
Overfit
A model that does not generalize at all beyond the data that were used to build it; it is tailored, or "fit," perfectly to the training data
Overfitting
Trade-off between model complexity and the possibility of overfitting
The best strategy is to recognize overfitting and to manage complexity in a principled way
Overfitting Examined
Holdout Data and Fitting Graphs
Fitting Graph
: Shows the accuracy of a model as a function of complexity
Holdout Data
: "Hold out some data for which we know the value of the target variable, but which will not be used to build the model"
Like creating a "lab test" of generalization performance
We will hide from the model the actual values for the target on the holdout data
Generalization Performance
- Estimated by comparing the model's predicted values with the hidden true values
More overfitting when the model is more complex
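The holdout idea can be illustrated with a tiny sketch (hypothetical toy data, plain Python): a table model that memorizes its training instances scores perfectly on the training data but cannot generalize to held-out instances it has never seen.

```python
# Hypothetical toy data: (feature1, feature2) -> class label
train = {(1, 0): "churn", (0, 1): "stay", (1, 1): "churn"}
holdout = {(0, 0): "stay", (2, 1): "churn"}  # hidden during model building

def table_model(x):
    # Pure memorization: answers only for instances seen during training
    return train.get(x)

# Estimate generalization performance by comparing predictions
# against the hidden true values on the holdout data
train_acc = sum(table_model(x) == y for x, y in train.items()) / len(train)
holdout_acc = sum(table_model(x) == y for x, y in holdout.items()) / len(holdout)
print(train_acc, holdout_acc)  # perfect on training data, zero on holdout
```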
Overfitting in Tree Induction
Any training instance given to the tree for classification will make its way down, eventually landing at the appropriate leaf
Will it generalize?
A tree should be slightly better than a lookup table because it will give every unseen instance some classification
Rather than simply failing on every unmatched instance
Overfitting in Mathematical Functions
Functions become more complex as more variables (dimensions) are added
As you increase the dimensionality, you can perfectly fit larger and larger sets of arbitrary points
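As a one-dimensional analogue of this (a sketch using NumPy, with illustrative numbers): a polynomial with as many parameters as data points can fit any set of points exactly, just as adding dimensions lets a function fit arbitrary points perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(6.0)
y = rng.normal(size=6)          # six arbitrary target values

# A degree-5 polynomial has 6 coefficients: one parameter per point
coefs = np.polyfit(x, y, deg=5)
fitted = np.polyval(coefs, x)
print(np.allclose(fitted, y))   # the "model" hits the arbitrary points exactly
```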
Companies that do data science-driven targeting of online display advertising can build thousands of models each week
Why is Overfitting Bad?
Every data set is a finite sample of a larger population
It will be necessary to have a holdout set to detect overfitting
Cross-Validation
: More sophisticated holdout training and testing procedure
Gives statistics on the estimated performance: mean and variance
Computes its estimates over all the data by performing multiple splits and swapping out samples for testing
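A minimal sketch of the fold mechanics, using only the standard library and a deliberately trivial majority-class "model" (all data and parameters here are illustrative):

```python
import random
from statistics import mean, pstdev

random.seed(0)
# Illustrative data: (feature, label) pairs with a roughly 60/40 class split
data = [(random.random(), random.random() > 0.4) for _ in range(100)]

def majority_class(train):
    # Trivial stand-in for model building: predict the most common label
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

k = 5
fold_size = len(data) // k
accuracies = []
for i in range(k):
    # Each fold takes a turn as the test set; the rest is training data
    test = data[i * fold_size:(i + 1) * fold_size]
    train = data[:i * fold_size] + data[(i + 1) * fold_size:]
    prediction = majority_class(train)
    accuracies.append(sum(prediction == y for _, y in test) / len(test))

print(mean(accuracies), pstdev(accuracies))  # estimate: mean and variation
```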
The Churn Dataset Revisited
Average accuracy of the folds with classification trees is lower than the accuracy measured on the training data
This means there is overfitting
Variation in the performances in the different folds
Good idea to average them to get a notion of the performance as well as the variation we can expect
Compare the fold accuracies between logistic regression and classification trees
Logistic regression shows lower accuracy with higher variation
Learning Curves
Learning Curve
: A plot of generalization performance against the amount of training data
Usually has a characteristic shape
Starts steep, then flattens out as the marginal advantage of additional training data decreases
Learning Curves vs. Fitting Graphs
Learning curve shows generalization performance, plotted against the amount of training data used
A fitting graph shows both generalization performance and performance on the training data, plotted against model complexity
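A sketch of generating learning-curve data, assuming scikit-learn is available (the synthetic dataset and parameters are illustrative, not the book's churn data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# Generalization performance as a function of the amount of training data
for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(n, score)
```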
Overfitting Avoidance and Complexity Control
Avoiding Overfitting with Tree Induction
Main problem
: It will keep growing the tree to fit the training data until it creates pure leaf nodes
Solution
: Limit tree size by specifying a minimum number of instances that must be present in a leaf
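One way to sketch this limit, assuming scikit-learn is available (illustrative synthetic data; `min_samples_leaf=20` is an arbitrary choice, not a recommended value):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, so an unrestricted tree must overfit to reach purity
X, y = make_classification(n_samples=500, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unrestricted tree: keeps growing until every leaf is pure
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Restricted tree: every leaf must contain at least 20 training instances
limited = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X_tr, y_tr)

print(full.get_n_leaves(), limited.get_n_leaves())     # far fewer leaves when limited
print(full.score(X_tr, y_tr), full.score(X_te, y_te))  # perfect on training, worse on holdout
print(limited.score(X_tr, y_tr), limited.score(X_te, y_te))
```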
A General Method for Avoiding Overfitting
Test data should be strictly independent of model building