Overfitting and Its Avoidance: Chapter 5
Overfitting
Tendency of procedures to tailor models to the training data
At expense of generalization
"If you torture the data long enough, it will confess"
Holdout Data
Used to evaluate the model on data it was not trained on
Fitting Graph
Shows accuracy of a model as a function of model complexity
Generalization performance
Tree Induction
Mathematical functions
Holdout Evaluation to Cross-Validation
Avoid being fooled by overfitting
Confidence in single estimates of model accuracy?
Computing confidence intervals
General testing procedures
Cross validation
Statistics on estimated performance
Mean and variance
Assesses confidence in performance estimate
Makes better use of a limited dataset
Multiple splits
Systematically swapping out samples for testing
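A minimal sketch of the cross-validation procedure outlined above, using only the Python standard library. The toy data (two overlapping Gaussian classes) and the threshold learner `fit_threshold` / `predict_threshold` are hypothetical stand-ins chosen for illustration; they are not taken from the chapter.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return one holdout accuracy per fold; each fold is the test set exactly once."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        hits = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores

# Hypothetical toy learner: a threshold halfway between the two class means.
def fit_threshold(xs, ys):
    mean0 = statistics.mean([x for x, y in zip(xs, ys) if y == 0])
    mean1 = statistics.mean([x for x, y in zip(xs, ys) if y == 1])
    return (mean0 + mean1) / 2

def predict_threshold(threshold, x):
    return int(x > threshold)

# Hypothetical toy data: class y is drawn from a normal distribution centred at 2*y.
rng = random.Random(42)
ys = [i % 2 for i in range(200)]
xs = [rng.gauss(2.0 * y, 1.0) for y in ys]

scores = cross_validate(xs, ys, fit_threshold, predict_threshold, k=5)
mean_acc = statistics.mean(scores)      # the performance estimate
var_acc = statistics.variance(scores)   # how much that estimate varies across splits
print(f"fold accuracies: {scores}")
print(f"mean={mean_acc:.3f} variance={var_acc:.5f}")
```

The mean is the performance estimate; the variance across folds gives a rough sense of how far a single holdout estimate could be trusted.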
Learning Curves
A plot of generalization performance against amount of training data
Important analytical tool
Have a characteristic shape
Initially steep
As the modeling procedure trains on larger datasets, it finds more accurate models
Marginal advantage of having more data decreases, so learning curve becomes less steep
Difference between learning curves and fitting graphs
A learning curve shows only generalization performance (performance on testing data), plotted against amount of training data used
A fitting graph shows both generalization performance and performance on training data, plotted against model complexity
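A learning curve as described above could be computed along these lines. This is a sketch under stated assumptions: the data generator `make_data`, the threshold learner, the training sizes, and the number of repeats are all hypothetical choices for illustration. Holdout accuracy is averaged over several random training sets per size so the curve is less noisy.

```python
import random
import statistics

def make_data(n, rng):
    """Hypothetical two-class data: class y is drawn from a normal centred at 2*y."""
    ys = [i % 2 for i in range(n)]
    xs = [rng.gauss(2.0 * y, 1.0) for y in ys]
    return xs, ys

def fit_threshold(xs, ys):
    """Toy learner: threshold halfway between the two class means."""
    mean0 = statistics.mean([x for x, y in zip(xs, ys) if y == 0])
    mean1 = statistics.mean([x for x, y in zip(xs, ys) if y == 1])
    return (mean0 + mean1) / 2

def accuracy(threshold, xs, ys):
    return sum(int(x > threshold) == y for x, y in zip(xs, ys)) / len(xs)

def learning_curve(train_sizes, repeats=30, seed=0):
    """Return (training size, mean holdout accuracy) pairs."""
    rng = random.Random(seed)
    test_xs, test_ys = make_data(500, rng)  # fixed holdout set, never trained on
    curve = []
    for n in train_sizes:
        accs = []
        for _ in range(repeats):
            train_xs, train_ys = make_data(n, rng)
            model = fit_threshold(train_xs, train_ys)
            accs.append(accuracy(model, test_xs, test_ys))
        curve.append((n, statistics.mean(accs)))
    return curve

sizes = [2, 4, 8, 16, 32, 64, 128]
curve = learning_curve(sizes)
for n, acc in curve:
    print(f"{n:4d} training examples -> holdout accuracy {acc:.3f}")
```

Only the amount of training data changes along the x-axis; the model family stays the same. A fitting graph would instead hold the data fixed and vary model complexity.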
Avoiding Overfitting
Tree Induction
Keep growing the tree to fit data until it creates pure leaf nodes
Results in large, overly complex trees that overfit data
Stop growing tree before it gets too complex
Grow tree until it is too large, then prune it back
Limit tree size
General method
Choose best model by estimating generalization performance of each
Test data should be strictly independent of model building
Pick best model based on validation data, keeping the test data for the final evaluation
Split training data into a sub-training set and a validation set
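The general method above can be sketched with a small 1-D tree whose size is limited by a depth cap. Everything here is an assumption made for illustration: the toy data, the split sizes, the candidate depths, and the greedy misclassification-based splitter are hypothetical, not the chapter's own implementation. Complexity is chosen on the validation set; the test set is touched once, at the end.

```python
import random

def majority(ys):
    """Most common label in ys (ties go to class 1)."""
    return int(2 * sum(ys) >= len(ys))

def errors(ys):
    """Mistakes made by predicting the majority label for ys."""
    return min(sum(ys), len(ys) - sum(ys))

def grow(points, max_depth):
    """Grow a tree on (x, y) pairs sorted by x; stop at max_depth or at a pure leaf."""
    ys = [y for _, y in points]
    if max_depth == 0 or errors(ys) == 0:
        return majority(ys)
    # Candidate cuts lie between neighbouring points with different x values.
    cuts = [i for i in range(1, len(points)) if points[i - 1][0] != points[i][0]]
    if not cuts:
        return majority(ys)
    i = min(cuts, key=lambda j: errors(ys[:j]) + errors(ys[j:]))
    cut = (points[i - 1][0] + points[i][0]) / 2
    return (cut, grow(points[:i], max_depth - 1), grow(points[i:], max_depth - 1))

def predict(tree, x):
    while isinstance(tree, tuple):  # internal nodes are (cut, left, right)
        cut, left, right = tree
        tree = left if x < cut else right
    return tree

def accuracy(tree, points):
    return sum(predict(tree, x) == y for x, y in points) / len(points)

# Hypothetical toy data: class y is drawn from a normal distribution centred at 2*y.
rng = random.Random(7)
data = [(rng.gauss(2.0 * y, 1.0), y) for y in (i % 2 for i in range(400))]
rng.shuffle(data)
sub_train, validation, test = sorted(data[:200]), data[200:300], data[300:]

depths = [1, 2, 3, 4, 6, 8, 12]
results = []
for d in depths:
    tree = grow(sub_train, d)
    results.append((d, accuracy(tree, sub_train), accuracy(tree, validation)))
    print(f"depth {d:2d}: sub-train {results[-1][1]:.3f}  validation {results[-1][2]:.3f}")

# Pick the depth with the best validation accuracy (the smallest depth wins ties).
best_depth = max(results, key=lambda r: r[2])[0]
# Rebuild on all training data at that depth, then evaluate once on the test set.
final_tree = grow(sorted(sub_train + validation), best_depth)
test_acc = accuracy(final_tree, test)
print(f"chosen depth {best_depth}, test accuracy {test_acc:.3f}")
```

The printed table is effectively a fitting graph: accuracy on the sub-training set can only rise as the depth cap is relaxed, while validation accuracy need not follow it.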