Overfitting and Its Avoidance
Generalization
Property of a model or modeling process
Every dataset is a finite sample of a larger population
We want models to apply not just to the exact training data but to the general population it was drawn from
Learning curves
plot of generalization performance against the amount of training data
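A minimal sketch of plotting a learning curve, assuming a synthetic dataset and scikit-learn's learning_curve helper (the model and training sizes are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic dataset standing in for a labeled sample.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Cross-validated generalization performance at increasing training-set sizes.
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

plt.plot(sizes, test_scores.mean(axis=1), marker="o")
plt.xlabel("Number of training instances")
plt.ylabel("Generalization accuracy (cross-validated)")
plt.title("Learning curve")
plt.show()
```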
Overfitting
Tendency of data mining procedures to tailor models to the training data at the expense of generalization to previously unseen data points
Pure memorization (see the sketch below)
The most extreme overfitting procedure possible
All data mining procedures have a tendency to overfit to some extent
If we look hard enough, we will always find patterns in the data, even spurious ones
"If you torture the data long enough, it will confess."
The best strategy is to recognize overfitting and manage complexity in a principled way
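To make the memorization point concrete, here is a toy sketch (purely illustrative setup) of a lookup-table "model" that is perfect on the data it has seen and near chance on holdout data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = rng.integers(0, 2, size=500)          # labels are pure noise on purpose

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pure memorization: store every training instance with its label.
table = {tuple(row): label for row, label in zip(X_tr, y_tr)}

def memorizer_predict(rows, default=0):
    # Return the memorized label if seen before, otherwise a default guess.
    return np.array([table.get(tuple(r), default) for r in rows])

print("training accuracy:", (memorizer_predict(X_tr) == y_tr).mean())  # 1.0
print("holdout accuracy: ", (memorizer_predict(X_te) == y_te).mean())  # ~0.5
```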
How to recognize overfitting?
Holdout data
Data not used to build the model
"Hold out" data at the beginning
Think of it as creating a "lab test" of generalization performance
Estimate generalization performance by comparing predictions against the hidden true values
Also known as the "test set"
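A minimal holdout sketch, assuming a synthetic dataset and scikit-learn; the held-out portion plays the role of the "lab test":

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# "Hold out" 25% of the data before any modeling happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Generalization estimate: compare predictions with the hidden true labels.
print("training accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("holdout accuracy: ", accuracy_score(y_test, model.predict(X_test)))
```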
Cross-validation
More sophisticated holdout training & testing procedure
Better use of limited data
Computes estimates over all data
Splits the labeled dataset into k partitions (folds); each fold is used once for testing while the model is trained on the remaining k-1 folds
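A sketch of k-fold cross-validation with scikit-learn (k = 10 and the dataset are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Every labeled instance is used for testing exactly once across the 10 folds.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores.round(3))
print("mean / std:       ", scores.mean().round(3), scores.std().round(3))
```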
Fitting graph
Shows accuracy of model as function of complexity
Plots the difference between the modeling procedure's accuracy on training data and its accuracy on holdout data as model complexity changes
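A fitting-graph sketch, assuming a synthetic dataset and using tree depth as the complexity axis (an illustrative choice):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Some label noise (flip_y) so overfitting is visible.
X, y = make_classification(n_samples=2000, n_informative=5, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

depths = range(1, 21)
train_acc, test_acc = [], []
for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_tr, y_tr)
    train_acc.append(tree.score(X_tr, y_tr))
    test_acc.append(tree.score(X_te, y_te))

plt.plot(depths, train_acc, label="training accuracy")
plt.plot(depths, test_acc, label="holdout accuracy")
plt.xlabel("Model complexity (tree depth)")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```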
Overfitting in Tree Induction
A tree grown without restriction can be perfectly accurate, predicting the correct class for every training instance
A tree is better than a lookup table because a previously unseen instance still arrives at some classification
A procedure that grows trees until the leaves are pure will overfit
The complexity of a tree lies in its number of nodes
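A sketch contrasting a tree grown until its leaves are pure with a depth-limited tree, using node count as the complexity measure (toy data assumed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, flip_y=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)   # pure leaves
small = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

for name, tree in [("unrestricted", full), ("depth-limited", small)]:
    print(f"{name:14s} nodes={tree.tree_.node_count:4d} "
          f"train={tree.score(X_tr, y_tr):.3f} "
          f"holdout={tree.score(X_te, y_te):.3f}")
```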
Overfitting in Mathematical Functions
A function can become more complex by adding more variables (attributes)
Modelers can also change the function so it is no longer linear
They do this by adding non-linear attributes, such as squared or interaction terms
Dropping attributes can also help avoid overfitting
Careful manual attribute selection is an option when time permits
but it may not be feasible for large datasets
Increasing dimensionality lets a model perfectly fit ever larger sets of arbitrary points
Even when it cannot fit them perfectly, it fits them better and better
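A sketch of adding non-linear (polynomial) attributes to a linear model, showing training accuracy climbing while holdout accuracy lags (dataset and degrees are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=300, n_features=10, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Higher degrees add more non-linear attributes, i.e. more complexity.
for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree), StandardScaler(),
                          LogisticRegression(max_iter=5000))
    model.fit(X_tr, y_tr)
    print(f"degree={degree} train={model.score(X_tr, y_tr):.3f} "
          f"holdout={model.score(X_te, y_te):.3f}")
```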
Why is Overfitting Bad?
As a model gets more complex, it is allowed to pick up harmful spurious correlations
Idiosyncrasies
These do not represent characteristics of the population in general
Two-class example
Avoidance & Complexity Control
Avoid by controlling complexity
Tree induction will keep growing the tree to fit the training data
Two techniques to avoid overfitting (sketched below):
Stop growing the tree before it gets too complex
Grow the tree until it is too large, then prune it back
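A sketch of both techniques using scikit-learn's rough equivalents (an assumed mapping): min_samples_leaf stops growth early, while cost-complexity pruning (ccp_alpha) prunes a fully grown tree back:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, flip_y=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# (1) Stop growing before the tree gets too complex.
early = DecisionTreeClassifier(min_samples_leaf=25, random_state=0)
# (2) Grow the tree fully, then prune it back.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)

for name, tree in [("early stopping", early), ("post-pruning", pruned)]:
    tree.fit(X_tr, y_tr)
    print(f"{name:14s} nodes={tree.tree_.node_count:3d} "
          f"holdout accuracy={tree.score(X_te, y_te):.3f}")
```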
General avoidance
Keep test data strictly independent of model building
so we can get an independent estimate of model accuracy
Realize there is nothing special about the first training/test split
For parameter optimization
Use nested cross-validation to choose appropriate parameter values, such as the weight given to the complexity penalty
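A nested cross-validation sketch: the inner loop picks a complexity parameter, the outer loop estimates generalization performance (dataset and parameter grid are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, flip_y=0.1, random_state=0)

# Inner loop: choose the complexity parameter on the training folds only.
inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     param_grid={"min_samples_leaf": [1, 5, 10, 25, 50]},
                     cv=5)

# Outer loop: estimate how well the whole tuning-plus-fitting procedure generalizes.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy:", outer_scores.mean().round(3))
```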