Overfitting and its Avoidance
Generalization
Learning Curve
Becomes less steep w/ more data
Marginal advantage
Generalization performance plot
Amount of training data
Property of a model
Table Model
Memorizes training data
Performs no generalization
Useless for predictions w/ different data
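The table model above can be sketched as a plain lookup table. This is only an illustration: the class name `TableModel`, the feature tuples, and the labels are made up here, not taken from the source.

```python
class TableModel:
    """Memorizes training examples exactly; performs no generalization."""

    def __init__(self):
        self.table = {}

    def fit(self, rows, labels):
        # Store every training row verbatim as a lookup key.
        for row, label in zip(rows, labels):
            self.table[tuple(row)] = label
        return self

    def predict(self, row, default=None):
        # Rows never seen in training have no entry, so the model has no answer.
        return self.table.get(tuple(row), default)


# Hypothetical toy data: (age, has_contract) -> outcome
train_rows = [(25, "yes"), (40, "no"), (33, "yes")]
train_labels = ["churn", "stay", "stay"]

model = TableModel().fit(train_rows, train_labels)
train_accuracy = sum(
    model.predict(r) == y for r, y in zip(train_rows, train_labels)
) / len(train_rows)
unseen_prediction = model.predict((29, "no"))

print(train_accuracy)     # perfect on the memorized rows
print(unseen_prediction)  # no prediction for a row it has not seen
```

Perfect training accuracy here says nothing about how the model behaves on different data, which is why training accuracy alone is a poor measure.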
Cross Validation
Split dataset into k partitions (folds)
Usually 5 or 10
Iterates training & testing k times
Better use of limited dataset
Every example used for testing exactly once
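A minimal sketch of the fold bookkeeping described above, assuming the examples are addressed by index. The helper name `k_fold_indices` is hypothetical, and shuffling before splitting is left out to keep the sketch short.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_idx, test_idx) pairs, one per fold."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    # A real run would train on train_idx and evaluate on test_idx here.
    print("train:", train_idx, "test:", test_idx)

# Collect every index that appeared in some test fold.
tested = sorted(i for _, test_idx in folds for i in test_idx)
```

Each iteration trains on k-1 folds and tests on the remaining one, so the k test results can be averaged into a single estimate.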
With Tree Induction
2 common techniques to avoid overfitting
Stop growing the tree before it gets too complex
Grow the tree until it's too large, then prune it back
Simplest method to limit tree size
Specify minimum # of leaf instances
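The effect of a minimum number of leaf instances can be sketched with a tiny tree learner on one numeric feature. Everything below is an assumption made for illustration: the function names, the misclassification-count split criterion, and the toy data are not from the source.

```python
from collections import Counter


def majority(points):
    return Counter(y for _, y in points).most_common(1)[0][0]


def errors(points):
    # Number of points that disagree with the majority label.
    labels = [y for _, y in points]
    return len(labels) - Counter(labels).most_common(1)[0][1]


def grow(points, min_leaf):
    """points: list of (x, label). Returns a nested dict, or a label for a leaf."""
    if errors(points) == 0:
        return majority(points)  # pure leaf, stop growing
    points = sorted(points)
    best = None
    # Only consider splits that leave at least min_leaf instances on each side.
    for i in range(min_leaf, len(points) - min_leaf + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical feature values
        cost = errors(points[:i]) + errors(points[i:])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        return majority(points)  # too few instances to split further
    i = best[1]
    return {
        "threshold": (points[i - 1][0] + points[i][0]) / 2,
        "left": grow(points[:i], min_leaf),
        "right": grow(points[i:], min_leaf),
    }


def count_leaves(tree):
    if not isinstance(tree, dict):
        return 1
    return count_leaves(tree["left"]) + count_leaves(tree["right"])


# Hypothetical noisy data: labels mostly switch from "a" to "b" as x grows.
data = list(zip(range(10), "aabaabbabb"))
full_tree = grow(data, min_leaf=1)     # grows until every leaf is pure
limited_tree = grow(data, min_leaf=3)  # stops early, leaves may be impure
print(count_leaves(full_tree), count_leaves(limited_tree))
```

With `min_leaf=1` the tree keeps splitting to isolate each noisy point. With `min_leaf=3` it stops early and accepts impure leaves.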
In Tree Induction
Data sets that aren't pure
Predict target variable based on averages
Holdout accuracy declines
Measures 2 values
Training set accuracy
Holdout data accuracy
Grows trees until leaves are pure (overfit)
Overfitting
Fitting Graph
Model accuracy as a function of complexity
Data mining procedures to tailor models
Manage complexity in a principled way
Negatives
Correlations produce incorrect generalizations
Need holdout data
Depends on complexity
Hide real values
Model predicts values
Test set
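The holdout idea above can be sketched as follows: set some labeled examples aside, let the model predict them without seeing their real values, then compare. The helper names, the threshold model, and the toy data are all assumptions made for this sketch.

```python
import random


def holdout_split(examples, holdout_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and set a fraction aside as holdout."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def fit_threshold(train):
    # Toy model: place the cut halfway between the two classes seen in training.
    lows = [x for x, y in train if y == 0]
    highs = [x for x, y in train if y == 1]
    return (max(lows) + min(highs)) / 2


# Hypothetical data: label is 1 when x >= 10.
examples = [(x, int(x >= 10)) for x in range(20)]
train, holdout = holdout_split(examples)

threshold = fit_threshold(train)

# The model sees only the holdout features; the real labels stay hidden
# until the predictions are scored.
predictions = [int(x >= threshold) for x, _ in holdout]
holdout_accuracy = sum(
    p == y for p, (_, y) in zip(predictions, holdout)
) / len(holdout)
print(holdout_accuracy)
```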
Mathematical functions
Increase dimensionality
Fit larger sets of arbitrary points
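One way to see why more dimensions allow fitting arbitrary points: a polynomial with n coefficients can pass exactly through any n points with distinct x values. The sketch below uses Lagrange interpolation as a stand-in for a fitted model. The function name and the points are made up for illustration.

```python
def lagrange_fit(points):
    """Return the degree-(n-1) polynomial through n points with distinct x."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    # Basis factor: equals 1 at x == xi and 0 at x == xj.
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


# Arbitrary points with no underlying pattern.
points = [(0, 3), (1, -1), (2, 4), (3, 1), (4, 5)]
model = lagrange_fit(points)

# Largest gap between the model and the training points.
max_training_error = max(abs(model(x) - y) for x, y in points)
print(max_training_error)
```

A perfect fit to the training points says nothing about whether the curve between or beyond them reflects a pattern that generalizes.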
Find patterns that generalize
Model types susceptible to overfitting