Overfitting & Avoidance Ch. 5
Finding chance occurrences in data that look like interesting patterns but don't generalize.
The tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data.
All models tend to overfit to some extent.
The best strategy is to recognize overfitting and manage complexity in a principled way.
In tree induction
A procedure that grows trees until the leaves are pure tends to overfit.
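A minimal sketch of the point above, not the chapter's own procedure: a tiny tree grown on one numeric feature until every leaf is pure, compared with a depth-1 tree. The synthetic data (label is 1 when x > 0.5, with 20% of labels flipped), the misclassification-count split rule, and all function names are assumptions made for illustration.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def errors(labels):
    """Training errors made by predicting the majority label."""
    return len(labels) - Counter(labels).most_common(1)[0][1]

def grow(points, max_depth=None, depth=0):
    """Grow a tree on (x, label) pairs. A leaf is a label, a node is a tuple."""
    labels = [y for _, y in points]
    if len(set(labels)) == 1 or (max_depth is not None and depth >= max_depth):
        return majority(labels)
    points = sorted(points)
    best = None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical feature values
        cost = (errors([y for _, y in points[:i]])
                + errors([y for _, y in points[i:]]))
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        return majority(labels)  # no valid split left
    i = best[1]
    # Points with x < threshold go left, the rest go right.
    return (points[i][0],
            grow(points[:i], max_depth, depth + 1),
            grow(points[i:], max_depth, depth + 1))

def predict(tree, x):
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree

def accuracy(tree, points):
    return sum(predict(tree, x) == y for x, y in points) / len(points)

def count_leaves(tree):
    if not isinstance(tree, tuple):
        return 1
    return count_leaves(tree[1]) + count_leaves(tree[2])

def make_data(n, rng, noise=0.2):
    """Assumed synthetic data: true label is x > 0.5, flipped with prob. noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

rng = random.Random(0)
train, holdout = make_data(120, rng), make_data(120, rng)
full_tree = grow(train)                 # grown until leaves are pure
small_tree = grow(train, max_depth=1)   # a single split

for name, tree in [("pure leaves", full_tree), ("depth 1", small_tree)]:
    print(name, "leaves:", count_leaves(tree),
          "train acc:", accuracy(tree, train),
          "holdout acc:", accuracy(tree, holdout))
```

The fully grown tree memorizes the training set, noise included, so its training accuracy says little about how it will do on the holdout points.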
Generalization: the model applies to data that were not used to build it.
Holdout data and fitting graphs
A fitting graph shows the accuracy of a model as a function of its complexity.
Holdout data: unseen data set aside as a "lab set".
The value of the target is known, but the data are not used to build the model.
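A rough sketch of how holdout data and a fitting graph fit together, under stated assumptions: the data are synthetic, the model family is a simple histogram classifier (majority label per equal-width bin), and "complexity" is taken to be the number of bins. None of these choices come from the chapter; the names `holdout_split`, `fit_bins`, and `bin_of` are hypothetical.

```python
import random
from collections import Counter

def holdout_split(rows, holdout_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and set a fraction aside as holdout data."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_holdout = round(len(rows) * holdout_fraction)
    return rows[n_holdout:], rows[:n_holdout]  # (train, holdout)

def bin_of(x, n_bins):
    """Index of the equal-width bin on [0, 1) that contains x."""
    return min(int(x * n_bins), n_bins - 1)

def fit_bins(train, n_bins):
    """Majority label per bin; empty bins fall back to the overall majority."""
    default = Counter(y for _, y in train).most_common(1)[0][0]
    votes = [Counter() for _ in range(n_bins)]
    for x, y in train:
        votes[bin_of(x, n_bins)][y] += 1
    return [v.most_common(1)[0][0] if v else default for v in votes]

def accuracy(model, rows):
    return sum(model[bin_of(x, len(model))] == y for x, y in rows) / len(rows)

# Assumed synthetic data: true label is x > 0.5, with 20% of labels flipped.
rng = random.Random(1)
data = []
for _ in range(300):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.2:
        y = 1 - y
    data.append((x, y))

train, holdout = holdout_split(data)

# The fitting graph as a table: complexity, training accuracy, holdout accuracy.
fitting_graph = []
for n_bins in [1, 2, 4, 8, 16, 32, 64, 128]:
    model = fit_bins(train, n_bins)
    fitting_graph.append((n_bins, accuracy(model, train), accuracy(model, holdout)))

for n_bins, train_acc, holdout_acc in fitting_graph:
    print(f"bins={n_bins:4d}  train={train_acc:.3f}  holdout={holdout_acc:.3f}")
```

Because each bin count here refines the previous one, training accuracy can only stay flat or rise as complexity grows. The holdout column is the one to read when judging generalization.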
Cross-validation: more sophisticated than a single holdout split
Gives not only a simple estimate of the generalization performance
but also some statistics on the estimated performance, such as its mean and variance
Makes better use of a limited dataset
Uses multiple splits, systematically swapping out samples for testing
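The multiple-split idea above can be sketched as k-fold cross-validation. This is an illustrative sketch only: the synthetic data, the threshold classifier, the choice of k = 5, and the helper names (`k_fold_indices`, `cross_validate`, `fit_threshold`) are all assumptions, not taken from the chapter.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def accuracy(threshold, rows):
    """Accuracy of the rule 'predict 1 when x > threshold'."""
    return sum(int(x > threshold) == y for x, y in rows) / len(rows)

def fit_threshold(train):
    """Pick the candidate threshold with the best training accuracy."""
    candidates = [i / 10 for i in range(1, 10)]
    return max(candidates, key=lambda t: accuracy(t, train))

def cross_validate(rows, k, fit, score, seed=0):
    """Each fold is used once for testing; the other k-1 folds train the model."""
    scores = []
    for test_idx in k_fold_indices(len(rows), k, seed):
        held_out = set(test_idx)
        train = [r for i, r in enumerate(rows) if i not in held_out]
        test = [rows[i] for i in test_idx]
        scores.append(score(fit(train), test))
    return scores

# Assumed synthetic data: true label is x > 0.5, with 20% of labels flipped.
rng = random.Random(2)
data = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.2:
        y = 1 - y
    data.append((x, y))

scores = cross_validate(data, 5, fit_threshold, accuracy)
mean_score = statistics.mean(scores)
spread = statistics.stdev(scores)
print("fold accuracies:", [round(s, 3) for s in scores])
print("mean:", round(mean_score, 3), "std dev:", round(spread, 3))
```

Every row is tested exactly once and trained on k-1 times, which is why this makes better use of a small dataset than one holdout split. The spread across folds is the extra statistic a single split cannot give.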
Control the complexity of the models induced from the data
Incorporate explicit complexity penalties into the objective function used for modeling
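One way to picture a complexity penalty in the objective function, as a sketch under assumptions: a one-weight linear model fitted by squared error plus a ridge-style (L2) penalty `lam * w**2`. The penalty form, the toy data, and the names `penalized_loss` and `best_weight` are illustrative choices, not something the notes specify.

```python
def penalized_loss(w, xs, ys, lam):
    """Objective = fit error + lam * complexity penalty (here w squared)."""
    fit_error = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return fit_error + lam * w ** 2

def best_weight(xs, ys, lam):
    """Closed-form minimizer of penalized_loss for a single weight, no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

# Toy data, assumed for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# A larger lam puts more weight on simplicity and shrinks the fitted weight.
weights = {lam: best_weight(xs, ys, lam) for lam in [0.0, 1.0, 10.0, 100.0]}
for lam, w in weights.items():
    print(f"lam={lam:6.1f}  w={w:.4f}  objective={penalized_loss(w, xs, ys, lam):.4f}")
```

With `lam = 0` this is plain least squares. Raising `lam` trades training fit for a simpler model, and its value is usually chosen by checking performance on held-out data rather than on the training set.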