Overfitting & Avoidance Ch. 5
Overfitting
Finding chance occurrences in data that look like interesting patterns but don't generalize.
The tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data.
All models tend to overfit to some extent
The best strategy is to recognize overfitting and manage complexity in a principled way.
In tree induction
A procedure that grows trees until the leaves are pure tends to overfit.
Generalization
Property of a model or modeling process
model applies to data that were not used to build the model
Recognizing OF
Holdout data and fitting graphs
Fitting graph
Shows accuracy of model as a function of complexity
Holdout
Unseen data set aside; serves as a "lab test" of generalization performance
Know value of target, but not used to build model
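A minimal sketch of how holdout data and a fitting graph work together. Everything here is an assumption made for illustration: the synthetic one-feature dataset, the 20% label noise, and the toy histogram classifier whose number of bins stands in for model complexity. It is not a method taken from the chapter.

```python
import random


def make_data(n, noise, rng):
    """Synthetic data: the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data


def fit_histogram(train, bins):
    """Majority-vote classifier over `bins` equal-width intervals of [0, 1)."""
    counts = [[0, 0] for _ in range(bins)]
    for x, y in train:
        counts[int(x * bins)][y] += 1
    # Bins with no training points fall back to the overall majority class.
    default = int(2 * sum(y for _, y in train) >= len(train))
    model = []
    for n0, n1 in counts:
        model.append(default if n0 == 0 and n1 == 0 else int(n1 > n0))
    return model


def accuracy(model, data):
    bins = len(model)
    return sum(model[int(x * bins)] == y for x, y in data) / len(data)


rng = random.Random(0)
data = make_data(400, 0.2, rng)
# Holdout: target values are known, but these rows are never used for fitting.
train, holdout = data[:200], data[200:]

complexities = [1, 2, 4, 8, 16, 32, 64, 128, 256]
fitting_graph = []
for bins in complexities:
    model = fit_histogram(train, bins)
    fitting_graph.append((bins, accuracy(model, train), accuracy(model, holdout)))

for bins, train_acc, holdout_acc in fitting_graph:
    print(f"bins={bins:4d}  train={train_acc:.3f}  holdout={holdout_acc:.3f}")
```

Because each bin count refines the previous one, training accuracy can only stay level or rise as complexity grows. The holdout column is the one to watch: a widening gap between the two columns is the signature of overfitting.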
cross-validation
more sophisticated than holdout
gives not only a simple estimate of the generalization performance
but also statistics on that estimate, such as its mean and variance
makes for better use of a limited dataset
multiple splits, swapping out samples for testing
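The cross-validation notes above can be sketched as follows. This is a rough illustration under stated assumptions: the helper names (`k_fold_splits`, `fit_threshold`, `accuracy`), the synthetic data, and the single-threshold toy learner are all hypothetical and chosen only to keep the example short.

```python
import random
import statistics


def k_fold_splits(data, k):
    """Yield (train, test) pairs; every row is used for testing exactly once."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


def fit_threshold(train):
    """Toy learner: choose the cut on x with the best training accuracy."""
    best_cut, best_hits = 0.0, -1
    for cut, _ in train:
        hits = sum(int(x > cut) == y for x, y in train)
        if hits > best_hits:
            best_cut, best_hits = cut, hits
    return best_cut


def accuracy(cut, data):
    return sum(int(x > cut) == y for x, y in data) / len(data)


# Synthetic data (assumption): true label is x > 0.5 with 20% label noise.
rng = random.Random(1)
data = []
for _ in range(100):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.2:
        y = 1 - y
    data.append((x, y))

# One accuracy score per fold, each measured on rows the model never saw.
scores = [accuracy(fit_threshold(train), test)
          for train, test in k_fold_splits(data, 5)]

print("fold accuracies:", [round(s, 3) for s in scores])
print("mean:", round(statistics.mean(scores), 3))
print("std dev:", round(statistics.stdev(scores), 3))
```

Unlike a single holdout split, every row contributes to both training and testing across the folds, and the spread of the fold scores indicates how much the performance estimate varies.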
OF Avoidance
Control the complexity of the models induced from the data
Model Regularization
tree pruning
feature selection
Adding explicit complexity penalties to the objective function used for modeling
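One way to picture an explicit complexity penalty, sketched under assumptions: a one-weight linear model with squared-error loss and an L2 penalty (ridge-style). The data points, the function names, and the choice of L2 are illustrative only; the notes do not specify a particular penalty.

```python
def penalized_objective(w, data, lam):
    """Squared-error loss plus an explicit L2 complexity penalty on the weight."""
    loss = sum((y - w * x) ** 2 for x, y in data)
    return loss + lam * w ** 2


def fit_ridge_1d(data, lam):
    """Closed-form minimizer of penalized_objective for a single weight."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)


# Made-up (x, y) pairs, roughly following y = x.
data = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]

# lam controls how strongly complexity (here, weight size) is penalized.
weights = {lam: fit_ridge_1d(data, lam) for lam in (0.0, 1.0, 10.0, 100.0)}

for lam, w in weights.items():
    print(f"lambda={lam:6.1f}  w={w:.4f}  "
          f"objective={penalized_objective(w, data, lam):.4f}")
```

Larger values of `lam` pull the fitted weight toward zero, trading some fit to the training data for a simpler model. In practice the penalty strength would itself be chosen using holdout data or cross-validation.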