Provost - Chapter 5
Overfitting
Tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points
Why is overfitting bad?
An overfit tree, which fits the training data better, has worse generalization accuracy because the extraneous structure makes suboptimal predictions.
A plot of the generalization performance against the amount of training data is called a learning curve.
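A learning curve can be sketched in a few lines. The toy setup below (a hypothetical parity task and a simple memorizing classifier, not from the book) trains on ever-larger subsets of the data and measures accuracy on a fixed holdout set, producing the (training size, accuracy) points one would plot:

```python
# Sketch of a learning curve on a toy task (hypothetical setup, not from the book):
# train a simple memorizing classifier on growing training sets and track
# accuracy on a fixed holdout set.

def fit(train_pairs):
    memory = dict(train_pairs)
    return lambda x: memory.get(x, 0)       # default prediction: class 0

data = [(x, x % 2) for x in range(20)]      # toy task: predict parity of x
holdout = data                              # evaluate over the full domain
order = [7, 2, 15, 4, 11, 0, 18, 9, 6, 13,  # fixed "shuffled" ordering
         1, 16, 3, 10, 19, 5, 12, 8, 17, 14]

curve = []
for size in (5, 10, 15, 20):
    train_pairs = [data[i] for i in order[:size]]
    model = fit(train_pairs)
    acc = sum(model(x) == y for x, y in holdout) / len(holdout)
    curve.append((size, acc))

# Generalization performance improves (here: never degrades) with more data.
assert all(a <= b for (_, a), (_, b) in zip(curve, curve[1:]))
```

Plotting `curve` with training size on the x-axis and accuracy on the y-axis gives the learning curve described above.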
Cross-Validation
Cross-validation yields not only a simple estimate of generalization performance but also statistics on that estimate, such as the mean and variance across folds.
It computes its estimates over all the data by performing multiple splits and systematically swapping samples out for testing.
Its purpose is to use the original labeled data efficiently to estimate the performance of a modeling procedure.
Each fold is used for testing exactly once: every example ends up being used once for testing and k−1 times for training.
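The fold mechanics can be sketched in plain Python (a minimal hypothetical helper, not the book's code), with a sanity check that every example is tested exactly once and trained on k−1 times:

```python
# Minimal k-fold cross-validation split (hypothetical sketch).
from collections import Counter

def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for n examples split into k folds."""
    folds = [list(range(i, n, k)) for i in range(k)]   # round-robin assignment
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

# Sanity check with n=10, k=5: each index is tested once, trained on 4 times.
test_counts, train_counts = Counter(), Counter()
for train_idx, test_idx in kfold_indices(10, 5):
    test_counts.update(test_idx)
    train_counts.update(train_idx)
assert all(c == 1 for c in test_counts.values())
assert all(c == 4 for c in train_counts.values())
```

In practice, one would fit the model on each `train_idx` split and score it on the matching `test_idx`, then average the k scores.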
Generalization
E.g., building a model that predicts whether a customer will churn simply by looking the customer up among previously churned customers is a poor model: it will always say "no" for any customer it has not seen before.
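That failure mode can be made concrete with a toy "table model" (hypothetical customer IDs and labels, purely for illustration): it memorizes its training customers and falls back to "no" for anyone unseen, so it looks perfect on training data while generalizing badly.

```python
# Toy lookup-table model (hypothetical example): memorize each training
# customer and answer "no" for any customer not seen before.

def table_model(training):
    memory = dict(training)                  # customer_id -> churned? ("yes"/"no")
    return lambda cid: memory.get(cid, "no")

train = [("cust1", "yes"), ("cust2", "no"), ("cust3", "yes")]
predict = table_model(train)

# Perfect accuracy on the memorized training data...
assert all(predict(cid) == label for cid, label in train)

# ...but it says "no" for every previously unseen customer,
# including ones who actually churn.
assert predict("cust99") == "no"
```

This is the extreme end of overfitting: zero training error, no generalization.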