Overfitting & Its Avoidance
One of the most fundamental notions of data science is that of overfitting and generalization.
*) Table Approach:
memorizes the training data and performs no generalization.
When a previously unseen customer's contract is about to expire, we will apply the model. But the customer was not part of the historical dataset, so the lookup will fail and the model will predict a 0% likelihood of churning for this customer.
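A minimal sketch of the table approach, assuming a dictionary keyed by a hypothetical customer id (the ids, labels, and the function name table_model are illustrative, not from the original notes):

```python
# Hypothetical historical data: customer id -> churned (1) or stayed (0).
training_data = {"cust_001": 1, "cust_002": 0, "cust_003": 1}

def table_model(customer_id, table):
    """Predict churn likelihood by pure lookup; unseen customers get 0.0."""
    # A failed lookup falls back to 0.0, i.e. "0% likelihood of churning".
    return float(table.get(customer_id, 0.0))

# Perfect on the training data, because every answer was memorized...
train_accuracy = sum(
    table_model(cid, training_data) == label for cid, label in training_data.items()
) / len(training_data)

# ...but uninformative for a customer who was not in the historical dataset.
unseen_prediction = table_model("cust_999", training_data)
print(train_accuracy, unseen_prediction)  # 1.0 0.0
```

The perfect training accuracy says nothing about how the table behaves on new customers, which is the point of the example.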
We want models to apply not just to the exact training set but to the general population from which the training data came.
*) Generalization:
the property of a model or modeling process whereby the model applies to data that were not used to build it.
*) Overfitting:
finding chance occurrences in data that look like interesting patterns but do not generalize. It is the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points.
All data mining procedures have the tendency to overfit to some extent, some more than others.
The answer is not to use a data mining procedure that doesn't overfit, because all of them do. Nor is the answer simply to use models that produce less overfitting, because there is a fundamental trade-off between model complexity and the possibility of overfitting.
There is no single choice or procedure that will eliminate overfitting. The best strategy is to recognize overfitting and to manage complexity in a principled way.
Holdout Data & Fitting Graphs
*) Fitting Graph:
shows the accuracy of a model as a function of its complexity. To examine overfitting, the concept of holdout data is fundamental.
Evaluation on training data provides no assessment of how well the model generalizes to unseen cases.
We need to hold out some data for which we know the value of the target variable, but which will not be used to build the model.
Holdout data is like creating a "lab test" of generalization performance.
The model predicts the target values for the holdout data. We then estimate the generalization performance by comparing the predicted values with the hidden true values.
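The holdout procedure can be sketched as follows. The data, the 70/30 split, and the threshold "model" are all assumptions made for illustration; any model-building step could take their place.

```python
import random

def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(model(x) == y for x, y in examples) / len(examples)

# Hypothetical labeled data: one numeric feature x, label 1 when x > 0.5,
# with about 10% of the labels flipped to simulate noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

# Hold out 30% of the examples; they are never shown to the model builder.
rng.shuffle(data)
split = len(data) * 7 // 10
train, holdout = data[:split], data[split:]

def make_model(t):
    """Threshold classifier: predict 1 when x exceeds t."""
    return lambda x: int(x > t)

# "Model building" uses only the training portion: choose the threshold
# (from a small grid) that classifies the training examples best.
grid = [i / 20 for i in range(1, 20)]
best_t = max(grid, key=lambda t: accuracy(make_model(t), train))
model = make_model(best_t)

# Compare predictions with the hidden true values of the holdout set.
print("training accuracy:", accuracy(model, train))
print("holdout accuracy: ", accuracy(model, holdout))
```

Only the second number is an estimate of generalization performance; the first was optimized directly and is therefore optimistic.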
Differences Between Models
Training Set model
Based on the graph below, the training-data model is less complex than the generalization model, which means that it is less accurate.
Based on the graph above, this model is more complex and more accurate. However, this model is overfitting.
The accuracy of a model depends on how complex we allow it to be.
*) Base Rate:
the accuracy achieved by a classifier that always selects the majority class (a base rate classifier).
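As a small illustration, the base rate can be computed directly from the class labels. The label list here is hypothetical.

```python
from collections import Counter

def base_rate(labels):
    """Return the majority class and the accuracy of always predicting it."""
    majority_class, count = Counter(labels).most_common(1)[0]
    return majority_class, count / len(labels)

# Hypothetical churn labels: 1 = churned, 0 = stayed.
labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
print(base_rate(labels))  # (0, 0.7)
```

Any model worth keeping should beat this number on holdout data.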
Overfitting in Tree Induction
Tree induction repeatedly (recursively) applies its ability to find important, predictive individual attributes to smaller and smaller data subsets.
If we continue to split the data, eventually the subsets will become pure.
There may be multiple instances at a leaf, all with the same value for the target variable. If we have to, we can keep splitting on attributes and subdividing our data until we are left with a single instance at each leaf node, which is pure by definition.
What we have just built is a version of the lookup table, which is an example of overfitting.
What will be the accuracy on the training set?
It will be perfectly accurate, predicting correctly the class for every training instance.
This tree should be slightly better than the lookup table because every previously unseen instance will arrive at some classification, rather than just failing to match; the tree will give a nontrivial classification even for instances it has not seen before.
Tree-Structured Models
They are very flexible in what they can represent. However, representing a complicated pattern might require a very large tree. The complexity of the tree lies in the number of nodes.
In the graph:
Beginning at the left, the tree is very small and has poor performance. As it is allowed more and more nodes it improves rapidly, and both training-set accuracy and holdout-set accuracy improve.
But at some point the tree starts to overfit: it acquires details of the training set that are not characteristic of the population in general, as represented by the holdout set.
The overfitting starts when x = 100 nodes.
The holdout accuracy declines as the tree grows past its “sweet spot”; the data subsets at the leaves get smaller and smaller, and the model generalizes from fewer and fewer data.
This sweet spot represents the best trade-off between the extremes of (i) not splitting the data at all and simply using the average target value in the entire dataset, and (ii) building a complete tree out until the leaves are pure.
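The shape of such a fitting graph can be reproduced with a rough stand-in for tree induction. The sketch below is not a real tree learner: it assumes a single numeric feature and uses equal-width bins as "leaves", so the number of bins plays the role of the number of nodes. The synthetic data and all function names are assumptions.

```python
import random
from collections import Counter

def make_data(n, rng):
    """Hypothetical data: x in [0, 1), label 1 when x > 0.5, 20% label noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.2:
            y = 1 - y
        data.append((x, y))
    return data

def fit_bins(train, n_bins):
    """Piecewise-constant model: the majority training label in each bin."""
    overall = Counter(y for _, y in train).most_common(1)[0][0]
    votes = [Counter() for _ in range(n_bins)]
    for x, y in train:
        votes[min(int(x * n_bins), n_bins - 1)][y] += 1
    # Empty bins fall back to the overall majority class.
    return [v.most_common(1)[0][0] if v else overall for v in votes]

def accuracy(bin_labels, examples):
    """Fraction of examples whose bin label matches the true label."""
    n_bins = len(bin_labels)
    hits = sum(bin_labels[min(int(x * n_bins), n_bins - 1)] == y for x, y in examples)
    return hits / len(examples)

rng = random.Random(1)
train, holdout = make_data(100, rng), make_data(1000, rng)

# One row of the fitting graph per complexity level (1, 2, 4, ..., 256 bins).
fitting_graph = []
for n_bins in [2 ** k for k in range(9)]:
    model = fit_bins(train, n_bins)
    fitting_graph.append((n_bins, accuracy(model, train), accuracy(model, holdout)))

for n_bins, train_acc, holdout_acc in fitting_graph:
    print(f"{n_bins:4d} bins  train={train_acc:.2f}  holdout={holdout_acc:.2f}")
```

Because each doubling refines the previous bins, training accuracy can never go down as complexity grows. Holdout accuracy has no such guarantee, and with noisy labels it would typically peak at a small number of bins and then decline.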
Overfitting in Mathematical Functions
There are different ways to allow more or less complexity in mathematical functions.
One way mathematical functions can become more complex is by adding more variables:
F(x) = W_0 + W_1 X_1 + W_2 X_2 + W_3 X_3
If we add more X_i's, the function becomes more complex, because each X_i has a corresponding W_i, which is a learned parameter of the model.
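To make the parameter count concrete, here is a small sketch of the linear function above. The weights and attribute values are made up for illustration, and linear_model is a hypothetical helper name.

```python
def linear_model(weights, x):
    """F(x) = W_0 + W_1*X_1 + ... + W_n*X_n, with weights = [W_0, W_1, ..., W_n]."""
    if len(weights) != len(x) + 1:
        raise ValueError("need one weight per attribute plus the intercept W_0")
    return weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))

# Hypothetical weights and attribute values, chosen only for illustration.
three_attributes = linear_model([1.0, 2.0, -1.0, 0.5], [3.0, 4.0, 2.0])
print(three_attributes)  # 1 + 6 - 4 + 1 = 4.0

# Adding a fourth attribute X_4 adds a fourth learned weight W_4.
four_attributes = linear_model([1.0, 2.0, -1.0, 0.5, 3.0], [3.0, 4.0, 2.0, 1.0])
print(four_attributes)  # 4.0 + 3 * 1 = 7.0
```

Every extra attribute gives the fitting procedure one more weight to tune, which is where the additional leeway to fit the training set comes from.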
A dataset may end up with a very large number of attributes, and using all of them gives the modeling procedure much leeway to fit the training set.
Modelers often use the sort of holdout technique introduced above to assess the information in the individual attributes. Careful manual attribute selection is a wise practice in cases where considerable human effort can be spent on modeling, and where there are reasonably few attributes.
Why Is Overfitting Bad?
A model that only memorizes is useless because it always overfits and is incapable of generalizing. But technically this only demonstrates that overfitting hinders us from improving a model after a certain complexity.
Why does performance degrade?
As a model gets more complex it is allowed to pick up harmful spurious correlations. These correlations are idiosyncrasies of the specific training set used and do not represent characteristics of the population in general. The harm occurs when these spurious correlations produce incorrect generalizations in the model.
From Holdout Evaluation to Cross-Validation
While a holdout set will indeed give us an estimate of generalization performance, it is just a single estimate.
*) Cross-Validation:
a more sophisticated holdout training and testing procedure.
It computes statistics on the estimated performance, such as the mean and variance, so that we can understand how the performance is expected to vary across datasets.
It also makes better use of a limited dataset.
It computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing.
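A minimal sketch of k-fold cross-validation using only the standard library. The majority-class learner, the toy dataset, and the choice of five folds are assumptions for illustration; train_fn and score_fn are hypothetical hooks for whatever learner and metric are actually in use.

```python
import random
from statistics import mean, variance

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, k, train_fn, score_fn):
    """Return one score per fold; each fold is the test set exactly once."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except fold i, then test on fold i.
        train = [data[j] for f, fold in enumerate(folds) if f != i for j in fold]
        test = [data[j] for j in test_idx]
        scores.append(score_fn(train_fn(train), test))
    return scores

# Hypothetical learner: always predict the majority label of the training folds.
def train_majority(train):
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def score_majority(model, test):
    return sum(y == model for _, y in test) / len(test)

data = [(i, int(i % 3 == 0)) for i in range(30)]  # 10 ones, 20 zeros
scores = cross_validate(data, 5, train_majority, score_majority)
print("fold accuracies:", scores)
print("mean:", mean(scores), "variance:", variance(scores))
```

Every example is used for testing exactly once and for training k - 1 times, which is how the procedure gets several estimates out of one limited dataset.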
*) Learning Curve:
a plot of generalization performance against the amount of training data.
Learning curves are usually steep initially, as the modeling procedure finds the most apparent regularities in the dataset. Then, as the modeling procedure is allowed to train on larger and larger datasets, it finds more accurate models, but the gain from each additional batch of data gets smaller.
In some cases, the curve flattens out completely because the procedure can no longer improve accuracy even with more training data.
Differences between learning curves & fitting graphs
Learning curve: shows the generalization performance (the performance only on testing data), plotted against the amount of training data used.
Fitting graph: shows the generalization performance as well as the performance on the training data, but plotted against model complexity.
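Classification trees are a more flexible model representation than linear logistic regression. This means two things: for smaller datasets, tree induction will tend to overfit more; for larger datasets, its flexibility can be an advantage, because a tree can represent nonlinear relationships between the features and the target.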
The learning curve may show that generalization performance has leveled off so investing in more training data is probably not worthwhile; instead, one should accept the current performance or look for another way to improve the model, such as by devising better features.
Avoiding Overfitting with Tree Induction
The main problem with tree induction is that it will keep growing the tree to fit the training data until it creates pure leaf nodes. This will likely result in large, overly complex trees that overfit the data.
There are two techniques for avoiding overfitting in tree induction:
1) Stop growing the tree before it gets too complex.
2) Grow the tree until it is too large, and then prune it back to reduce its size.
*) Pruning:
means cutting off leaves and branches and replacing them with leaves.
One general idea is to estimate whether replacing a set of leaves or a branch with a leaf would reduce accuracy. If not, then go ahead and prune. The process can be iterated on progressive subtrees until any removal or replacement would reduce accuracy.
There are various methods for accomplishing both techniques.
The simplest way to limit tree size is to specify a minimum number of instances that must be present in a leaf.
Tree induction will automatically grow the tree branches that have a lot of data and cut short branches that have fewer data, thereby automatically adapting the model based on the data distribution.
What is the threshold?
There is no fixed number, although practitioners tend to have their own preferences based on experience.
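The effect of a minimum-leaf-size threshold can be sketched with a deliberately simplified tree grower. Real tree induction picks splits by a purity measure such as information gain; this sketch assumes a single numeric feature and always splits at the median, so only the stopping rule is being illustrated. The function name grow_tree and the data are hypothetical.

```python
def grow_tree(examples, min_leaf):
    """Grow a 1-D tree by median splits and return the list of leaves.

    examples: list of (x, label) pairs sorted by x.
    A node is split only if it is impure and both children would keep
    at least min_leaf instances.
    """
    labels = {y for _, y in examples}
    mid = len(examples) // 2
    if len(labels) <= 1 or mid < min_leaf or len(examples) - mid < min_leaf:
        return [examples]  # stop: this node becomes a leaf
    return grow_tree(examples[:mid], min_leaf) + grow_tree(examples[mid:], min_leaf)

# Hypothetical data: 16 instances with alternating labels, so no subset
# larger than one instance is ever pure.
data = [(i, i % 2) for i in range(16)]

for min_leaf in (1, 2, 4, 8):
    leaves = grow_tree(data, min_leaf)
    # Columns: threshold, number of leaves, size of the smallest leaf.
    print(min_leaf, len(leaves), min(len(leaf) for leaf in leaves))
```

With a threshold of 1 the tree grows until every leaf holds a single instance (16 leaves), which is the lookup table again; raising the threshold to 2, 4, and 8 gives 8, 4, and 2 leaves.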
General Method for Avoiding Overfitting
When there is a collection of models with different complexities, we could choose the best by estimating the generalization performance of each.
The test data should be strictly independent of model building so that we can get an independent estimate of model accuracy.
We might want to estimate the ultimate business performance, or to compare the best model from one family against the best model from another family.
To do so, we perform nested holdout testing:
Take the training set and split it again into a training subset and a testing subset. Then we can build models on this training subset and pick the best model based on this testing subset. Let’s call the former the sub-training set and the latter the validation set for clarity. The validation set is separate from the final test set, on which we are never going to make any modeling decisions.
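The nested holdout procedure just described can be sketched as follows. The binned model is a crude stand-in for any model family with a complexity setting, and the data, split sizes, and candidate complexities are assumptions for illustration.

```python
import random
from collections import Counter

def make_data(n, rng):
    """Hypothetical data: x in [0, 1), label 1 when x > 0.5, 20% label noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.2:
            y = 1 - y
        data.append((x, y))
    return data

def fit_bins(train, n_bins):
    """Majority label per equal-width bin; n_bins is the complexity setting."""
    overall = Counter(y for _, y in train).most_common(1)[0][0]
    votes = [Counter() for _ in range(n_bins)]
    for x, y in train:
        votes[min(int(x * n_bins), n_bins - 1)][y] += 1
    return [v.most_common(1)[0][0] if v else overall for v in votes]

def accuracy(bin_labels, examples):
    n_bins = len(bin_labels)
    hits = sum(bin_labels[min(int(x * n_bins), n_bins - 1)] == y for x, y in examples)
    return hits / len(examples)

rng = random.Random(3)
data = make_data(600, rng)
train, test = data[:400], data[400:]               # the test set is set aside first
sub_train, validation = train[:300], train[300:]   # nested split of the training set

# Choose the complexity using only the sub-training and validation sets.
candidates = [1, 2, 4, 8, 16, 32, 64]
best_bins = max(candidates, key=lambda b: accuracy(fit_bins(sub_train, b), validation))

# Rebuild at the chosen complexity on the whole training set, then evaluate
# once on the test set, which influenced no modeling decision.
final_model = fit_bins(train, best_bins)
print("chosen complexity:", best_bins)
print("test accuracy:", accuracy(final_model, test))
```

Rebuilding on the whole training set after the complexity has been chosen lets the final model use all the training data, while the test estimate stays independent of model building.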
Nested cross-validation is more complicated, but it works as you might suspect.
This idea of using the data to choose the complexity experimentally, as well as to build the resulting model, applies across different induction algorithms and different sorts of complexity.
Run with many different feature sets, using this sort of nested holdout procedure to pick the best.
Avoiding Overfitting for Parameter Optimization
The goal is to find the right balance between the fit to the data and the complexity of the model.
Instead of just optimizing the fit to the data, we optimize some combination of fit and simplicity. Models will be better if they fit the data better, but they also will be better if they are simpler.
*) Regularization:
tries to optimize not just the fit to the data, but a combination of fit to the data and simplicity of the model.
When a model involves numeric parameters W, fitting it means finding the set of parameters that maximizes some objective function indicating how well the model fits the data.
Complexity control via regularization works by adding to this objective function a penalty for complexity, weighted by λ: we maximize fit(x, W) − λ · penalty(W).
The λ term is simply a weight that determines how much importance the optimization procedure should place on the penalty, compared to the data fit. At this point, the modeler has to choose λ and the penalty function.
Common Penalties that can be applied
Sum of the squares of the weights
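To see how the weight on a sum-of-squares penalty controls complexity, here is a minimal sketch for a one-parameter model y ≈ w·x. Measuring fit as the negative sum of squared errors is an assumption made for this example, as are the data and the function name fit_ridge_1d.

```python
def fit_ridge_1d(xs, ys, lam):
    """Weight w maximizing  -(sum of squared errors) - lam * w**2  for y ~ w*x.

    Setting the derivative with respect to w to zero gives
    w = sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Hypothetical data lying roughly along y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(fit_ridge_1d(xs, ys, lam), 3))
```

With lam = 0 the result is the ordinary least-squares weight; as lam grows, the learned weight shrinks toward zero, trading fit for simplicity. A suitable lam would normally be chosen with a nested holdout or cross-validation procedure rather than by hand.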