Overfitting and Its Avoidance
Generalization
Predict well for instances that have not been observed yet
Table model
Memorizes the training data and performs no generalization
A model that fits historical data perfectly may still be useless in practice
The model applies to data that was not used to build it
We want models to apply not just to the exact training set but to the general population from which the training data was drawn
Overfitting
Finding chance occurrences that look like interesting patterns but do not generalize
Tendency of data mining procedures to tailor models to the training data
At the expense of generalization to previously unseen data points
Memorization is the most extreme form of overfitting
All data mining techniques tend to overfit, some more than others
If you look hard enough, you can always find patterns in a data set, even in pure noise (see the sketch below)
Fundamental trade-off between model complexity and the possibility of overfitting
Best strategy is to recognize overfitting and manage complexity in a principled way
More complex models may capture real complexities of the application
May therefore lead to higher accuracy
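As an illustration, here is a minimal sketch (assuming Python with NumPy and scikit-learn, on purely synthetic data) of how a flexible model can always find "patterns": a decision tree fit to random noise scores nearly perfectly on its training data but only around chance on holdout data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))        # 20 random attributes, no real signal
y = rng.integers(0, 2, size=1000)      # labels independent of X (pure noise)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0)  # grown until leaves are pure
tree.fit(X_tr, y_tr)

print("training accuracy:", tree.score(X_tr, y_tr))  # ~1.0 (memorization)
print("holdout accuracy: ", tree.score(X_te, y_te))  # ~0.5 (chance)
```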
How to recognize overfitting
Holdout Data
Must hold out data for which you know the value of the target variable but which will not be used to build the model
Not actual use data
Use data is data for which you would like to predict the value of the target variable, which is unknown
Lab test of generalization performance
Hide the target values of this data from the model (and perhaps also from the modelers)
The model then predicts those values
Generalization performance is estimated by comparing the predicted values with the true values
Holdout data used this way is called the "test set"
Accuracy on the training set is called "in-sample" accuracy
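A minimal sketch of the holdout procedure, assuming Python with scikit-learn and a synthetic dataset from make_classification as a stand-in for real data: the test set's target values play no role in fitting, and comparing predictions against them estimates generalization performance.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out 30% of the labeled data; it plays no role in building the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("in-sample accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test-set accuracy :", accuracy_score(y_test, model.predict(X_test)))
```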
Fitting Graphs
Accuracy of a model as a function of its complexity
Shows the difference between a modeling procedure's accuracy on the training data and the accuracy on the holdout data as model complexity changes
Generally more overfitting as model becomes more complex
Chance of overfitting increases as one allows the modeling procedure more flexibility in the models it can produce
With each new row of the training set memorized (as in the table model), training-set error decreases, but holdout error does not
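A minimal sketch of a fitting graph, assuming Python with scikit-learn and matplotlib on synthetic data, using tree depth as the complexity axis; the widening gap between the training and holdout curves is the signature of overfitting.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

depths = range(1, 21)                  # complexity axis: allowed tree depth
train_acc, test_acc = [], []
for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_tr, y_tr)
    train_acc.append(tree.score(X_tr, y_tr))
    test_acc.append(tree.score(X_te, y_te))

plt.plot(depths, train_acc, label="training accuracy")
plt.plot(depths, test_acc, label="holdout accuracy")
plt.xlabel("model complexity (max tree depth)")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```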
Mathematical Functions
Adding more variables/attributes can make these more complex
Modelers can move beyond a function that is strictly linear in the original attributes by adding new attributes that are nonlinear versions of the originals (e.g., squares or products of attributes)
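For example, a sketch using scikit-learn's PolynomialFeatures (one common way to do this; the data here is made up) to add squared and cross-product attributes, so that a model linear in the expanded attributes is nonlinear in the originals.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])               # two original attributes

poly = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = poly.fit_transform(X)       # adds x0^2, x0*x1, x1^2

print(poly.get_feature_names_out())      # ['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
print(X_expanded)
```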
Tree Induction
The tree should be slightly better than the lookup table because every previously unseen instance will arrive at some leaf and receive a classification
The tree gives a nontrivial classification even for instances it has not seen before
Useful to examine empirically how well the accuracy on the training data tends to correspond to the accuracy on the test data
A procedure that grows trees until the leaves are pure tends to overfit
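A minimal sketch of this tendency, assuming scikit-learn and synthetic data with injected label noise (flip_y): a tree grown until its leaves are pure memorizes the noise, while limiting growth via min_samples_leaf (one of several possible complexity controls) typically generalizes better.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2,
                           random_state=0)   # flip_y injects label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # pure leaves
limited = DecisionTreeClassifier(min_samples_leaf=20,
                                 random_state=0).fit(X_tr, y_tr)

for name, m in [("pure-leaf tree", full), ("limited tree  ", limited)]:
    print(name, "train:", round(m.score(X_tr, y_tr), 3),
          "holdout:", round(m.score(X_te, y_te), 3))
```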
Why is overfitting bad?
As a model gets more complex, it is allowed to pick up harmful spurious correlations
Every data set is a finite sample of a bigger population
There is no general analytic way to determine in advance whether a model has overfit or not
Avoiding Overfitting
Parameter optimization
5-fold cross-validation
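A minimal sketch of parameter optimization with 5-fold cross-validation, assuming scikit-learn; GridSearchCV and the max_depth grid are illustrative choices, not prescribed by the source.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8, 10, None]},
    cv=5,                       # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print("best max_depth:", search.best_params_["max_depth"])
print("mean CV accuracy:", round(search.best_score_, 3))
```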