Chapter 5: Overfitting and Its Avoidance
Context
Generalization
Table Model
Memorizes training data and performs no generalizations
A model that looked perfect could be completely useless in practice
Generalization = property of a model or modeling process, whereby the model applies to data that were not used to build the model
Want models that apply to the general population as well
Overfitting
Overfitting = tendency of data mining procedures to tailor models to training data
Done at expense of generalization to unseen data points
Recognize overfitting and manage complexity
Fitting graph = shows accuracy of a model as a function of complexity
X-axis measures complexity of model
Y-axis measures the error
Shows generalization performance as well as performance on the training data, both plotted against model complexity
Holdout data = data that is held out from initial model creation
Used to estimate generalization performance
Sometimes called "test set"
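A holdout split can be sketched in a few lines of stdlib Python (the function name, fraction, and seed here are illustrative, not from the text):

```python
import random

def holdout_split(data, holdout_frac=0.3, seed=0):
    """Shuffle the data, then hold out a fraction for estimating
    generalization performance; the rest is used for training."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_hold = int(len(shuffled) * holdout_frac)
    return shuffled[n_hold:], shuffled[:n_hold]  # (training set, holdout set)
```

The key property is that the two sets are disjoint, so the holdout estimate is not contaminated by data used to build the model.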
Greater complexity leads to greater overfitting
Adding more variables lets a model fit the training set more closely, increasing overfitting
Overfitting often causes models to become worse
Picked up idiosyncrasies of the data-set that do not represent the general population
The model "generalizes" from these idiosyncrasies, which hurts performance on new data
Affects all model types
No general way to determine overfitting level in advance
Tree Induction
Continue to split data and it will become pure
Procedure that grows trees until leaves are pure tends to overfit
Purity can also reflect noise in the data -> the tree ends up fitting idiosyncrasies rather than general patterns
Complexity of tree lies in # of nodes
Extremely flexible modeling
Avoiding Overfitting
Stop growing tree before it gets too complex
Grow tree until it is too complex, then prune the leaves, reducing size and complexity
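The first strategy above (stopping early) can be sketched with a tiny single-attribute tree grower; `max_depth` and `min_leaf` are the stopping criteria. All names are illustrative, and this is a minimal sketch assuming one numeric attribute:

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def grow_tree(xs, ys, max_depth=3, min_leaf=2):
    """Recursively split on the best threshold of a single numeric
    attribute; stop early (pre-pruning) when max_depth is reached,
    the node is pure, or a split would create a too-small leaf."""
    if max_depth == 0 or len(set(ys)) == 1:
        return majority(ys)                      # leaf node
    best = None
    for t in sorted(set(xs))[1:]:                # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        # impurity = count of minority-class examples on each side
        err = (len(left) - Counter(left).most_common(1)[0][1]) + \
              (len(right) - Counter(right).most_common(1)[0][1])
        if best is None or err < best[0]:
            best = (err, t)
    if best is None:                             # no allowed split: make a leaf
        return majority(ys)
    t = best[1]
    lx, ly = zip(*[(x, y) for x, y in zip(xs, ys) if x < t])
    rx, ry = zip(*[(x, y) for x, y in zip(xs, ys) if x >= t])
    return (t, grow_tree(list(lx), list(ly), max_depth - 1, min_leaf),
               grow_tree(list(rx), list(ry), max_depth - 1, min_leaf))

def predict(tree, x):
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x < t else right
    return tree
```

Lowering `max_depth` or raising `min_leaf` yields a smaller, less complex tree; the second strategy (post-pruning) instead grows the full tree and then collapses subtrees that do not improve holdout performance.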
Cross Validation
Sophisticated holdout testing procedure
Also estimates the variance across folds, which is critical for assessing confidence in the performance estimate
Estimates are computed over all data
Multiple splits and systematic swaps
k partitions called folds
Fold accuracies vary, so report their average together with the standard deviation
Typically k = 5 or 10
Each iteration creates one model and one estimate of generalization performance
Iterates training and testing k times, in a particular way
Can compute average and standard deviation
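The fold mechanics above can be sketched in stdlib Python. Function names are illustrative, and `fit`/`score` stand in for any model-building and evaluation routines:

```python
import statistics

def kfold_indices(n, k=5):
    """Yield (train_idx, test_idx) for k systematic splits: each
    example lands in exactly one test fold."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def cross_validate(xs, ys, fit, score, k=5):
    """Iterate training and testing k times; return the average and
    standard deviation of the k generalization-performance estimates."""
    accs = []
    for train, test in kfold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        accs.append(score(model, [xs[i] for i in test],
                                 [ys[i] for i in test]))
    return statistics.mean(accs), statistics.stdev(accs)
```

The same folds can be reused for two different learners (e.g., logistic regression and a classification tree) so that their fold accuracies are directly comparable.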
Learning Curves
All else equal, generalization performance of data-driven modeling generally improves as more training data become more available, to a point
Plot of this correlation is called a learning curve
Steep initially, then gradually less steep; may eventually flatten out
Generalization performance plotted against the amount of training data used
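The learning-curve idea can be sketched by training on progressively larger subsets and scoring each model on a fixed holdout set (the function names and the 30% holdout fraction are illustrative assumptions):

```python
import random

def learning_curve(xs, ys, fit, score, sizes, test_frac=0.3, seed=0):
    """Accuracy on a fixed holdout set as the training size grows:
    one generalization-performance estimate per training-set size."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_frac)
    test, pool = idx[:n_test], idx[n_test:]
    curve = []
    for n in sizes:
        train = pool[:n]                          # first n training examples
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        curve.append(score(model, [xs[i] for i in test],
                                  [ys[i] for i in test]))
    return curve
```

Plotting `curve` against `sizes` gives the learning curve: typically steep at first, flattening once additional data stops helping.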
Model Regularization
Method for reining in complexity to avoid overfitting, typically by adding a penalty for complexity to the objective being optimized
Business Application
Modeling laboratory
Expensive but typically very worth it
Work to understand actual use scenarios so as to make lab setting as true as possible
Data as an asset
Once the learning curve flattens, investments in additional training data are likely not worthwhile
Sidenotes
Fundamental trade-off between model complexity and possibility of overfitting
Recognize overfitting and manage complexity
Accuracy of a model depends on complexity
Greater complexity leads to greater overfitting
Mathematical functions can become more complex through addition of more variables
Adding nonlinear versions of attribute variables (e.g., squared terms) lets a model fit patterns beyond a truly linear one, increasing complexity
Mistrust any performance measurement done on the training set
Because overfitting is a very real possibility
Compare fold accuracies of logistic regression and classification trees
All else equal, generalization performance of data-driven modeling generally improves as more training data become more available, to a point
Test data should be independent of model building