Chapter 5: Overfitting and Its Avoidance
Generalization
Table Model
Memorizes the training data and performs no generalization
Fundamental Concepts of Data Science
Generalization:
Property of a model or modeling process by which the model applies to data that were not used to build the model
Overfitting
Tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points
Tempting solution? Build a more complex model; it should better capture the real complexities of the application and be more accurate
But all data mining procedures tend to overfit the data, so added complexity is no cure in itself
Overfitting Examined
Fitting Graph: Shows the accuracy of a model as a function of complexity
How do we examine it?
Holdout data: data not used to build the model
Accuracy on holdout data estimates the model's generalization performance
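A minimal sketch of how a fitting graph can be produced, assuming scikit-learn and a synthetic dataset (both are illustrative choices, not from these notes): sweep model complexity and record training vs. holdout accuracy.

```python
# Fitting-graph sketch: accuracy as a function of complexity (here, tree depth).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

for depth in range(1, 16):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth,
          tree.score(X_train, y_train),  # training accuracy keeps rising
          tree.score(X_hold, y_hold))    # holdout accuracy peaks, then falls
```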
Overfitting in Tree Induction
Tree-structured models are very flexible, so tree induction can keep adding splits until it fits the training data arbitrarily well
Why is Overfitting bad?
As a model gets more complex, it is allowed to pick up harmful spurious correlations
These spurious correlations produce incorrect generalizations in the model, causing performance on unseen data to decline
Overfitting in Mathematical Functions
The equation can be made more complex by adding more variables (attributes)
A dataset may end up with a large number of attributes; careful manual attribute selection helps
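A hedged sketch of the same idea for a mathematical function, using only NumPy (the true function, noise level, and degrees are invented for illustration): adding higher-degree terms keeps lowering training error, while holdout error eventually worsens.

```python
# Polynomial overfitting sketch: more terms lower training error,
# but holdout error eventually rises.
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: 1.5 * x + 0.5            # the simple, true relationship
x_train = rng.uniform(-1, 1, 20)
x_hold = rng.uniform(-1, 1, 20)
y_train = true_f(x_train) + rng.normal(0, 0.2, 20)
y_hold = true_f(x_hold) + rng.normal(0, 0.2, 20)

for degree in (1, 3, 6, 9):
    coef = np.polyfit(x_train, y_train, degree)  # add more polynomial terms
    mse = lambda x, y: np.mean((np.polyval(coef, x) - y) ** 2)
    print(degree, mse(x_train, y_train), mse(x_hold, y_hold))
```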
From Holdout Evaluation to Cross-Validation
Cross-validation: a more sophisticated holdout training and testing procedure
Not only gives us a simple estimate of the generalization performance, but also some statistics on the estimated performance, such as mean and variance
Variance: critical for assessing confidence in the performance estimate
Makes better use of a limited dataset
Computes its estimates over all the data by performing multiple splits and systematically swapping samples for testing
Begins by splitting a labeled dataset into k partitions called folds
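A minimal k-fold cross-validation sketch, assuming scikit-learn and a synthetic stand-in for a labeled dataset:

```python
# 10-fold cross-validation: each fold serves once as the test set
# while the remaining folds are used for training.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())  # mean and spread of the performance estimate
```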
The Churn Dataset Revisited
Do we trust this number?
Overfitting is a possibility
Next step: analyze the average accuracy across the folds for classification trees
Then, compare the fold accuracies between logistic regression and classification trees
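A sketch of that comparison; `churn.csv` and its `churn` label column are hypothetical placeholders for whatever form the churn dataset actually takes.

```python
# Compare fold-by-fold cross-validation accuracies of two model types.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("churn.csv")                  # hypothetical file name
X, y = df.drop(columns="churn"), df["churn"]   # hypothetical label column

models = [("logistic regression", LogisticRegression(max_iter=1000)),
          ("classification tree", DecisionTreeClassifier(random_state=0))]
for name, model in models:
    scores = cross_val_score(model, X, y, cv=10)
    print(name, scores, scores.mean())  # fold accuracies, then the average
```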
Learning Curves
A plot of the generalization performance against the amount of training data
The marginal advantage of having more data decreases, so the learning curve becomes less steep
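A minimal learning-curve sketch with scikit-learn's `learning_curve` helper, on a synthetic dataset:

```python
# Generalization accuracy as a function of training-set size.
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)
sizes, _, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=[0.1, 0.25, 0.5, 0.75, 1.0], cv=5)

for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(n, score)  # climbs steeply at first, then flattens out
```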
Fitting Graph vs Learning Curve
Learning curve: shows the generalization performance (the performance only on testing data), plotted against the amount of training data used
Fitting graph: shows the generalization performance as well as the performance on the training data, but plotted against model complexity
Overfitting Avoidance and Complexity Control
To avoid overfitting, we control the complexity of the models induced from the data
1st step: Examine complexity control in tree induction
Avoiding Overfitting with Tree Induction
Main Problem: Tree induction will keep growing the tree to fit the training data until it creates pure leaf nodes
Tree Induction Techniques to Avoid Overfitting
(i) To stop growing the tree before it gets too complex
(ii) To grow the tree until it is too large, then "prune" it back, reducing its size (and therefore its complexity); both techniques are sketched below
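Both techniques can be sketched with scikit-learn's tree parameters (the thresholds below are illustrative assumptions, not recommendations): `min_samples_leaf` stops growth early, while `ccp_alpha` grows the tree and then prunes it back via cost-complexity pruning.

```python
# (i) Pre-pruning: stop splitting once leaves would get too small.
# (ii) Post-pruning: grow fully, then cut back with cost-complexity pruning.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

stopped = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Both controlled trees end up far smaller than the fully grown one.
print(full.get_n_leaves(), stopped.get_n_leaves(), pruned.get_n_leaves())
```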
A General Method for Avoiding Overfitting
Nested Holdout Testing
Build models on a training subset and pick the best model based on a separate testing subset; the former is called the subtraining set and the latter the validation set
Validation Set: separate from the final test set, on which we are never going to make any modeling decisions
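A minimal nested-holdout sketch on synthetic data: the complexity choice is made on the validation set, and the untouched test set is used exactly once at the end.

```python
# Nested holdout: the training data is split again into subtraining + validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_sub, X_val, y_sub, y_val = train_test_split(X_train, y_train, random_state=0)

# Pick the best complexity (tree depth) using only the validation set.
best_depth = max(
    range(1, 12),
    key=lambda d: DecisionTreeClassifier(max_depth=d, random_state=0)
                  .fit(X_sub, y_sub).score(X_val, y_val))

# Retrain on the full training set; touch the test set only once, at the end.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final.fit(X_train, y_train)
print(best_depth, final.score(X_test, y_test))
```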
Sequential Forward Selection (SFS): uses nested holdout testing to add attributes one at a time, keeping an attribute only when it improves validation performance
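A hedged sketch of SFS built on the nested holdout idea (the feature indexing and the stopping rule are illustrative assumptions):

```python
# Sequential forward selection: greedily add the attribute that most
# improves validation accuracy; stop when nothing improves it.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_sub, X_val, y_sub, y_val = train_test_split(X, y, random_state=0)

def val_score(cols):
    model = DecisionTreeClassifier(random_state=0).fit(X_sub[:, cols], y_sub)
    return model.score(X_val[:, cols], y_val)

selected, best_score = [], 0.0
while True:
    candidates = [c for c in range(X.shape[1]) if c not in selected]
    if not candidates:
        break
    score, col = max((val_score(selected + [c]), c) for c in candidates)
    if score <= best_score:   # no remaining attribute improves validation accuracy
        break
    selected.append(col)
    best_score = score

print(selected, best_score)
```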