Ch 5: Overfitting and Its Avoidance
Generalization
Table model
Memorizes training data and performs no generalization
Achieves 100% accuracy on the training examples
If customer is not part of historical data set, model will fail
Data mining did not create a model that generalized beyond the training data
property of a model that applies to data that was not used to build the model
Overfitting
the tendency of data mining procedures to tailor models to the training data
Holdout Data and Fitting Graphs
Fitting Graph
shows the accuracy of a model as a function of complexity
Need to "hold out" some data for which we know the value of the target variable
Won't be used to build the model
Used to estimate generalization performance
Compare the predicted values against the hidden true values
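The holdout idea above can be sketched in a few lines. This is a minimal illustration, not the book's code: `holdout_split` and `accuracy` are hypothetical helper names, and the "table model" is the memorizing model described earlier, which fails on customers it has never seen.

```python
import random

def holdout_split(rows, test_frac=0.3, seed=0):
    """Shuffle the data and hold out a fraction for evaluation."""
    rng = random.Random(seed)
    rows = rows[:]
    rng.shuffle(rows)
    n_test = int(len(rows) * test_frac)
    return rows[n_test:], rows[:n_test]      # (training data, holdout data)

def accuracy(model, holdout):
    """Compare predictions against the hidden true labels."""
    return sum(model(x) == y for x, y in holdout) / len(holdout)

# Toy data: (feature, label). A "table model" memorizes the training
# pairs and has no answer for unseen examples.
data = [(i, i % 2) for i in range(100)]
train, test = holdout_split(data)
table = dict(train)
table_model = lambda x: table.get(x, 0)      # unseen feature -> blind guess
true_model = lambda x: x % 2                 # the actual regularity

print(accuracy(table_model, test))           # poor: no generalization
print(accuracy(true_model, test))            # perfect on unseen data
```

On the training data the table model is 100% accurate; the holdout set exposes that it learned nothing that generalizes.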
Recursive partitioning of the data is done for tree induction
Overfitting in Tree Induction
Overfit if you keep splitting on attributes
tree will give a nontrivial classification even for unseen instances
Procedure that grows trees until the leaves are pure tends to overfit
Overfitting in Mathematical Functions
As you increase the dimensionality, you can perfectly fit larger sets of arbitrary points
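A quick numerical sketch of that point: with as many parameters as data points, a polynomial can memorize any set of targets, however arbitrary. The specific points below are made up for illustration.

```python
import numpy as np

# Five arbitrary points: a polynomial with five parameters (degree 4)
# has enough dimensions to pass through all of them exactly.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, -2.0, 7.5, 0.4, 9.9])   # arbitrary target values

coeffs = np.polyfit(x, y, deg=len(x) - 1)  # a perfect fit, i.e. memorization
residual = np.max(np.abs(np.polyval(coeffs, x) - y))
print(residual)                            # effectively zero
```

A perfect fit to arbitrary points is exactly what we do not want: the curve says nothing about points between or beyond the data.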
Example: Why is overfitting bad?
As a model gets more complex it is allowed to pick up harmful correlations
Harmful when the correlations produce incorrect generalizations
From Holdout Evaluation to Cross-Validation
A holdout set gives just a single estimate of generalization performance
Cross-validation: still a simple estimate of generalization performance, but with statistics (e.g., mean and variance across folds)
makes better use of limited data set
computes estimates over all the data
Split the data into k partitions called folds
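The fold procedure can be sketched as below. This is a minimal version under stated assumptions: `fit` and `score` are caller-supplied hypothetical functions, and folds are formed by striding through the shuffled data rather than by any particular library's scheme.

```python
import random

def cross_validate(rows, k, fit, score, seed=0):
    """k-fold cross-validation: shuffle, split into k folds, train on
    k-1 folds, evaluate on the held-out fold, and repeat k times."""
    rng = random.Random(seed)
    rows = rows[:]
    rng.shuffle(rows)
    folds = [rows[i::k] for i in range(k)]          # k near-equal partitions
    scores = []
    for i in range(k):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        scores.append(score(fit(train), folds[i]))  # one estimate per fold
    return scores              # every example was tested exactly once

# Toy usage: a trivially correct model, just to show the plumbing.
data = [(x, x % 2) for x in range(50)]
fit = lambda train: (lambda x: x % 2)
score = lambda m, test: sum(m(x) == y for x, y in test) / len(test)
print(cross_validate(data, 10, fit, score))         # ten per-fold accuracies
```

Because each example lands in exactly one test fold, the k scores together cover all the data, which is what lets us report a mean and variance rather than a single number.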
Churn Data Set Revisited
data set was shuffled and then divided into ten partitions
analyze fold performances
Learning Curves
plot of the generalization performance against the amount of training data
Steep initially as the model finds the most apparent regularities
Curve then begins to flatten out, think diminishing marginal utility
Different than fitting graph which plots against complexity
For smaller data, tree induction will overfit more
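The learning-curve shape can be seen with a toy experiment: train the same learner on growing slices of the data and score each on a fixed holdout set. The binned-majority learner and the helper name `fit_bins` here are made up for illustration.

```python
import random
from collections import Counter, defaultdict

def fit_bins(train):
    """Learn the majority label per feature bin. With little data many
    bins are poorly estimated, so accuracy climbs as data is added."""
    by_bin = defaultdict(Counter)
    for x, y in train:
        by_bin[x // 10][y] += 1
    return lambda x: (by_bin[x // 10].most_common(1)[0][0]
                      if by_bin[x // 10] else 0)   # empty bin: blind guess

rng = random.Random(0)
data = [(x, 1 if x >= 50 else 0) for x in rng.choices(range(100), k=500)]
train, holdout = data[:400], data[400:]

for n in (10, 50, 400):                  # growing training-set sizes
    model = fit_bins(train[:n])
    acc = sum(model(x) == y for x, y in holdout) / len(holdout)
    print(n, round(acc, 2))
```

The printed accuracies rise steeply at first and then flatten, the diminishing-marginal-utility shape described above; note the x-axis here is the amount of training data, not model complexity as in a fitting graph.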
Complexity Control
Tree Induction
problem is that tree induction will keep growing the tree to get pure leaf nodes
creates very complex models
Two techniques to avoid overfitting
stop growing the tree before it becomes too complex
specify a minimum number of instances that must be present in a leaf
adapts model based on data distribution
hypothesis test at every leaf
Grow the tree until it is too large, then prune it back
Estimate whether replacing a branch with a leaf would decrease accuracy; if not, prune it