Ch. 5
Overfitting
model tailored too closely to the training data
degrades generalization performance on new data
Recognize overfitting
fitting graph
shows model accuracy as a function of model complexity
plots complexity vs. error on training and holdout data
holdout data
not for building the model
"test set"
tree induction
growing the tree until every leaf is pure
= overfit because too complex
too many nodes
find sweet spot
where holdout accuracy peaks, before it diverges from training accuracy
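The fitting-graph idea above can be sketched with scikit-learn (a sketch on a synthetic dataset; the dataset, depths, and seeds are all illustrative, not from the book): hold out a test set, grow trees of increasing depth, and compare training vs. holdout accuracy to locate the sweet spot.

```python
# Fitting-graph sketch: training vs. holdout accuracy as tree complexity grows.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
# Holdout ("test set") plays no part in building the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 2, 4, 8, None):  # None = grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth, round(tree.score(X_train, y_train), 3),
          round(tree.score(X_test, y_test), 3))
# Training accuracy keeps rising with depth; holdout accuracy typically
# peaks and then falls off as the tree starts to overfit.
```

The fully grown tree (max_depth=None) fits the training data perfectly while scoring worse on the holdout, which is exactly the gap the fitting graph makes visible.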
Mathematical functions
adding dimensions and variables increases model flexibility
with enough attributes the model can fit the training data perfectly
must carefully choose attributes to avoid overfitting
Linear Functions
Support Vector Machine
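A minimal numerical illustration of the "more parameters = perfect fit" point (pure NumPy; the data and degree are arbitrary): a polynomial with as many coefficients as data points interpolates the training data exactly, noise included.

```python
# With enough parameters a function can fit any training set exactly:
# a degree-(n-1) polynomial interpolates n points, noise and all.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
y = x + rng.normal(scale=0.1, size=8)   # noisy linear relationship

coeffs = np.polyfit(x, y, deg=7)        # 8 parameters for 8 points
residuals = np.polyval(coeffs, x) - y
print(np.max(np.abs(residuals)))        # ~0: a "perfect" (overfit) fit
```

The true relationship here is linear; the degree-7 fit has memorized the noise, which is why perfect training fit is a warning sign, not a goal.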
Avoid overfitting
control complexity of models
tree induction
keeps growing the tree until it fits the training data
hence overfitting
Avoid overfitting
stop growing tree
require a minimum number of instances per leaf
grow until too complex then "prune" it
ensure not to reduce accuracy while pruning
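Both strategies above map directly onto scikit-learn's tree parameters (a sketch; the parameter values are illustrative, not tuned): `min_samples_leaf` stops growth early, while `ccp_alpha` grows the tree and then applies minimal cost-complexity pruning.

```python
# Two ways DecisionTreeClassifier controls tree complexity.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Baseline: grow the tree until the leaves are pure.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# 1) Stop growing early: require a minimum number of instances per leaf.
stopped = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)

# 2) Grow fully, then "prune": cost-complexity pruning via ccp_alpha.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

print(full.get_n_leaves(), stopped.get_n_leaves(), pruned.get_n_leaves())
```

Both controlled trees end up with far fewer leaves than the fully grown one; whether accuracy survives the pruning is what the holdout check is for.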
General method
nested holdout
hold out a second set from the training data to pick complexity, then rebuild the model using the entire training set
complexity control
optimize some combo of fit and simplicity
called regularization
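Regularization in practice, sketched with scikit-learn's logistic regression (illustrative data and values): the objective trades off fit against coefficient size, and `C` is the inverse regularization strength, so smaller `C` means a stronger push toward simplicity.

```python
# Regularization sketch: smaller C = stronger penalty = smaller weights.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

for C in (100.0, 1.0, 0.01):
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # Total coefficient magnitude shrinks as the penalty grows.
    print(C, round(np.abs(model.coef_).sum(), 2))
```

Choosing `C` itself is where the nested holdout (or cross-validation) comes in: the penalty strength is picked on data the final evaluation never sees.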
Learning Curves - Analytical tool
performance vs. amount of training data
shows generalization performance
performance on holdout/test data against amount of training data
steep initially
steep while the model still has much to learn from new data
more training data = curve flattens (diminishing returns)
Versus Fitting graph
shows generalization against complexity
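scikit-learn has a helper that computes exactly this curve (a sketch; the estimator, sizes, and seed are illustrative): cross-validated generalization performance at increasing training-set sizes.

```python
# Learning-curve sketch: holdout performance vs. training-set size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    shuffle=True, random_state=0)

print(sizes)                       # increasing amounts of training data
print(test_scores.mean(axis=1))    # generalization accuracy at each size
```

Note the contrast with the fitting graph: here the x-axis is data volume at fixed model family, not model complexity at fixed data.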
Accuracy varies based on data
data size
large data size
tree induction more accurate
smaller data size
logistic regression more accurate
(not always)
logistic regression is less flexible
so it tends to overfit less
tree induction is more flexible and will overfit more
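A hedged experiment for the data-size claim (synthetic data, illustrative sizes; as the note says, the pattern holds often but not always): compare cross-validated accuracy of the two model families on a small and a large sample of the same dataset.

```python
# Tree induction vs. logistic regression at two training-set sizes.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

for n in (100, 5000):
    lr = cross_val_score(LogisticRegression(max_iter=1000),
                         X[:n], y[:n], cv=5).mean()
    dt = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X[:n], y[:n], cv=5).mean()
    print(n, round(lr, 3), round(dt, 3))
```

No expected winner is printed on purpose: the ranking depends on the dataset, which is the chapter's point.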
Holdout Data
Cross-Validation
estimates generalization performance
better form of holdout
estimates over all the data
iterates through data
"Folds" made from original dataset
(k-1)/k for training
1/k for testing
Each iteration = model
calculate Variance
understand variance across data sets
assess confidence in the performance estimate
generates multiple performance measures
tells average model behavior
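The whole cross-validation procedure in a few lines of scikit-learn (a sketch; dataset, estimator, and k are illustrative): k folds yield k models and k performance measures, whose mean and standard deviation summarize average behavior and its variance across data splits.

```python
# k-fold cross-validation sketch: k models, k performance measures.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

k = 10  # each fold: (k-1)/k of the data for training, 1/k for testing
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=k)

print(len(scores))                            # one accuracy per fold
print(scores.mean().round(3), scores.std().round(3))  # average and spread
```

A single holdout split gives one number; the fold-to-fold spread here is what lets you judge how much to trust that number.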