Chapter 5: Overfitting and Its Avoidance
Overfitting
tendency of
data mining procedures
to tailor models
to training data
all models do this
to some extent
memorization
most extreme overfitting
trade-off
between
model complexity
and possible overfitting
no procedure
can completely eliminate
overfitting
must understand
how to recognize
overfitting
before trying to
eliminate it
Holdouts and Fitting
Fitting Graph
shows
model accuracy
as a function
of complexity
Training Data
cannot
assess model accuracy
for unseen cases
Generalization Performance
compares
predicted values
to hidden true values
Holdout Data
data that is
hidden from model
to test model accuracy
often called
test set
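The holdout idea above can be sketched with scikit-learn, tracing a fitting graph of accuracy versus complexity; the synthetic dataset, the 30% split, and the chosen depths are illustrative assumptions, not from the text:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic example data (sizes and parameters are illustrative).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hold out 30% of the data: the model never sees it during training,
# so its score on this "test set" estimates generalization performance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Trace the fitting graph: accuracy as a function of complexity (tree depth).
fitting_graph = []
for depth in (1, 3, 5, 10, None):  # None = grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    fitting_graph.append(
        (depth, tree.score(X_train, y_train), tree.score(X_test, y_test)))

for depth, train_acc, test_acc in fitting_graph:
    print(depth, round(train_acc, 3), round(test_acc, 3))
```

Training accuracy keeps rising with depth while holdout accuracy eventually falls away from it; that gap is the overfitting the fitting graph makes visible.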
Models
Tree Induction
can be overfit
by splitting data
too many times
leaving leaves that are pure
but based on very few cases
still may generalize
because the model will
arrive at some classification
for every new case
complexity lies
in number of nodes
"sweet spot"
point of complexity where
holdout accuracy peaks
just before overfitting begins
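A small sketch of the memorization extreme for tree induction, again with scikit-learn on made-up data (dataset and seed are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; any dataset with continuous features behaves similarly.
X, y = make_classification(n_samples=200, n_features=5, random_state=1)

# With no limits, tree induction keeps splitting until every leaf is pure,
# so the tree memorizes its training data.
tree = DecisionTreeClassifier(random_state=1).fit(X, y)

# Complexity lies in the number of nodes.
print("nodes:", tree.tree_.node_count)
print("training accuracy:", tree.score(X, y))
```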
Mathematical Functions
complexity can be controlled
by
adding more variables
More complex
adding more attributes
More complex
often have to prune attributes
to reduce overfitting
Increasing dimensionality
More complex
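For mathematical functions, the effect of adding variables can be sketched with polynomial regression; the quadratic toy data and the degrees tried are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical 1-D regression data: a quadratic trend plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(30, 1))
y = x[:, 0] ** 2 + rng.normal(0, 0.1, size=30)

# Adding variables (here, higher-degree polynomial terms) makes the function
# more complex: the fit to *training* data only improves, even past the
# true degree, because the extra terms start fitting the noise.
train_r2 = {}
for degree in (1, 2, 10):
    Xp = PolynomialFeatures(degree).fit_transform(x)
    train_r2[degree] = LinearRegression().fit(Xp, y).score(Xp, y)
    print(degree, round(train_r2[degree], 3))
```

This is why attributes often have to be pruned: training fit alone will always reward the more complex function.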
Avoiding Overfitting
Tree Induction
problem is
model keeps growing
until leaves
are pure
two strategies
stop growing tree
before it's too complex
grow tree too large
then prune back unnecessary branches
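Both strategies exist as parameters in scikit-learn's tree learner; this sketch compares them against an unrestricted tree (the dataset and the specific parameter values are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and parameter values.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Baseline: a tree grown until all leaves are pure.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Strategy 1: stop growing before the tree is too complex
# (here, refuse splits that would leave fewer than 20 cases in a leaf).
stopped = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)

# Strategy 2: grow the tree too large, then trim it back
# (cost-complexity post-pruning via ccp_alpha).
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

print(full.tree_.node_count, stopped.tree_.node_count,
      pruned.tree_.node_count)
```

Either way, the resulting tree has fewer nodes, i.e. less complexity, than the fully grown one.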
Generally
Nested Training Set
can estimate
generalization performance
by further splitting the training set
validation set
to test candidate models
sub-training set
to build candidate models
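A minimal sketch of the nested split, assuming scikit-learn and synthetic data (split fractions and candidate depths are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# First split: hold out a final test set, untouched until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Second (nested) split: carve a validation set out of the training set.
X_sub, X_val, y_sub, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

# Build candidates on the sub-training set; pick complexity on the
# validation set; only then score once on the held-out test set.
best = max((1, 2, 3, 5, 8, None),
           key=lambda d: DecisionTreeClassifier(max_depth=d, random_state=0)
                         .fit(X_sub, y_sub).score(X_val, y_val))
final = DecisionTreeClassifier(max_depth=best, random_state=0)
final.fit(X_train, y_train)
print(best, final.score(X_test, y_test))
```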
Cross Validation
analyzes
performance across
multiple splits of the same data
estimates
mean
variance
critical for assessing
confidence in performance
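Cross-validation is one call in scikit-learn; this sketch reports the mean and the spread across folds (dataset, depth, and fold count are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 10-fold cross-validation: train and test on 10 different splits,
# giving 10 independent-looking estimates of accuracy.
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=4, random_state=0), X, y, cv=10)

# The mean estimates performance; the standard deviation tells us
# how much confidence to place in that estimate.
print("mean:", round(scores.mean(), 3))
print("std: ", round(scores.std(), 3))
```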