Data Science for Business
By: Provost & Fawcett
CH 5: "Overfitting and Its Avoidance"
Fundamental Concepts
Fundamental notions of data science: overfitting and generalization
Finding patterns in datasets
Complexity control
Generalization
Finding patterns that generalize, i.e., predict well for data that has not yet been observed
Bad example: the lookup-table model
Fails to generalize to any data beyond what was used to build it
Fits the training data too perfectly (overfits)
Overfitting Examined
Holdout Data and Fitting Graphs
Fitting graph: shows the accuracy of a model as a function of model complexity
"Hold out" some data for which we know the value of the target variable, but which is not used to build the model
Used in order to examine overfitting
The bigger the gap between accuracy on the training data and accuracy on the holdout data, the more the model has memorized rather than generalized
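A minimal sketch of holdout evaluation in code (illustrative only; the synthetic dataset and scikit-learn calls are my assumptions, not the book's):
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out 30% of the data: we know its target values, but the model
# never sees it during training.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("training accuracy:", model.score(X_train, y_train))
print("holdout accuracy: ", model.score(X_hold, y_hold))
# A large gap between the two scores signals memorization (overfitting).
```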
Overfitting in Tree Induction
A fully grown tree does only slightly better than a lookup table, because it at least makes nontrivial classifications for unseen instances
From the fitting graph, examine overfitting as a function of the number of nodes, then restrict the tree's size accordingly
Trees are very flexible in what they can represent
A tree's complexity lies in its number of nodes
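A fitting-graph sketch for tree induction (assuming scikit-learn; using max_leaf_nodes as the complexity knob is my choice for illustration):
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_informative=5, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

for n_leaves in (2, 4, 8, 16, 32, 64, 128, 256):
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_train, y_train)
    # Training accuracy keeps rising with complexity; holdout accuracy
    # typically rises, then falls once the tree starts to overfit.
    print(n_leaves, tree.score(X_train, y_train), tree.score(X_hold, y_hold))
```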
Overfitting in Mathematical Functions
Adding more variables makes the function more complex
The separating boundary doesn't have to be just a straight line
The function may assign different weights to different variables
Complexity also grows by adding new attributes that are nonlinear versions of the original ones
Modelers carefully prune the attributes in order to avoid overfitting
Careful manual attribute selection is a wise practice
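A sketch of adding nonlinear versions of the original attributes (assuming scikit-learn; the squared and interaction terms are illustrative, not from the book):
```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

# A plain linear boundary cannot separate concentric circles...
linear = LogisticRegression().fit(X_train, y_train)
# ...but adding squared terms (x1^2, x2^2, x1*x2) lets a linear model
# carve out a curved boundary. More such attributes = more complexity.
curved = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
curved.fit(X_train, y_train)
print("linear holdout accuracy:", linear.score(X_hold, y_hold))
print("curved holdout accuracy:", curved.score(X_hold, y_hold))
```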
Why is Overfitting Bad?
As a model gets more complex, it picks up harmful spurious correlations
These correlations are peculiarities of the training set and do not represent characteristics of the population
This ultimately produces incorrect generalizations and causes predictive performance to decline
Overfitting
Definition: the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to unseen data
Problem: all procedures tend to overfit to some extent -- some more than others
Best strategy: recognize overfitting and manage complexity in a principled way
From Holdout Evaluation to Cross-Validation
Cross-validation: a more sophisticated holdout training and testing procedure
Maintains the benefits of holdout evaluation, but also gives statistics on the estimated performance (such as its mean and variance across folds)
Makes better use of limited datasets
Splits the dataset into k partitions called folds; each fold serves once as the holdout set while the remaining folds are used for training
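A minimal cross-validation sketch (assuming scikit-learn and ten folds; the synthetic data is illustrative):
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Each of the 10 folds serves once as the test set while the other
# nine train the model, so every instance is tested exactly once.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("fold accuracies:", np.round(scores, 3))
print("mean ± std: %.3f ± %.3f" % (scores.mean(), scores.std()))
```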
Learning Curves
All else being equal, generalization performance improves as more training data become available
Learning curve: a plot of generalization performance against the amount of training data
Difference from fitting graphs: a learning curve shows only performance on testing data, plotted against the amount of training data; a fitting graph also shows training-set performance, plotted against model complexity
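A learning-curve sketch (assuming scikit-learn's learning_curve helper; the fixed complexity of 16 leaf nodes is my choice for illustration):
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# Model complexity is held fixed; only the training-set size varies.
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(max_leaf_nodes=16, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, score in zip(sizes, test_scores.mean(axis=1)):
    # Generalization performance typically climbs, then flattens out,
    # as more training data become available.
    print(f"{n:5d} training instances -> test accuracy {score:.3f}")
```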
Overfitting Avoidance and Complexity Control
Avoiding overfitting with tree induction
Stop growing the tree before it gets too complex
Specify a minimum number of instances that must be present at a leaf
What threshold should be used? How few instances are we willing to tolerate at a leaf?
Grow the tree until it is too large, then "prune" it back
Estimate whether replacing a branch (a set of leaves) with a single leaf would hurt accuracy; if not, prune it
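A sketch of both tree-control strategies (assuming scikit-learn; the specific thresholds, and using ccp_alpha cost-complexity pruning as the pruning mechanism, are my assumptions):
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

# Strategy 1: stop early by refusing leaves below a minimum-instances threshold.
stopped = DecisionTreeClassifier(min_samples_leaf=20, random_state=0)
stopped.fit(X_train, y_train)

# Strategy 2: grow a large tree, then prune back; ccp_alpha penalizes leaves,
# collapsing branches whose extra accuracy doesn't justify their size.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
pruned.fit(X_train, y_train)

for name, tree in (("stopped early", stopped), ("pruned back", pruned)):
    print(name, "leaves:", tree.get_n_leaves(),
          "holdout accuracy:", round(tree.score(X_hold, y_hold), 3))
```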
A general method for avoiding overfitting
Nested cross-validation: using the training data itself to choose the model complexity experimentally
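A nested cross-validation sketch (assuming scikit-learn; the tree learner and complexity grid are illustrative choices):
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Inner cross-validation picks the best complexity using training data only.
inner = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_leaf_nodes": [4, 8, 16, 32, 64]},
    cv=5)

# Outer cross-validation estimates how well that whole procedure generalizes.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy: %.3f ± %.3f"
      % (outer_scores.mean(), outer_scores.std()))
```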
Sequential forward selection (SFS): prioritizing features by adding the most useful one at a time
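An SFS sketch (assuming scikit-learn's SequentialFeatureSelector, available since scikit-learn 0.24; the dataset and feature count are illustrative):
```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=4, random_state=0)

# Starting from no features, greedily add the one that most improves
# cross-validated performance, stopping at the requested count.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4, direction="forward", cv=5)
sfs.fit(X, y)
print("selected feature indices:", sfs.get_support(indices=True))
```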
Avoiding overfitting for parameter optimization
Regularization: optimize some combination of fit and simplicity
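A regularization sketch (assuming scikit-learn's LogisticRegression, where C is the inverse regularization strength; the dataset and C values are illustrative):
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=5, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

for C in (100.0, 1.0, 0.01):
    model = LogisticRegression(C=C, max_iter=5000).fit(X_train, y_train)
    # Stronger regularization (smaller C) shrinks the weights, trading a
    # bit of training fit for better generalization on the holdout data.
    print(f"C={C:>6}: weight norm {np.linalg.norm(model.coef_):6.2f}, "
          f"train {model.score(X_train, y_train):.3f}, "
          f"holdout {model.score(X_hold, y_hold):.3f}")
```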