Chapter 5: Overfitting and its Avoidance
table model
memorizes the training data and performs no generalization
remember, every dataset is a finite sample of a population
overfitting: the tendency of data mining procedures to tailor models to the training data
at the expense of generalization to previously unseen data points
holdout data
lab test of generalization performance
there is usually a difference between a model's accuracy on the training set and its generalization accuracy
often referred to as the test set
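The gap shows up even in a toy experiment. A minimal sketch, assuming scikit-learn and synthetic data (neither is from the book):

```python
# Compare training accuracy with holdout (test-set) accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An unrestricted tree can memorize the training data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("training accuracy:", model.score(X_train, y_train))  # often ~1.0
print("holdout accuracy: ", model.score(X_test, y_test))    # noticeably lower
```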
overfitting in tree induction
one remedy: restrict tree size (e.g., to a maximum of 100 nodes)
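Continuing the sketch above, one way to cap tree size; note that scikit-learn's max_leaf_nodes caps leaves rather than total nodes, so it is only a stand-in for the node limit mentioned here:

```python
from sklearn.tree import DecisionTreeClassifier

# Stop the tree from growing past 100 leaves so it cannot memorize.
small_tree = DecisionTreeClassifier(max_leaf_nodes=100, random_state=0)
small_tree.fit(X_train, y_train)  # reuses the split from the sketch above
print("training accuracy:", small_tree.score(X_train, y_train))
print("holdout accuracy: ", small_tree.score(X_test, y_test))
```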
why is overfitting bad?
as the model gets more complex, it is allowed to pick up harmful spurious correlations
these do not represent characteristics of the population in general
cross-validation is a more sophisticated holdout training and testing procedure
makes better use of a limited data set
splitting labeled datasets into partitions called FOLDS
usually 5-10 folds
compute the average and standard deviation of performance across the folds
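A minimal sketch of the procedure, assuming scikit-learn and synthetic data:

```python
# 10-fold cross-validation: report mean and standard deviation across folds.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```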
logistic regression vs classification trees
spurious: not being what it purports to be; false or fake.
Learning curves
if the training set size changes, you may also expect different generalization performance
a plot of the generalization performance against the amount of training data is called the learning curve
usually have a characteristic shape: steep improvement at first, flattening out as training data grows
shows generalization performance: performance on testing data only
A fitting graph also measures generalization, but it is plotted against model complexity (for a fixed amount of training data)
for smaller data, tree induction will tend to overfit more
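A sketch of plotting a learning curve, assuming scikit-learn and matplotlib (synthetic data, not the book's examples):

```python
# Plot cross-validated generalization accuracy against training set size.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

plt.plot(sizes, test_scores.mean(axis=1), marker="o")
plt.xlabel("training set size")
plt.ylabel("generalization accuracy (cross-validated)")
plt.title("Learning curve")
plt.show()
```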
to avoid overfitting we attempt to reduce the complexity of the model
if we have a group of models we want to rank by generalization performance, use nested cross-validation
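A minimal sketch of nested cross-validation, assuming scikit-learn: an inner loop picks each model's complexity, and an outer loop estimates generalization performance:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Inner folds: tune complexity (tree size) for each training split.
inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     param_grid={"max_leaf_nodes": [10, 50, 100, 200]},
                     cv=5)
# Outer folds: estimate generalization performance of the tuned model.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy:", outer_scores.mean())
```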
regularization: penalize model complexity directly in the objective function (e.g., L1 or L2 penalties for logistic regression)
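A hedged sketch using scikit-learn's L2-regularized logistic regression (C is the inverse penalty strength; the values swept here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
# Smaller C means a stronger complexity penalty (more regularization).
for C in (0.01, 0.1, 1.0, 10.0):
    acc = cross_val_score(LogisticRegression(C=C, max_iter=1000),
                          X, y, cv=5).mean()
    print(f"C={C}: cross-validated accuracy {acc:.3f}")
```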
tradeoffs between complexity and overfitting
the best way to test for overfitting is with holdout data
a fitting graph
has 2 curves: one for performance on the training data, one for performance on the holdout data (the base error rate is sometimes drawn as a reference line)
the gap between the two curves indicates overfitting
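A sketch that produces such a fitting graph, assuming scikit-learn and matplotlib, with tree size swept as the complexity axis:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Train one tree per complexity level and record both accuracies.
complexities = [2, 5, 10, 25, 50, 100, 200, 400]
train_acc, test_acc = [], []
for n in complexities:
    tree = DecisionTreeClassifier(max_leaf_nodes=n, random_state=0).fit(X_tr, y_tr)
    train_acc.append(tree.score(X_tr, y_tr))
    test_acc.append(tree.score(X_te, y_te))

plt.plot(complexities, train_acc, label="training accuracy")
plt.plot(complexities, test_acc, label="holdout accuracy")
plt.xlabel("model complexity (max leaf nodes)")
plt.ylabel("accuracy")
plt.legend()
plt.title("Fitting graph")
plt.show()
```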
reining in model complexity to avoid overfitting