Ch. 5 Overfitting
overfitting- the tendency of data mining procedures to tailor models to the training data, at the expense of generalizing to data not yet seen
fitting graph- shows the accuracy of a model as a function of model complexity
holdout data- data for which the value of the target variable is known but which is not used to build the model
test set- holdout data used to estimate the difference between the model's general accuracy and its accuracy on the training set
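a minimal sketch of a holdout evaluation, assuming scikit-learn; the make_classification data and the logistic regression model are stand-ins chosen only for illustration

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Holdout data: target values are known, but these rows take no part in model building.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("training accuracy:", round(model.score(X_train, y_train), 3))
print("holdout accuracy: ", round(model.score(X_hold, y_hold), 3))
```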
memorization- most extreme overfitting possible
overfitting in tree induction- keep splitting the data so the subsets eventually become pure
training accuracy > holdout accuracy
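a sketch of that effect, assuming scikit-learn decision trees: sweeping max_depth produces the numbers for a fitting graph, and the unrestricted tree (max_depth=None) grows pure leaves and memorizes the training set

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

# Deeper trees = more complexity; None lets the tree grow until its leaves are pure.
for depth in (1, 2, 4, 8, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.3f}, "
          f"holdout={tree.score(X_hold, y_hold):.3f}")
```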
overfitting mathematically- adding more terms to the function (e.g., an x² term that turns a line into a parabola) lets it bend to fit the training points more and more closely
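a toy sketch of the idea with NumPy polynomial fits; the sine-plus-noise data is an assumption made only for illustration. Training error keeps dropping as terms are added, while error on held-out points eventually rises

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a simple curve plus noise, split into training and holdout points.
x_train = np.linspace(0.0, 1.0, 8)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.size)
x_hold = np.linspace(0.05, 0.95, 8)
y_hold = np.sin(2 * np.pi * x_hold) + rng.normal(scale=0.2, size=x_hold.size)

for degree in (1, 2, 4, 7):          # line, parabola, and beyond
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    hold_mse = np.mean((np.polyval(coeffs, x_hold) - y_hold) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, holdout MSE {hold_mse:.3f}")
```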
why overfitting is bad- the model memorizes the training data and is incapable of generalizing; past a certain complexity, growing the model further no longer improves it
as a model gets more complex, it is allowed to pick up more harmful spurious correlations
these spurious correlations provide incorrect generalizations
cross-validation- gives an estimate of generalization performance plus statistics on that estimate, i.e., mean and variance
critical in assessing confidence in the performance estimate
makes better use of a limited data set- computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing
folds- the k partitions into which the data set is split
k= 5 or 10
compare fold accuracies between logistic regression and classification trees
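a sketch of such a comparison, assuming scikit-learn's cross_val_score; the make_classification data stands in for a real dataset

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "classification tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    # k = 10 folds: each fold is held out once while the other folds train the model.
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: fold accuracies {np.round(scores, 3)}")
    print(f"  mean {scores.mean():.3f}, std dev {scores.std():.3f}")
```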
learning curve- plot of generalization performance against the amount of training data
shape- starts off steep, then flattens out
learning curve shows generalization performance (performance on testing data only) plotted against the amount of training data used
fitting graph shows generalization performance as well as the performance on the training data, but plotted against model complexity
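a sketch of computing a learning curve with scikit-learn's learning_curve helper; the tree model and synthetic data are assumptions for illustration

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Cross-validated (generalization) accuracy at increasing training-set sizes.
sizes, _, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)
for n, scores in zip(sizes, test_scores):
    print(f"{n:5d} training examples -> holdout accuracy {scores.mean():.3f}")
```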
more flexibility allows more overfitting
avoid overfitting by controlling the complexity of the models
2 techniques to avoid overfitting in tree induction:
1. stop growing the tree before it gets too complex
2. grow the tree until it is too large, then prune it back, reducing its size (and complexity)
the simplest method to limit tree size is to specify a minimum number of instances that must be present in a leaf
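a sketch of that simplest stopping rule via scikit-learn's min_samples_leaf parameter (the synthetic data is an assumption); larger minimums give smaller trees and usually a smaller gap between training and holdout accuracy

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

for min_leaf in (1, 5, 25, 100):
    # Pre-pruning: a split is only made if each resulting leaf keeps
    # at least `min_leaf` training instances.
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
    tree.fit(X_train, y_train)
    print(f"min_samples_leaf={min_leaf:3d}: leaves={tree.get_n_leaves():3d}, "
          f"train={tree.score(X_train, y_train):.3f}, "
          f"holdout={tree.score(X_hold, y_hold):.3f}")
```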
table model- memorizes the training data and performs no generalization
generalization- the model applies to data that was not used to build the model
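a toy sketch of a table model in plain Python (the weather/plan data is made up); it recalls memorized instances perfectly but has no way to generalize to unseen ones

```python
# Hypothetical memorized training data: (weather, day type) -> decision.
training_table = {
    ("rainy", "weekend"): "stay_home",
    ("sunny", "weekend"): "go_out",
    ("sunny", "weekday"): "go_out",
}

def table_model(instance, default="stay_home"):
    # Perfect recall for memorized instances; unseen instances fall back to a default.
    return training_table.get(instance, default)

print(table_model(("rainy", "weekend")))   # memorized -> "stay_home"
print(table_model(("rainy", "weekday")))   # never seen -> arbitrary default
```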