Overfitting & Its Avoidance (Overfitting (All data mining procedures…

- - - - *) Fitting Graph: shows the accuracy of a model as a function of its complexity. To examine overfitting, the concept of holdout data is fundamental.
        
        Problem
        
        Evaluation on training data provides no assessment of how well the model generalizes to unseen cases.
        
        Solution
        
        We need to hold out some data for which we know the value of the target variable, but which will not be used to build the model.
      - Holdout data is like creating a "lab test" of generalization performance.
      - The model predicts the target values for the holdout data. We then estimate the generalization performance by comparing the predicted values with the hidden true values.
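The holdout idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the toy data, the `holdout_split` and `accuracy` helpers, the 30% holdout fraction, and the midpoint-threshold "model" are all assumptions made for the example.

```python
import random

def holdout_split(rows, holdout_fraction=0.3, seed=0):
    """Shuffle the rows and set aside a fraction of them as holdout data."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def accuracy(model, rows):
    """Fraction of rows whose predicted label equals the true label."""
    return sum(1 for x, label in rows if model(x) == label) / len(rows)

def class_mean(rows, wanted_label):
    """Mean feature value among rows carrying the given label."""
    values = [x for x, label in rows if label == wanted_label]
    return sum(values) / len(values)

# Hypothetical toy data: (feature, label), label is 1 when the feature exceeds 50.
rows = [(x, int(x > 50)) for x in range(100)]
training, holdout = holdout_split(rows)

# The model is built from the training rows only: a threshold placed halfway
# between the two class means observed in training.
threshold = (class_mean(training, 0) + class_mean(training, 1)) / 2

def model(x):
    return int(x > threshold)

# The holdout labels stay hidden during model building and are used only here.
holdout_accuracy = accuracy(model, holdout)
print(f"holdout accuracy: {holdout_accuracy:.2f}")
```

The key point is that `holdout` never influences `threshold`; it is used only to score the finished model.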
- - - - The accuracy of a model depends on how complex we allow it to be.
- - - - What we have just built is a version of the lookup table, which is an example of overfitting.
      - What will be the accuracy on the training set?
        
        It will be perfectly accurate, predicting correctly the class for every training instance.
        
        This tree should be slightly better than the lookup table because every previously unseen instance will arrive at some classification, rather than just failing to match; the tree will give a nontrivial classification even for instances it has not seen before.
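A lookup table makes the point concrete. The sketch below uses hypothetical customer records (the feature tuples and labels are invented for illustration): the table is perfectly accurate on the training set, yet has nothing to say about an instance it has not seen.

```python
def build_lookup_table(training_rows):
    """Memorize every training instance: feature tuple -> label."""
    return {features: label for features, label in training_rows}

# Hypothetical training data: (age, has_contract) -> outcome
training = [
    ((25, "yes"), "stay"),
    ((31, "no"), "churn"),
    ((47, "yes"), "stay"),
    ((52, "no"), "churn"),
]

table = build_lookup_table(training)

# Every training instance is found in the table, so training accuracy is perfect.
training_accuracy = sum(table[f] == label for f, label in training) / len(training)

# A previously unseen instance simply fails to match.
unseen = (40, "yes")
prediction = table.get(unseen)

print(training_accuracy, prediction)
```

A fully grown tree memorizes the training data in much the same way, but it still routes an unseen instance to some leaf and so returns a classification rather than failing to match.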
  - - - In the graph:
        
        Beginning at the left, the tree is very small and has poor performance. As it is allowed more and more nodes it improves rapidly, and both training-set accuracy and holdout-set accuracy improve.
        
        But at some point the tree starts to overfit: it acquires details of the training set that are not characteristic of the population in general, as represented by the holdout set.
        
        The overfitting starts when x = 100 nodes.
        
        The holdout accuracy declines as the tree grows past its “sweet spot”; the data subsets at the leaves get smaller and smaller, and the model generalizes from fewer and fewer data.
      - This example represents the best trade-off between the extremes of (i) not splitting the data at all and simply using the average target value in the entire dataset, and (ii) building a complete tree out until the leaves are pure.
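The shape of a fitting graph can be reproduced with a stand-in for a tree. The sketch below is not tree induction: it assumes synthetic one-dimensional data with 20% label noise and uses a majority-vote classifier over equal-width bins, where the number of bins plays the role of the number of nodes. All names here (`make_data`, `fit_bins`, `accuracy`) are hypothetical helpers.

```python
import random
from collections import Counter, defaultdict

def make_data(n, rng, noise=0.2):
    """Synthetic rows (x, label): label is 1 when x > 0.5, flipped with probability `noise`."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        rows.append((x, label))
    return rows

def fit_bins(rows, n_bins):
    """Majority-vote classifier over n_bins equal-width intervals of [0, 1)."""
    votes = defaultdict(Counter)
    for x, label in rows:
        votes[int(x * n_bins)][label] += 1
    overall = Counter(label for _, label in rows).most_common(1)[0][0]
    table = {b: counts.most_common(1)[0][0] for b, counts in votes.items()}
    # Bins with no training data fall back to the overall majority class.
    return lambda x: table.get(int(x * n_bins), overall)

def accuracy(model, rows):
    return sum(1 for x, label in rows if model(x) == label) / len(rows)

rng = random.Random(0)
training = make_data(200, rng)
holdout = make_data(200, rng)

# Complexity doubles at each step, so each partition refines the previous one.
fitting_graph = []
for n_bins in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
    model = fit_bins(training, n_bins)
    fitting_graph.append((n_bins, accuracy(model, training), accuracy(model, holdout)))

for n_bins, train_acc, holdout_acc in fitting_graph:
    print(f"{n_bins:4d} bins  training={train_acc:.3f}  holdout={holdout_acc:.3f}")
```

Because each partition refines the previous one, training accuracy can only stay level or rise as bins are added. Holdout accuracy has no such guarantee, and on noisy data it typically peaks at a modest number of bins and then drifts down as each bin holds fewer and fewer training points.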
- - - - Learning curve: shows the generalization performance (the performance only on testing data) plotted against the amount of training data used.
    - - Fitting graph: shows the generalization performance as well as the performance on the training data, but plotted against model complexity.
        
        Classification trees are a more flexible model representation than linear logistic regression. One consequence is that, for smaller data, tree induction will tend to overfit more.
    - - The learning curve may show that generalization performance has leveled off so investing in more training data is probably not worthwhile; instead, one should accept the current performance or look for another way to improve the model, such as by devising better features.
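A learning curve can be computed by holding the model's complexity fixed and varying only the amount of training data. The sketch below assumes synthetic one-dimensional data and a simple 16-bin majority-vote classifier as the model; both are illustrative choices, and `make_data`, `fit_bins`, and `accuracy` are hypothetical helpers.

```python
import random
from collections import Counter, defaultdict

def make_data(n, rng, noise=0.2):
    """Synthetic rows (x, label): label is 1 when x > 0.5, flipped with probability `noise`."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        rows.append((x, label))
    return rows

def fit_bins(rows, n_bins=16):
    """Majority-vote classifier over n_bins equal-width intervals of [0, 1)."""
    votes = defaultdict(Counter)
    for x, label in rows:
        votes[int(x * n_bins)][label] += 1
    overall = Counter(label for _, label in rows).most_common(1)[0][0]
    table = {b: counts.most_common(1)[0][0] for b, counts in votes.items()}
    return lambda x: table.get(int(x * n_bins), overall)

def accuracy(model, rows):
    return sum(1 for x, label in rows if model(x) == label) / len(rows)

rng = random.Random(1)
test_set = make_data(1000, rng)   # fixed test data, never used for training
pool = make_data(1000, rng)       # training data is drawn from this pool

# Same model family and complexity each time; only the training size changes.
learning_curve = []
for n_train in [20, 50, 100, 200, 500, 1000]:
    model = fit_bins(pool[:n_train])
    learning_curve.append((n_train, accuracy(model, test_set)))

for n_train, test_acc in learning_curve:
    print(f"{n_train:5d} training rows  test accuracy={test_acc:.3f}")
```

If the test accuracy stops improving between the larger training sizes, the curve has leveled off in the sense described above.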
- - - - 1) Stop growing the tree before it gets too complex
      - 2) Grow the tree until it is too large, and then prune it back to reduce its size.
        
        Prune
        
        *) Prune: to cut off leaves and branches, replacing them with leaves.
        
        The idea is to estimate whether replacing a set of leaves or a branch with a leaf would reduce accuracy. If not, then go ahead and prune. The process can be iterated on progressive subtrees until any removal or replacement would reduce accuracy.
      - There are various techniques to accomplish both strategies.
        
        Simplest
        
        To limit tree size, specify a minimum number of instances that must be present in a leaf.
        
        Tree induction will automatically grow the tree branches that have a lot of data and cut short branches that have fewer data, thereby automatically adapting the model based on the data distribution.
        
        What is the threshold?
        
        There is no fixed number, although practitioners tend to have their own preferences based on experience.
        
        Another way
        
        1 more item...
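The minimum-instances-per-leaf rule described above can be illustrated with a deliberately simplified tree. This sketch is an assumption-laden toy: it splits one-dimensional data at the median rather than choosing splits by information gain, and the data and the `grow_tree` and `count_leaves` helpers are hypothetical.

```python
def grow_tree(rows, min_leaf):
    """Grow a tree on (x, label) rows sorted by x, splitting each node at its median."""
    labels = [label for _, label in rows]
    majority = max(set(labels), key=labels.count)
    is_pure = len(set(labels)) == 1
    # Stop when the node is pure, or when splitting would leave a leaf
    # with fewer than min_leaf instances.
    if is_pure or len(rows) < 2 * min_leaf:
        return {"leaf": True, "label": majority, "size": len(rows)}
    mid = len(rows) // 2
    return {
        "leaf": False,
        "threshold": rows[mid][0],
        "left": grow_tree(rows[:mid], min_leaf),
        "right": grow_tree(rows[mid:], min_leaf),
    }

def count_leaves(node):
    if node["leaf"]:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])

# Hypothetical data: label is 1 for the upper half, with every 7th label flipped as noise.
rows = []
for i in range(100):
    label = int(i >= 50)
    if i % 7 == 0:
        label = 1 - label
    rows.append((i / 100, label))

leaf_counts = {m: count_leaves(grow_tree(rows, m)) for m in [1, 5, 20]}
print(leaf_counts)
```

With `min_leaf=1` the tree keeps splitting until every leaf is pure, chasing the flipped labels; larger values cut growth short wherever the data runs thin.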
- - - - The test data should be strictly independent of model building so that we can get an independent estimate of model accuracy.
        
        Example
        
        We might want to estimate the ultimate business performance, or to compare the best model from one family against the best model from another family.
        
        We perform the Nested Holdout Testing
        
        Take the training set and split it again into a training subset and a testing subset. Then we can build models on this training subset and pick the best model based on this testing subset. Let’s call the former the sub-training set and the latter the validation set for clarity. The validation set is separate from the final test set, on which we are never going to make any modeling decisions.
        
        Nested cross-validation is more complicated, but it works as you might suspect.
        
        This idea of using the data to choose the complexity experimentally, as well as to build the resulting model, applies across different induction algorithms and different sorts of complexity.
        
        Method:
        
        Run with many different feature sets, using this sort of nested holdout procedure to pick the best.
        
        Example:
        
        2 more items...
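The nested holdout procedure described above can be sketched as follows. The data, the binned majority-vote model, and the candidate complexities are illustrative assumptions. Refitting on the whole training set once the complexity is chosen is a common practice, assumed here rather than taken from the notes.

```python
import random
from collections import Counter, defaultdict

def make_data(n, rng, noise=0.2):
    """Synthetic rows (x, label): label is 1 when x > 0.5, flipped with probability `noise`."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        rows.append((x, label))
    return rows

def fit_bins(rows, n_bins):
    """Majority-vote classifier over n_bins equal-width intervals of [0, 1)."""
    votes = defaultdict(Counter)
    for x, label in rows:
        votes[int(x * n_bins)][label] += 1
    overall = Counter(label for _, label in rows).most_common(1)[0][0]
    table = {b: counts.most_common(1)[0][0] for b, counts in votes.items()}
    return lambda x: table.get(int(x * n_bins), overall)

def accuracy(model, rows):
    return sum(1 for x, label in rows if model(x) == label) / len(rows)

rng = random.Random(2)
data = make_data(600, rng)

# Outer split: the test set is never used for any modeling decision.
training, test = data[:400], data[400:]
# Inner split of the training set only.
sub_training, validation = training[:300], training[300:]

# Pick the complexity using the validation set.
candidates = [1, 2, 4, 8, 16, 32, 64]
validation_scores = {
    k: accuracy(fit_bins(sub_training, k), validation) for k in candidates
}
# max() keeps the first best candidate, so ties go to the simpler model.
best_bins = max(candidates, key=lambda k: validation_scores[k])

# Assumed final step: rebuild on the full training set, then score once on the test set.
final_model = fit_bins(training, best_bins)
test_accuracy = accuracy(final_model, test)
print(best_bins, round(test_accuracy, 3))
```

The same loop could range over feature sets instead of bin counts; the structure of the splits stays the same.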
- - - - Finding the set of parameters that maximizes some objective function indicating how well the model fits the data
        
        Complexity control via regularization works by adding to this objective function a penalty for complexity
        
        The λ term is simply a weight that determines how much importance the optimization procedure should place on the penalty, compared to the data fit. At this point, the modeler has to choose λ and the penalty function.
        
        Common Penalties that can be applied
        
        L2-norm
        
        Sum of the squares of the weights (strictly speaking, the squared L2-norm)
        
        2 more items...
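Written out, the regularized objective described above takes roughly the following form. The notation is an assumption made for illustration: $\mathbf{w}$ stands for the model parameters, $\mathbf{x}$ for the training data, and $\mathrm{fit}$ and $\mathrm{penalty}$ for the data-fit and complexity-penalty functions.

```latex
\[
\arg\max_{\mathbf{w}} \Big[ \mathrm{fit}(\mathbf{x}, \mathbf{w}) - \lambda \cdot \mathrm{penalty}(\mathbf{w}) \Big]
\]

% L2 penalty: the sum of the squares of the weights
\[
\mathrm{penalty}_{L2}(\mathbf{w}) = \sum_{j} w_j^{2}
\]
```

With $\lambda = 0$ the penalty vanishes and the procedure maximizes data fit alone; larger values of $\lambda$ push the optimization toward smaller weights at the expense of fit.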