Overfitting and Its Avoidance
Generalization
Property of a model or modeling process
Every dataset is a finite sample of a larger population
We want models to apply not just to the exact training data but to the general population it was drawn from
Learning curves
plot of generalization performance against the amount of training data
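A minimal sketch of plotting a learning curve, assuming a synthetic dataset and scikit-learn's learning_curve helper (the model and training sizes are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic dataset standing in for a labeled sample.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Cross-validated generalization performance at increasing training-set sizes.
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

plt.plot(sizes, test_scores.mean(axis=1), marker="o")
plt.xlabel("Number of training instances")
plt.ylabel("Generalization accuracy (cross-validated)")
plt.title("Learning curve")
plt.show()
```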
Overfitting
Tendency of data mining procedures to tailor models to the training data at the expense of generalization to previously unseen data points
Pure memorization (see the sketch below)
The most extreme overfitting procedure possible
All data mining procedures have a tendency to overfit to some extent
If we look hard enough, we will always find patterns in the data, even spurious ones
"If you torture the data long enough, it will confess."
The best strategy is to recognize overfitting and manage complexity in a principled way
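To make the memorization point concrete, here is a toy sketch (purely illustrative setup) of a lookup-table "model" that is perfect on the data it has seen and near chance on holdout data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = rng.integers(0, 2, size=500)          # labels are pure noise on purpose

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pure memorization: store every training instance with its label.
table = {tuple(row): label for row, label in zip(X_tr, y_tr)}

def memorizer_predict(rows, default=0):
    # Return the memorized label if seen before, otherwise a default guess.
    return np.array([table.get(tuple(r), default) for r in rows])

print("training accuracy:", (memorizer_predict(X_tr) == y_tr).mean())  # 1.0
print("holdout accuracy: ", (memorizer_predict(X_te) == y_te).mean())  # ~0.5
```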
How to recognize overfitting?
Holdout data
Data not used to build the model
"Hold out" data at the beginning
Think of it as creating a "lab test" of generalization performance
Estimate generalization performance by comparing predictions against the hidden true values
Also known as the "test set"
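A minimal holdout sketch, assuming a synthetic dataset and scikit-learn; the held-out portion plays the role of the "lab test":

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# "Hold out" 25% of the data before any modeling happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Generalization estimate: compare predictions with the hidden true labels.
print("training accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("holdout accuracy: ", accuracy_score(y_test, model.predict(X_test)))
```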
Cross-validation
More sophisticated holdout training & testing procedure
Better use of limited data
Computes estimates over all data
Splits the labeled dataset into k partitions (folds); each fold is used once for testing while the model is trained on the remaining k-1 folds
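A sketch of k-fold cross-validation with scikit-learn (k = 10 and the dataset are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Every labeled instance is used for testing exactly once across the 10 folds.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores.round(3))
print("mean / std:       ", scores.mean().round(3), scores.std().round(3))
```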
Fitting graph
Shows accuracy of model as function of complexity
Plots the difference between the modeling procedure's accuracy on training data and its accuracy on holdout data as model complexity changes
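A fitting-graph sketch, assuming a synthetic dataset and using tree depth as the complexity axis (an illustrative choice):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Some label noise (flip_y) so overfitting is visible.
X, y = make_classification(n_samples=2000, n_informative=5, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

depths = range(1, 21)
train_acc, test_acc = [], []
for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_tr, y_tr)
    train_acc.append(tree.score(X_tr, y_tr))
    test_acc.append(tree.score(X_te, y_te))

plt.plot(depths, train_acc, label="training accuracy")
plt.plot(depths, test_acc, label="holdout accuracy")
plt.xlabel("Model complexity (tree depth)")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```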
Overfitting in Tree Induction
A tree grown without restriction can be perfectly accurate, predicting the correct class for every training instance
A tree is better than a lookup table because a previously unseen instance still arrives at some classification
A procedure that grows trees until the leaves are pure will overfit
The complexity of a tree lies in its number of nodes
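A sketch contrasting a tree grown until its leaves are pure with a depth-limited tree, using node count as the complexity measure (toy data assumed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, flip_y=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)   # pure leaves
small = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

for name, tree in [("unrestricted", full), ("depth-limited", small)]:
    print(f"{name:14s} nodes={tree.tree_.node_count:4d} "
          f"train={tree.score(X_tr, y_tr):.3f} "
          f"holdout={tree.score(X_te, y_te):.3f}")
```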
Overfitting in Mathematical Functions
A function can become more complex by adding more variables (attributes)
Modelers can also change the function so it is no longer linear
They do this by adding non-linear attributes, such as squared or interaction terms
Dropping attributes can also help avoid overfitting
Careful manual attribute selection is an option when time permits
but it may not be feasible for large datasets
Increasing dimensionality lets a model perfectly fit ever larger sets of arbitrary points
Even when it cannot fit them perfectly, it fits them better and better
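A sketch of adding non-linear (polynomial) attributes to a linear model, showing training accuracy climbing while holdout accuracy lags (dataset and degrees are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=300, n_features=10, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Higher degrees add more non-linear attributes, i.e. more complexity.
for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree), StandardScaler(),
                          LogisticRegression(max_iter=5000))
    model.fit(X_tr, y_tr)
    print(f"degree={degree} train={model.score(X_tr, y_tr):.3f} "
          f"holdout={model.score(X_te, y_te):.3f}")
```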
Why is Overfitting Bad?
As a model gets more complex, it is allowed to pick up harmful spurious correlations
Idiosyncrasies
These do not represent characteristics of the population in general
Two-class example
Avoidance & Complexity Control
Avoid by controlling complexity
Tree induction will keep growing the tree to fit the training data
Two techniques to avoid overfitting (sketched below):
Stop growing the tree before it gets too complex
Grow the tree until it is too large, then prune it back
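A sketch of both techniques using scikit-learn's rough equivalents (an assumed mapping): min_samples_leaf stops growth early, while cost-complexity pruning (ccp_alpha) prunes a fully grown tree back:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, flip_y=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# (1) Stop growing before the tree gets too complex.
early = DecisionTreeClassifier(min_samples_leaf=25, random_state=0)
# (2) Grow the tree fully, then prune it back.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)

for name, tree in [("early stopping", early), ("post-pruning", pruned)]:
    tree.fit(X_tr, y_tr)
    print(f"{name:14s} nodes={tree.tree_.node_count:3d} "
          f"holdout accuracy={tree.score(X_te, y_te):.3f}")
```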
General avoidance
Keep test data strictly independent of model building
so we can get an independent estimate of model accuracy
Realize there is nothing special about the first training/test split
For parameter optimization
Use nested cross-validation to choose appropriate parameter values, such as the weight given to the complexity penalty
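A nested cross-validation sketch: the inner loop picks a complexity parameter, the outer loop estimates generalization performance (dataset and parameter grid are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, flip_y=0.1, random_state=0)

# Inner loop: choose the complexity parameter on the training folds only.
inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     param_grid={"min_samples_leaf": [1, 5, 10, 25, 50]},
                     cv=5)

# Outer loop: estimate how well the whole tuning-plus-fitting procedure generalizes.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy:", outer_scores.mean().round(3))
```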