Data Science for Business
By: Provost & Fawcett
CH 5: "Overfitting and Its Avoidance"
Fundamental Concepts
Fundamental notions of data science: overfitting and generalization
Finding patterns in datasets
Complexity control
Generalization
Finding patterns that generalize, i.e., predict well for data that has not yet been observed
Bad example: the lookup-table model
Fails to generalize to any data beyond what was used to build it
Fits the training data too perfectly (overfits)
Overfitting Examined
Holdout Data and Fitting Graphs
Fitting graph: shows the accuracy of a model as a function of model complexity
"Hold out" some data for which we know the value of the target variable, but which is not used to build the model
Used in order to examine overfitting
The bigger the gap between accuracy on the training data and accuracy on the holdout data, the more the model has memorized rather than generalized
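A minimal sketch of holdout evaluation in code (illustrative only; the synthetic dataset and scikit-learn calls are my assumptions, not the book's):
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out 30% of the data: we know its target values, but the model
# never sees it during training.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("training accuracy:", model.score(X_train, y_train))
print("holdout accuracy: ", model.score(X_hold, y_hold))
# A large gap between the two scores signals memorization (overfitting).
```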
Overfitting in Tree Induction
A fully grown tree does only slightly better than a lookup table, because it at least makes nontrivial classifications for unseen instances
From the fitting graph, examine overfitting as a function of the number of nodes, then restrict the tree's size accordingly
Trees are very flexible in what they can represent
A tree's complexity lies in its number of nodes
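A fitting-graph sketch for tree induction (assuming scikit-learn; using max_leaf_nodes as the complexity knob is my choice for illustration):
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_informative=5, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

for n_leaves in (2, 4, 8, 16, 32, 64, 128, 256):
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_train, y_train)
    # Training accuracy keeps rising with complexity; holdout accuracy
    # typically rises, then falls once the tree starts to overfit.
    print(n_leaves, tree.score(X_train, y_train), tree.score(X_hold, y_hold))
```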
Overfitting in Mathematical Functions
Adding more variables makes the function more complex
The separating boundary doesn't have to be just a straight line
The function may assign different weights to different variables
Complexity also grows by adding new attributes that are nonlinear versions of the original ones
Modelers carefully prune the attributes in order to avoid overfitting
Careful manual attribute selection is a wise practice
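A sketch of adding nonlinear versions of the original attributes (assuming scikit-learn; the squared and interaction terms are illustrative, not from the book):
```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

# A plain linear boundary cannot separate concentric circles...
linear = LogisticRegression().fit(X_train, y_train)
# ...but adding squared terms (x1^2, x2^2, x1*x2) lets a linear model
# carve out a curved boundary. More such attributes = more complexity.
curved = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
curved.fit(X_train, y_train)
print("linear holdout accuracy:", linear.score(X_hold, y_hold))
print("curved holdout accuracy:", curved.score(X_hold, y_hold))
```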
Why is Overfitting Bad?
As a model gets more complex, it picks up harmful spurious correlations
These correlations are peculiarities of the training set and do not represent characteristics of the population
This ultimately produces incorrect generalizations and causes predictive performance to decline
Overfitting
Definition: the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to unseen data
Problem: all procedures tend to overfit to some extent -- some more than others
Best strategy: recognize overfitting and manage complexity in a principled way
From Holdout Evaluation to Cross-Validation
Cross-validation: a more sophisticated holdout training and testing procedure
Maintains the benefits of holdout evaluation, but also gives statistics on the estimated performance (such as its mean and variance across folds)
Makes better use of limited datasets
Splits the dataset into k partitions called folds; each fold serves once as the holdout set while the remaining folds are used for training
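A minimal cross-validation sketch (assuming scikit-learn and ten folds; the synthetic data is illustrative):
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Each of the 10 folds serves once as the test set while the other
# nine train the model, so every instance is tested exactly once.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("fold accuracies:", np.round(scores, 3))
print("mean ± std: %.3f ± %.3f" % (scores.mean(), scores.std()))
```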
Learning Curves
All else being equal, generalization performance improves as more training data become available
Learning curve: a plot of generalization performance against the amount of training data
Difference from fitting graphs: a learning curve shows only performance on testing data, plotted against the amount of training data; a fitting graph also shows training-set performance, plotted against model complexity
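A learning-curve sketch (assuming scikit-learn's learning_curve helper; the fixed complexity of 16 leaf nodes is my choice for illustration):
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# Model complexity is held fixed; only the training-set size varies.
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(max_leaf_nodes=16, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, score in zip(sizes, test_scores.mean(axis=1)):
    # Generalization performance typically climbs, then flattens out,
    # as more training data become available.
    print(f"{n:5d} training instances -> test accuracy {score:.3f}")
```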
Overfitting Avoidance and Complexity Control
Avoiding overfitting with tree induction
Stop growing the tree before it gets too complex
Specify a minimum number of instances that must be present at a leaf
What threshold should be used? How few instances are we willing to tolerate at a leaf?
Grow the tree until it is too large, then "prune" it back
Estimate whether replacing a branch (a set of leaves) with a single leaf would hurt accuracy; if not, prune it
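A sketch of both tree-control strategies (assuming scikit-learn; the specific thresholds, and using ccp_alpha cost-complexity pruning as the pruning mechanism, are my assumptions):
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

# Strategy 1: stop early by refusing leaves below a minimum-instances threshold.
stopped = DecisionTreeClassifier(min_samples_leaf=20, random_state=0)
stopped.fit(X_train, y_train)

# Strategy 2: grow a large tree, then prune back; ccp_alpha penalizes leaves,
# collapsing branches whose extra accuracy doesn't justify their size.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
pruned.fit(X_train, y_train)

for name, tree in (("stopped early", stopped), ("pruned back", pruned)):
    print(name, "leaves:", tree.get_n_leaves(),
          "holdout accuracy:", round(tree.score(X_hold, y_hold), 3))
```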
A general method for avoiding overfitting
Nested cross-validation: using the training data itself to choose the model complexity experimentally
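A nested cross-validation sketch (assuming scikit-learn; the tree learner and complexity grid are illustrative choices):
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Inner cross-validation picks the best complexity using training data only.
inner = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_leaf_nodes": [4, 8, 16, 32, 64]},
    cv=5)

# Outer cross-validation estimates how well that whole procedure generalizes.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy: %.3f ± %.3f"
      % (outer_scores.mean(), outer_scores.std()))
```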
Sequential forward selection (SFS): prioritizing features by adding the most useful one at a time
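An SFS sketch (assuming scikit-learn's SequentialFeatureSelector, available since scikit-learn 0.24; the dataset and feature count are illustrative):
```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=4, random_state=0)

# Starting from no features, greedily add the one that most improves
# cross-validated performance, stopping at the requested count.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4, direction="forward", cv=5)
sfs.fit(X, y)
print("selected feature indices:", sfs.get_support(indices=True))
```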
Avoiding overfitting for parameter optimization
Regularization: optimize some combination of fit and simplicity
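A regularization sketch (assuming scikit-learn's LogisticRegression, where C is the inverse regularization strength; the dataset and C values are illustrative):
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=5, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

for C in (100.0, 1.0, 0.01):
    model = LogisticRegression(C=C, max_iter=5000).fit(X_train, y_train)
    # Stronger regularization (smaller C) shrinks the weights, trading a
    # bit of training fit for better generalization on the holdout data.
    print(f"C={C:>6}: weight norm {np.linalg.norm(model.coef_):6.2f}, "
          f"train {model.score(X_train, y_train):.3f}, "
          f"holdout {model.score(X_hold, y_hold):.3f}")
```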