Chapter 15: Build Candidate Models (15.4 Model Selection Process (15.4.1…

- - - - 5 and 10 are common numbers of folds to use in DataRobot
      - For small datasets, it may make sense to select more folds, because the more folds selected, the more cases end up in each training set. However, the validation sample (the first fold) becomes less reliable as it shrinks to a small set of cases.
    - - How to split the data
        
        Date/Time
        
        Addresses a critical evaluation issue: ensuring that all validation cases occur in a time period after the cases used to create the models.
        
        Partition Feature
        
        This is a method for determining exactly which cases are used in each fold. Partition Feature differs from the other approaches in that the user must do their own random or semi-random assignment of cases.
        
        Group
        
        Similar to Partition Feature, except that it allows the specification of a group-membership feature. DataRobot decides which partition each case goes into but always keeps each group together in a single partition.
        
        Random/Stratified
        
        Essentially the same tool in both cases: it pulls out the holdout sample for you, and you can choose the holdout percentage.
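
The fold-count trade-off and two of the splitting ideas above can be sketched in plain Python. This is only an illustration under stated assumptions, not DataRobot's implementation: the function names `fold_sizes`, `group_partition`, and `time_split` are hypothetical, and the largest-group-first placement rule is just one simple way to keep groups together.

```python
from collections import defaultdict


def fold_sizes(n_cases, n_folds):
    """Return (training_size, validation_size) for one cross-validation round."""
    validation = n_cases // n_folds
    return n_cases - validation, validation


def group_partition(group_ids, n_folds):
    """Assign case indices to folds so that each group lands in exactly one fold.

    Assumption: groups are placed largest-first into whichever fold is
    currently smallest. This is a simple balancing heuristic only.
    """
    members = defaultdict(list)
    for case_index, group in enumerate(group_ids):
        members[group].append(case_index)

    folds = [[] for _ in range(n_folds)]
    for group in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(folds, key=len)  # first fold with the fewest cases
        smallest.extend(members[group])
    return folds


def time_split(timestamps, cutoff):
    """Split case indices so every validation case is later than the cutoff."""
    train = [i for i, t in enumerate(timestamps) if t <= cutoff]
    validation = [i for i, t in enumerate(timestamps) if t > cutoff]
    return train, validation


# With 200 cases, more folds means more training cases but fewer validation cases.
five_fold = fold_sizes(200, 5)
ten_fold = fold_sizes(200, 10)

# Cases sharing a group id always end up in the same fold.
grouped = group_partition(["a", "a", "b", "c", "c", "c"], 2)

# Validation cases all occur after the training cases.
train_idx, valid_idx = time_split([1, 5, 3, 9], cutoff=4)
```

With 200 cases, moving from 5 to 10 folds raises the training set from 160 to 180 cases while the validation fold drops from 40 to 20, which is the reliability concern raised for small datasets.
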
- - - - Step 1 Setting Target Feature
      - Step 2 Creating CV and Holdout Partitions
      - Step 3 Characterizing Target Variable
      - Step 4 Loading the Dataset and preparing data
      - Step 5 Saving target and partitioning info
      - Step 6 Analyzing features
      - Step 7 Calculating list of models
    - - The green bar in the Importance column indicates the relative importance of a particular feature when examined against the target independently of all other features.
      - The green bar tops out at a score of 0.3, so any full green bar suggests that the feature predicts 30% or more of the target.
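
A small sketch of how a capped bar could be read. It assumes the bar length is the importance score clipped at 0.3 and scaled to the full bar width; that scaling rule and the name `bar_fill` are assumptions for illustration, not documented DataRobot behavior.

```python
def bar_fill(score, cap=0.3):
    """Return the fraction of the bar that is filled for an importance score.

    Assumption: scores are clipped to the range [0, cap] and scaled linearly,
    so any score at or above the cap shows as a completely full bar.
    """
    clipped = max(0.0, min(score, cap))
    return clipped / cap


half_bar = bar_fill(0.15)   # halfway to the cap
full_bar = bar_fill(0.30)   # exactly at the cap
also_full = bar_fill(0.75)  # above the cap, indistinguishable from 0.30
```

Under this reading, a full bar only says the score is at least 0.3; it does not distinguish a score of 0.3 from a much higher one.
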
- - - - There are two different ways blending is done in DataRobot.
        
        Two average blenders (AVG Blender and Advanced AVG Blender) average the probability score of each model’s prediction for each case.
        
        The other approach uses ENET Blenders, one blending the three top models and the other blending the top eight models.
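
The averaging idea can be sketched as follows. This is a minimal illustration of a plain average over per-model probability scores, assuming every model scores the same cases in the same order; the function name `average_blend` and the example scores are hypothetical, and the sketch does not reproduce DataRobot's blenders.

```python
def average_blend(model_probabilities):
    """Average per-case probability scores across models.

    model_probabilities: a list with one equal-length list of scores per model.
    """
    if not model_probabilities:
        raise ValueError("need at least one model")
    n_cases = len(model_probabilities[0])
    if any(len(scores) != n_cases for scores in model_probabilities):
        raise ValueError("all models must score the same cases")
    n_models = len(model_probabilities)
    # zip(*...) groups the scores that different models gave to the same case.
    return [sum(case_scores) / n_models for case_scores in zip(*model_probabilities)]


# Made-up scores from two models for three cases.
model_a = [0.25, 0.5, 1.0]
model_b = [0.75, 0.5, 0.0]
blended = average_blend([model_a, model_b])
```

Each blended score sits between the individual model scores for that case, so disagreements between models are pulled toward the middle.
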