Chapter 15: Build Candidate Models
15.1 Starting the Process
Select target feature in feature list
Distribution of target feature is displayed
Select the orange down arrow to choose a different metric (although the program can be trusted to select a good measure)
Keep the recommended LogLoss (accuracy) optimization measure.
LogLoss (accuracy) means the model is evaluated based on the probabilities it generates and their distance from the correct answer
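To make the LogLoss idea concrete, here is a minimal sketch of the standard binary log loss formula. The function name log_loss and the example numbers are illustrative assumptions, not DataRobot's own code; the point is only that probabilities closer to the correct answer give a lower score.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss: penalizes probabilities that sit far from the true label."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical example data: four cases with known outcomes.
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # probabilities close to the correct answers
hesitant = [0.6, 0.4, 0.6, 0.4]   # probabilities closer to 0.5

print(log_loss(labels, confident))  # lower (better) score
print(log_loss(labels, hesitant))   # higher (worse) score
```

Lower values are better, which is why a model with well-calibrated, confident probabilities ranks above one that hedges near 0.5.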
15.2 Advanced Options
To see advanced options, click show advanced options.
Number of Cross Validation (CV) Folds
5 and 10 are common numbers to use for DataRobot
For small datasets, it may make sense to select more folds, because the more folds selected, the more cases there are in the training set. However, the validation sample (the first fold) will be less reliable if it contains only a small number of cases.
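The trade-off between fold count and sample sizes can be sketched with simple arithmetic. This assumes plain k-fold splitting of a hypothetical 1,000-case dataset and ignores any holdout sample; fold_sizes is an illustrative helper, not a DataRobot function.

```python
def fold_sizes(n_cases, n_folds):
    """Return (training cases, validation cases) for one fold of k-fold CV."""
    validation = n_cases // n_folds   # one fold is held out for validation
    training = n_cases - validation   # the remaining folds train the model
    return training, validation

for k in (5, 10, 20):
    train, valid = fold_sizes(1000, k)
    print(k, train, valid)
# 5 800 200
# 10 900 100
# 20 950 50
```

More folds enlarge the training set but shrink the validation fold, which is the reliability concern noted above.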
How to split the data
Date/Time: deals with a critical evaluation issue by making sure that all validation cases occur in a time period after the cases used to create the models
Partition Feature: a method for determining exactly which cases are used in the different folds. Partition Feature differs from the other approaches in that the user must do their own random or semi-random assignment of cases
Group: similar to Partition Feature, but it allows the specification of a group membership feature. DataRobot decides where each case is partitioned but always keeps each group together in only one partition.
Essentially the same tool, which pulls out the holdout sample for you. You can choose a holdout percentage.
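The group-based approach described above (every case from a group stays in a single partition) can be sketched as follows. This is an illustration under assumed names (group_partition, the patient ids), not how DataRobot implements it internally.

```python
import random

def group_partition(group_ids, n_folds, seed=0):
    """Assign a fold number to each case so that all cases sharing a group id land in one fold."""
    groups = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(groups)  # randomize which group goes to which fold
    fold_of_group = {g: i % n_folds for i, g in enumerate(groups)}
    return [fold_of_group[g] for g in group_ids]

# Hypothetical group membership feature: one patient id per case.
patients = ["a", "b", "a", "c", "b", "d", "c", "a"]
folds = group_partition(patients, n_folds=2)
print(list(zip(patients, folds)))
```

Because the fold is looked up by group id rather than by case, no patient's cases can end up split across training and validation.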
15.3 Starting the Analytical Process
Autopilot implements the standard DataRobot process and will likely lead to the best possible results.
Step 1 Setting Target Feature
Step 2 Creating CV and Holdout Positions
Step 3 Characterizing Target Variable
Step 4 Loading the Dataset and preparing data
Step 5 Saving target and partitioning info
Step 6 Analyzing features
Step 7 Calculating list of models
As soon as DataRobot completes these seven steps, a new column, Importance, is added to the feature list, providing the first evidence of the predictive value of specific features
The green bar in the Importance column indicates the relative importance of a particular feature when examined against the target independently of all other features.
The green bar tops out at a score of 0.3, so any full green bar suggests that the feature predicts 30% or more of the target.
15.4 Model Selection Process
15.4.1 Tournament Round 1: 16% Sample
Click on Models to see a leaderboard of the algorithms, ranked by validation score
Each combination of preprocessing steps before application of the algorithm and its parameters is called a “blueprint.”
15.4.2 Tournament Round 2: 32% Sample
15.4.3 Tournament Round 3: 64% Sample
15.4.4 Tournament Round 4: Cross validation
If the validation dataset is small (<= 10,000 cases), run the full cross validation (CV) process on the eight top models in the leaderboard
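The round 4 rule can be sketched as a small selection function. The leaderboard entries and the function name are hypothetical, lower LogLoss is assumed to rank higher, and since these notes do not describe what happens for larger validation sets, the sketch simply returns an empty list in that case.

```python
def models_for_cross_validation(leaderboard, n_validation_cases,
                                top_n=8, max_cases=10_000):
    """Pick the top models (by validation score) for full cross validation."""
    if n_validation_cases > max_cases:
        return []  # assumption: behavior for large validation sets is not covered here
    ranked = sorted(leaderboard, key=lambda entry: entry[1])  # lower LogLoss first
    return [name for name, _ in ranked[:top_n]]

# Hypothetical leaderboard: (blueprint name, validation LogLoss).
leaderboard = [(f"blueprint_{i}", 0.60 + 0.01 * i) for i in range(12)]
selected = models_for_cross_validation(leaderboard, n_validation_cases=2_000)
print(selected)
```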
15.4.5 Tournament Round 5: Blending
Once cross validation has finished running, the models are internally sorted by cross validation score before the best models are blended. To see the order in which the models will be selected, click the Cross Validation column.
Blending is done in two different ways in DataRobot.
Two average blenders (AVG Blender and Advanced AVG Blender) average the probability score of each model’s prediction for each case.
The other approach uses ENET Blenders, one blending the top three models and the other blending the top eight models.
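The averaging idea behind the average blenders can be sketched in a few lines: for each case, take the mean of the probabilities produced by the individual models. The function name and the three model outputs are made-up illustrations; the sketch does not cover what distinguishes the Advanced AVG Blender or how the ENET Blenders weight their models.

```python
def average_blend(model_probs):
    """Average the predicted probability for each case across all models."""
    n_models = len(model_probs)
    return [sum(case) / n_models for case in zip(*model_probs)]

# Hypothetical predicted probabilities from three models for three cases.
model_a = [0.9, 0.2, 0.6]
model_b = [0.7, 0.4, 0.5]
model_c = [0.8, 0.3, 0.1]

blended = average_blend([model_a, model_b, model_c])
print(blended)  # roughly 0.8, 0.3, 0.4 (up to floating-point rounding)
```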