Chapter 15: Build Candidate Model
Starting the Process
Select the target feature. The desired target
feature can be found directly in the feature list; hover over it, and then click the
“Use as Target” text.
Alternatively, the name of the feature can be typed into the target area. This field will auto-populate possible features based on text input.
Type until the intended target shows up and then select it. Once this selection is made, the top of the window changes.
Note that DataRobot offers a choice of which metric the produced models are
optimized for.
In general, DataRobot can be trusted to select a good
measure, but sometimes a customer or client has a preference for another measure
on which they will evaluate the performance of the resulting models, in which case,
consider using their measure.
LogLoss (Accuracy) simply means that rather than evaluating the model directly on
whether it assigns cases (rows) to the correct “label” (False and True), the model is
evaluated instead based on probabilities generated by the model and their distance
from the correct answer.
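To make the idea of "distance from the correct answer" concrete, here is a minimal sketch of binary log loss in plain Python. The function name `log_loss`, the clipping constant `eps`, and the example numbers are illustrative assumptions; this is the standard textbook formula, not necessarily the exact computation DataRobot performs internally.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss over all rows; lower is better."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so that log(0) can never occur.
        p = min(max(p, eps), 1 - eps)
        # Penalty grows as the predicted probability moves away from the label.
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Illustrative data: 1 = True, 0 = False.
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # probabilities close to the labels
hesitant = [0.6, 0.4, 0.6, 0.4]    # correct side of 0.5, but less certain

loss_confident = log_loss(labels, confident)
loss_hesitant = log_loss(labels, hesitant)
print(loss_confident, loss_hesitant)
```

Both sets of predictions would assign every row to the correct label at a 0.5 threshold, yet the probabilities that sit closer to the true labels receive the lower log loss.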
Click “Advanced Options”. A Partitioning Options window appears with settings for the partitioning method, how to run models (cross-validation or train-validation-holdout), the number of cross-validation folds, and the holdout percentage.
For small datasets, it may make sense to select more folds to leave the
maximum number of cases in the training set.
That said, be aware that selecting
more folds comes with drawbacks. The validation sample (the first fold) will be less
reliable if it contains a small set of cases.
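The trade-off can be seen with some simple arithmetic. The sketch below assumes that the holdout is removed first and the remaining rows are divided into equally sized folds, with one fold used for validation; the function name `partition_sizes` and the 1,000-row dataset are hypothetical, and DataRobot's exact allocation may differ.

```python
def partition_sizes(n_rows, n_folds, holdout_pct):
    """Approximate row counts per sample, assuming equally sized folds."""
    holdout = round(n_rows * holdout_pct / 100)
    remaining = n_rows - holdout
    validation = remaining // n_folds      # size of one (the first) fold
    training = remaining - validation      # all other folds combined
    return {"holdout": holdout, "validation": validation, "training": training}

# Compare 5 and 10 folds on a hypothetical 1,000-row dataset with 20% holdout.
for k in (5, 10):
    print(k, partition_sizes(1000, k, 20))
```

Under these assumptions, moving from 5 to 10 folds raises the training set from 640 to 720 rows but halves the validation fold from 160 to 80 rows.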
DataRobot makes very
consequential decisions based on the first validation fold. If the intent is to run
cross validation on the data, each additional fold leads to one more run-through of
Starting the Analytical Process
It is now time to start the analytical process, which will prepare the data through
the prescribed options: Autopilot, Quick, and Manual.
so, make sure to reset the options to where they were originally before this little
exploration: Stratified partitioning method with 5-fold cross validation and 20%
This is a method for determining exactly which cases are used in different folds.
Partition Feature is different from the other approaches in that the user must do their own random or semi-random assignment of cases. It is assumed that the allocation to train, validation, and holdout samples has been manually specified. To use all three samples, the selected feature must contain three distinct values.
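As an illustration of such a manual assignment, the following sketch builds a partition column before the data is uploaded. The labels "T", "V", and "H", the 60/20/20 proportions, and the function name `assign_partitions` are assumptions made for this example; any three distinct values in the selected feature would serve.

```python
import random

def assign_partitions(n_rows, train=0.6, validation=0.2, seed=42):
    """Return one label per row: "T" (train), "V" (validation), "H" (holdout)."""
    n_train = round(n_rows * train)
    n_val = round(n_rows * validation)
    n_holdout = n_rows - n_train - n_val   # whatever is left goes to holdout
    labels = ["T"] * n_train + ["V"] * n_val + ["H"] * n_holdout
    # Shuffle with a fixed seed so the assignment is random but reproducible.
    random.Random(seed).shuffle(labels)
    return labels

partition_column = assign_partitions(1000)
print(partition_column[:10])
```

The resulting list would be stored as an extra column in the dataset and then chosen as the partition feature.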
The Group option accomplishes much the same as the partition feature, but with some key differences.
First, it allows for the specification of a group membership feature. Second, DataRobot decides where each case is to be partitioned but always keeps each group together in only one partition. Finally, the Date/Time option deals with a critical evaluation issue: making sure that all validation cases occur in a time period after the time of the cases used to create models.
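The two ideas described above can be sketched in a few lines of plain Python. Both functions are hypothetical illustrations rather than DataRobot's implementation: `group_folds` assigns groups to folds in simple round-robin order of first appearance (an assumption; a real tool may balance or randomize differently), and `time_split` places the most recent rows in validation.

```python
def group_folds(groups, n_folds):
    """Assign a fold to each row so that a group never spans two folds."""
    fold_of_group = {}
    for g in groups:
        if g not in fold_of_group:
            # Round-robin over groups in order of first appearance.
            fold_of_group[g] = len(fold_of_group) % n_folds
    return [fold_of_group[g] for g in groups]

def time_split(timestamps, validation_fraction=0.2):
    """Return (train_idx, valid_idx); no validation row precedes a training row."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n_valid = max(1, round(len(order) * validation_fraction))
    cut = len(order) - n_valid
    return order[:cut], order[cut:]

# Hypothetical group membership feature (e.g., a customer ID per row).
groups = ["a", "b", "a", "c", "b", "a"]
folds = group_folds(groups, 2)

# ISO-formatted dates sort correctly as plain strings.
dates = ["2021-03-01", "2021-01-01", "2021-02-15", "2021-01-15", "2021-02-01"]
train_idx, valid_idx = time_split(dates)
print(folds, train_idx, valid_idx)
```

In the grouped case, every row belonging to group "a" lands in the same fold; in the time-based case, the latest date is the only row held out for validation.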
Think of it this way: a data scientist claims to have created a model that almost perfectly predicts the weather according to his cross validation score.