Chapter 15. Build Candidate Models
15.1 Starting the Process
The desired target feature can be found directly in the feature list; hover over it, and then click “Use as Target.”
DataRobot automatically deals with it by downsampling (randomly removing cases) the majority class (that which is most common).
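The downsampling idea can be sketched as follows. This is a minimal illustration with made-up labels, not DataRobot's internal code: majority-class cases are randomly dropped until the classes are balanced.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical imbalanced labels: 90 False (majority), 10 True (minority).
y = np.array([False] * 90 + [True] * 10)
idx = np.arange(len(y))

# Randomly remove majority-class cases until both classes are the same size.
minority_n = int((y == True).sum())
majority_idx = idx[y == False]
keep_majority = rng.choice(majority_idx, size=minority_n, replace=False)
balanced_idx = np.concatenate([keep_majority, idx[y == True]])

print((y[balanced_idx] == False).sum(), (y[balanced_idx] == True).sum())  # 10 10
```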
LogLoss (Accuracy) means that rather than evaluating the model directly on whether it assigns cases (rows) to the correct “label” (False or True), the model is evaluated based on the probabilities it generates and how far those probabilities are from the correct answer.
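The distinction can be seen with scikit-learn's `log_loss` (shown here as an illustration; the labels and probabilities are invented). Two models that assign every case to the correct label can still differ in LogLoss, because the measure rewards probabilities that sit closer to the truth:

```python
from sklearn.metrics import log_loss

# True labels and two models' predicted probabilities for the True class.
y_true = [0, 0, 1, 1]
confident_correct = [0.1, 0.2, 0.8, 0.9]  # probabilities close to the truth
hedging = [0.4, 0.4, 0.6, 0.6]            # same labels at a 0.5 cutoff, less certain

# Both models classify every case correctly, yet LogLoss prefers the
# model whose probabilities are closer to the correct answers.
print(log_loss(y_true, confident_correct) < log_loss(y_true, hedging))  # True
```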
15.2 Advanced Options (not recommended for first time use)
Random is identical in its approach to the approach discussed in Appendix C: 20% of the data is assigned to the holdout sample (the red part of Figure 15.4). This number can be changed to another percentage.
There are occasions where it makes sense to increase the holdout sample to more than 20%, such as when you have a lot of data, and you want to ensure that the holdout evaluation is as accurate as possible.
The only difference between Random and Stratified is that the Stratified option works a bit harder to maintain the same distribution of target values inside the holdout as in the other samples.
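The stratified version of a 20% holdout can be sketched with scikit-learn's `train_test_split` (a stand-in for DataRobot's partitioning, with an invented 75/25 target distribution): passing `stratify=y` makes the holdout mirror the overall class balance.

```python
from sklearn.model_selection import train_test_split

# Hypothetical target with a 75/25 class split across 100 cases.
y = [0] * 75 + [1] * 25
X = list(range(100))

# Stratified 20% holdout: the holdout preserves the 75/25 distribution.
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)
print(len(y_hold), sum(y_hold))  # 20 holdout cases, 5 of them positive
```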
It is assumed that the allocation to train, validation, and holdout samples has been manually specified. To use all three, three different values in your selected feature are required. If more than three unique values exist in the selected feature, only the train and validation folds can be selected, and all other values will be assigned to holdout. This option is helpful when analyzing graph (network) data where it is important to ensure that there is no overlap between friendship networks (groups of people directly connected to each other) in the train and validation sets.
The Date/Time option ensures that models’ training and validation data are appropriately split by time, so that models are never evaluated on cases that occurred before the cases they were trained on.
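An out-of-time split of this kind can be sketched with pandas (the dates here are invented; this is an illustration of the idea, not DataRobot's implementation): sort by date, then cut so that every validation case occurs after every training case.

```python
import pandas as pd

# Hypothetical time-stamped cases, sorted by date.
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "x": range(10),
}).sort_values("date")

# Out-of-time split: the most recent 20% of cases become validation,
# so no validation case precedes any training case.
cut = int(len(df) * 0.8)
train, valid = df.iloc[:cut], df.iloc[cut:]
print(train["date"].max() < valid["date"].min())  # True
```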
Once the lock-box has been filled with holdout cases and appropriately locked, the remaining cases are split into n folds. The (n)umber of folds can also be set manually. For small datasets, it may make sense to select more folds to leave the maximum number of cases in the training set. That said, be aware that selecting more folds comes with drawbacks. The validation sample (the first fold) will be less reliable if it contains a small set of cases. In addition, DataRobot makes very consequential decisions based on the first validation fold. If the intent is to run cross validation on the data, each additional fold adds one more full run of model creation.
Five and ten are the most commonly chosen numbers of folds, and since the DataRobot folks swear by five, keep the number of folds at 5 for this exercise.
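The mechanics of a five-fold split can be sketched with scikit-learn's `KFold` (standing in for DataRobot's partitioning, with an invented sample of 100 non-holdout cases): each fold serves once as the validation sample while the remaining four are used for training.

```python
from sklearn.model_selection import KFold

# 100 non-holdout cases split into 5 folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [(len(train_idx), len(valid_idx))
              for train_idx, valid_idx in kf.split(list(range(100)))]
print(fold_sizes)  # each fold: 80 training cases, 20 validation cases
```

Note the trade-off described above: with ten folds each validation sample would shrink to 10 cases, and cross validation would require ten model runs instead of five.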
Moving now to the Partition Feature option, this is a method for determining exactly which cases are used in different folds. Partition Feature is different from the other approaches in that the user must do their own random or semi-random assignment of cases.
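A user-specified partition feature can be sketched with pandas (the column name and values here are invented for illustration): the user places the assignment in a column, and the partitions are read directly from it rather than being assigned randomly.

```python
import pandas as pd

# Hypothetical dataset where the user pre-assigned each case to a partition
# via a "partition" column with exactly three distinct values.
df = pd.DataFrame({
    "x": range(6),
    "partition": ["train", "train", "train", "validation", "holdout", "holdout"],
})

train = df[df["partition"] == "train"]
valid = df[df["partition"] == "validation"]
holdout = df[df["partition"] == "holdout"]
print(len(train), len(valid), len(holdout))  # 3 1 2
```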
The Group approach accomplishes much the same as the Partition Feature, but with some key differences: first, it allows for the specification of a group-membership feature. Second, DataRobot makes decisions about where a case is to be partitioned but always keeps each group (cases with the same value in the selected feature) together in only one partition. Finally, the Date/Time option deals with a critical evaluation issue: making sure that all validation cases occur in a time period after that of the cases used to create models.
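The group-based behavior can be sketched with scikit-learn's `GroupKFold` (a stand-in for DataRobot's Group option, with invented group IDs playing the role of friendship networks): cases sharing a group ID never appear in both a training fold and its validation fold.

```python
from sklearn.model_selection import GroupKFold

# Hypothetical friendship-network IDs: each pair of cases forms one group.
groups = [1, 1, 2, 2, 3, 3]
X = list(range(6))

gkf = GroupKFold(n_splits=3)
overlaps = []
for train_idx, valid_idx in gkf.split(X, groups=groups):
    train_groups = {groups[i] for i in train_idx}
    valid_groups = {groups[i] for i in valid_idx}
    overlaps.append(train_groups & valid_groups)

print(overlaps)  # no group ever spans train and validation: [set(), set(), set()]
```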
15.3 Starting the Analytical Process
Autopilot and Quick are similar, except that for Autopilot DataRobot starts the analysis at 16% of the sample and uses that information to determine which models to run with 32% of the sample. Quick starts right at 32% with models that have historically performed well. One other difference is that in Quick, only four models are automatically cross validated and only one blender algorithm applied.
For the dataset we are using, the top models in Autopilot vs. Quick turn out to have near identical performance characteristics on the LogLoss measure.
As a short introduction, just know that Informative Features represents all the data features except for the ones automatically excluded and tagged, for example, as [Duplicate] or [Too Few Values].
Autopilot will implement the standard DataRobot process and likely lead to the best possible results.
Step 1, “Setting target feature,” transfers the user’s target feature decision into the analysis system.
Step 2, “Creating CV and Holdout partitions,” uses the decisions we made in the advanced settings (or the default if we did nothing) to randomly or semi-randomly assign cases to the holdout and various cross validation folds.
Step 3, “Characterizing target variable,” is where DataRobot will save the distribution of the target to the analysis system for later use in decisions about which models to run.
Step 4, “Loading dataset and preparing data,” is relevant only if the dataset is large (that is, over 500MB). In that case, all the initial evaluations before this step will have been conducted on a 500MB sample of the dataset, and the rest of the dataset is now loaded.
Step 5, “Saving target and partitioning information,” is where the actual partition assignments are stored: the cross validation folds and the holdout set are saved in separate files on disk.
In Step 6, importance scores are calculated (discussed in the next paragraph). The features have now been sorted by their importance in individually predicting the target.
Step 7, “Calculating list of models,” is where information from steps 3–6 is used to determine which blueprints to run in the autopilot process.
As soon as DataRobot completes these seven steps, a new column will be added to the feature list: Importance, providing first evidence of the predictive value of specific features.