Ch. 15 Build Candidate Models
15.1 Starting the Process
the desired target feature can be found in the feature list
DataRobot automatically deals with uneven distributions by downsampling (randomly removing cases from the majority class)
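A rough illustration of what downsampling does — a minimal pandas sketch with made-up toy data, not DataRobot's internal procedure:

```python
import pandas as pd

# Toy data with an uneven distribution: 8 negatives, 2 positives
df = pd.DataFrame({"x": range(10), "target": [0] * 8 + [1] * 2})

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Randomly remove majority-class cases until the classes are balanced
balanced = pd.concat([majority.sample(n=len(minority), random_state=42),
                      minority])
print(balanced["target"].value_counts())  # 2 cases of each class
```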
DataRobot offers a choice of which metric to optimize the produced models for
LogLoss simply means that rather than evaluating the model directly on whether it assigns cases to the correct label, the model is evaluated on the probabilities it generates and their distance from the correct answer
LogLoss would punish the model greatly for being confident and wrong
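A minimal sketch of binary LogLoss (the standard formula; the data here is made up), showing how one confident wrong probability dominates the score:

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the
    predicted probabilities (binary case)."""
    p = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1])
print(log_loss(y_true, np.array([0.8, 0.99, 0.7])))  # ~1.73: confident and wrong on case 2
print(log_loss(y_true, np.array([0.8, 0.40, 0.7])))  # ~0.36: hedged on case 2
```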
15.2 Advanced Options
Random and Stratified give identical options; both are based on holdout samples
Best to stick with 20% for the holdout percentage
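For reference, a 20% stratified holdout in scikit-learn (a sketch with synthetic data, not DataRobot's own partitioning code) looks like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)   # 80/20 class mix

# Stratified 20% holdout: class proportions are preserved in both parts;
# omit stratify=y for a purely random split instead
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
print(y_hold.mean())  # 0.2, matching the full dataset
```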
the number of folds can be set manually; for smaller datasets, it is better to use more folds to leave the maximum number of cases in the training set
First, it is a reminder of how the sample is distributed across the various partitions (five folds and a holdout)
Second, it specifies that the cross-validation scores reported by the system are the average of the scores for the number of models used (including the validation model)
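A short sketch of five-fold cross-validation with score averaging, using scikit-learn on synthetic data (illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# Five folds: each model trains on 4/5 of the data and is scored on the
# remaining fifth; the reported score is the average across the folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="neg_log_loss")
print(scores)          # one score per fold
print(scores.mean())   # the averaged cross-validation score
```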
Partition Feature is a method for determining exactly which cases are used in different folds
the Group approach accomplishes much the same as the Partition Feature but differs in that it allows for the specification of a group membership feature
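Group-based partitioning can be sketched with scikit-learn's GroupKFold (toy data; the group values standing in for something like customer IDs):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])  # e.g., customer IDs

# Every row sharing a group value lands in the same fold, so no group
# is split between training and validation
for train_idx, val_idx in GroupKFold(n_splits=4).split(X, y, groups):
    print(np.unique(groups[val_idx]))  # each fold validates on one group
```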
15.3 Starting the Analytical Process
Quick Run is an abbreviated version of Autopilot that produces models that are (on average) almost as good by shortcutting the DataRobot best-practice machine learning process
Autopilot and Quick are similar, except that in Autopilot, DataRobot starts the analysis with 16% of the sample and uses that information to determine which models to run with 32% of the sample
another difference is that in Quick, only four models are automatically cross-validated and only one blender algorithm is applied
Autopilot will implement the standard DataRobot process and likely lead to the best possible results. This process is explained in greater detail next
the green bar in the Importance column indicates the relative importance of a particular feature when examined against the target independently of all other features
the green bars for this project are weak, suggesting that the data is not particularly well suited to predicting the target
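DataRobot computes its Importance measure internally; a rough stand-in using one common univariate score (mutual information, on synthetic data) might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=2, random_state=0)

# Score every feature against the target on its own, ignoring all other
# features, then compare the scores -- analogous to the green bars
scores = mutual_info_classif(X, y, random_state=0)
for i, s in enumerate(scores):
    print(f"feature {i}: {s:.3f}")
```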
15.4 Model Selection Process
introduction of a sidebar showing several algorithms running on the Amazon Cloud environment
even though each different type of algorithm has very different run-times and processing needs, each is assigned here to a "worker"
if DataRobot gets ahead of your reading, preventing you from exploring beyond what the system is sharing with you, feel free to pause the algorithms by hitting the pause button next to the Worker number
while the algorithms are running on Amazon and completing one by one, they are reported and ranked on the leaderboard
the distinction between algorithms that perform well in the leaderboard and those that do not is important because DataRobot will use the information on which algorithms work well with 16% of the data to make decisions about which algorithms will be selected for further "participation"
each model number shows up only once in the whole leaderboard, serving as a unique identifier for a model
if two models are equally (or close to equally) good, then generally it is good to use the simplest model, a principle known in science as Occam’s Razor
two ENET Blenders are created, one blending the top three models and the other blending the top eight models
the three or eight predicted probabilities are treated as features, which are used along with the target value for the whole dataset to run another predictive exercise, in case some models are stronger at certain probability levels
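A sketch of this blending (stacking) idea with scikit-learn — the base models and data are placeholders, and an elastic-net-penalized logistic regression stands in for the "ENET" step; this is not DataRobot's actual blender:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=300, random_state=0)

# Out-of-fold probabilities from each base model become the features
# of the second-stage blender
base_models = [RandomForestClassifier(random_state=0),
               GradientBoostingClassifier(random_state=0),
               LogisticRegression(max_iter=1000)]
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models])

# Elastic-net-penalized blender fit on the stacked probabilities
blender = LogisticRegression(penalty="elasticnet", solver="saga",
                             l1_ratio=0.5, max_iter=5000)
blender.fit(meta_X, y)
print(blender.coef_)  # weight given to each base model's probabilities
```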