When considering the addition of more data, it is important to calculate the cost, assuming such data is available at all. If the available data is from an earlier date than the data currently owned, it is not clear that it will lead to improvements, as data gets “stale” over time. When in a position to consider adding data to a project, rely on the cross validation results rather than on the validation data used in this example. As the size of the dataset increases, cross validation becomes a better indicator.
EXAMPLE of CROSS VALIDATION: In this case, models will be evaluated against one another by dividing all numbers resulting from the cross validation process by five. This makes it possible to draw conclusions as if the total dataset were still 1,600 cases (even though cross validation works with 8,000 cases). Looking at the cross validation results, the first doubling of the data (from 16% to 32%) classifies 37 more patients correctly. This improvement is the net gain from avoiding false positives for 51 patients while misclassifying 14 patients who were previously labeled correctly as sick (readmits) but are now labeled as unlikely to be readmitted (false negatives). Doubling the data again (from 32% to 64%) produces a more worrisome outcome: the overall classification result is, in fact, worse by 16 patients, with 40 additional misclassifications against an improvement of 24 in properly detecting sick patients (readmits).
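To make the bookkeeping in the example above concrete, the short sketch below redoes its arithmetic in Python. The helper names and the 185-fold total are made up for illustration and are not part of DataRobot; the 51/14 and 24/40 counts come from the example itself.

```python
# Minimal sketch (not DataRobot code): the bookkeeping behind the cross
# validation example above. Helper names and the 185 total are illustrative.

N_FOLDS = 5  # 5-fold cross validation: 5 x 1,600 = 8,000 cases in total

def per_1600_cases(total_across_folds):
    """Divide a count summed across all folds by five, so results read as if
    the dataset were still a single 1,600-case sample."""
    return total_across_folds / N_FOLDS

def net_correct_change(newly_correct, newly_wrong):
    """Net change in correctly classified patients between two sample sizes."""
    return newly_correct - newly_wrong

print(per_1600_cases(185))         # a made-up total of 185 across folds -> 37.0
print(net_correct_change(51, 14))  # 16% -> 32%: +37 patients
print(net_correct_change(24, 40))  # 32% -> 64%: -16 patients
```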
16.2 Accuracy Tradeoffs
Now select the Speed vs. Accuracy tab in the sub-menu under Models (Leaderboard - Learning Curves - Speed vs. Accuracy - Model Comparison). This screen addresses an important question: how rapidly will the model evaluate new cases after being put into production? A model must be capable of producing predictions as rapidly as new cases arrive. An “Efficient Frontier” line has been added to the figure to illustrate which models are best. The Efficient Frontier is the line drawn between the dots closest to the X- and Y-axes; it typically appears when two criteria (here, speed and accuracy) are negatively related to each other, so that the best solution for one is usually not the best solution for the other. This line will not show on your screen because it has been added manually for this book to illustrate which models to pay attention to, both when learning and when using DataRobot in real-world applications.
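The idea behind the frontier can also be made concrete in code. The sketch below is not how DataRobot draws the line (the line in this book is added manually); it simply marks a model as being on the frontier when no other model is both faster and more accurate. All model names, timings, and accuracies are made up.

```python
# Illustrative sketch (not DataRobot functionality): which models sit on an
# efficient frontier when trading off prediction speed against accuracy.

models = {
    # name: (prediction_time_ms, accuracy) -- all values made up
    "XGBoost": (120.0, 0.86),
    "ElasticNet": (15.0, 0.81),
    "RandomForest": (150.0, 0.84),
    "Blender": (300.0, 0.87),
}

def on_frontier(name, models):
    t, acc = models[name]
    # A model is dominated if some other model is at least as fast AND at
    # least as accurate, and strictly better on one of the two criteria.
    return not any(
        ot <= t and oacc >= acc and (ot < t or oacc > acc)
        for other, (ot, oacc) in models.items()
        if other != name
    )

frontier = [name for name in models if on_frontier(name, models)]
print(frontier)  # ['XGBoost', 'ElasticNet', 'Blender'] with these made-up numbers
```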
16.3 Blueprints
ACCESSING BLUEPRINTS:
To examine the blueprints, start by clicking on the name of the XGBoost model currently ranked #5 on the leaderboard. This leads to the screen shown in Figure 16.4. As with the feature views, only one of these model views can be kept open at a time. Right under the blue line stating the name of the algorithm used to create this particular model is a sub-menu starting with “Blueprint.”
16.3.1 Imputation
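The details of this blueprint’s imputation step are not reproduced in this excerpt, but as a rough illustration of the general idea, the sketch below shows median imputation of missing numeric values. It uses scikit-learn rather than DataRobot, and the data values are made up.

```python
# Illustrative sketch only (scikit-learn, not DataRobot): median imputation,
# a common way a blueprint handles missing numeric values before modeling.
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up patient data with missing values (np.nan).
X = np.array([
    [34.0,   4.0],
    [np.nan, 2.0],
    [51.0,   np.nan],
    [47.0,   7.0],
])

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```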