Chapter 16. Understanding the Process

16.1 Learning Curves and Speed

Chapter Focus: Whether additional data cases are needed to increase the accuracy of the models produced, as well as the pre-processing that goes into each model, examined in order to learn more about how machine learning works.

There are two forms of additional data:

Additional Features

Additional Cases

The rule of diminishing marginal improvement: the greater the amount of relevant data at the outset of a project, the less likely it is that additional data will improve predictability.

The Learning Curves screen under the Leaderboard shows the validation scores on the Y-axis and the percentage of the available data used on the X-axis. Remember that, in this case, lower scores on the Y-axis are preferable because LogLoss is a 'loss' measure (every mistake in prediction increases the 'loss score').

When considering the addition of more data, it is important to calculate its cost, assuming such data is available at all. If the available data comes from an earlier date than the data currently owned, it is not clear that it will lead to improvements, as data gets "stale" over time. When in a position to consider adding data to a project, use the cross validation results rather than the validation results used in this example.

With loss measures, DataRobot sorts the best models to the bottom, so the curves generally stretch from the upper left to the lower right, with an "elbow" towards the bottom left.

Generally, it is safe to state that models will benefit from more data. While doubling the dataset from 3,200 to 6,400 cases did not lead to a linear improvement in LogLoss over the previous doubling, it nearly did so in the case of the Elastic-Net model.
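The shape of a learning curve can be sketched with a toy experiment (a hypothetical illustration of the idea, not DataRobot's internal procedure): train on growing fractions of the data and score LogLoss on a held-out validation set. Here the "model" is deliberately trivial, just the positive base rate observed in the training subset; the synthetic 40% readmission rate is an assumption for the sketch.

```python
import math
import random

def log_loss(y_true, p):
    """Mean LogLoss for a single constant predicted probability p (lower is better)."""
    eps = 1e-15
    p = min(max(p, eps), 1 - eps)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y in y_true) / len(y_true)

random.seed(42)
# Synthetic outcome with a 40% positive rate, standing in for readmissions.
data = [1 if random.random() < 0.4 else 0 for _ in range(8000)]
train, valid = data[:6400], data[6400:]

# Score validation LogLoss after training on 16%, 32%, and 64% of the dataset.
for pct in (16, 32, 64):
    n = int(len(data) * pct / 100)
    subset = train[:n]
    base_rate = sum(subset) / len(subset)  # the "model": predict the base rate
    print(pct, round(log_loss(valid, base_rate), 4))
```

With a real learner, each doubling typically improves the score by less than the previous one, which is what produces the flattening curve on the Learning Curves screen.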

16.2 Accuracy Tradeoffs

As the size of the dataset increases, cross validation becomes a better indicator.

EXAMPLE of CROSS VALIDATION: In this case, models will be evaluated against one another by dividing all numbers resulting from the cross validation process by five. This is done in order to draw conclusions as if the total dataset were still 1,600 cases (even though the cross validation is working with 8,000 cases). Looking at the cross validation results, the first doubling of data (16% to 32%) classifies 37 more patients correctly. Note that this improvement is the net gain from avoiding false positives for 51 patients while misclassifying 14 patients who were previously labeled as sick (readmits) but are now labeled as unlikely to be readmitted (false negatives). Doubling the data again (32% to 64%) produces a more worrisome outcome: the overall classification result is, in fact, worse by 16 patients, with 40 additional misclassifications offset by an improvement of 24 in properly detecting sick patients (readmits).
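The net changes described above are simple arithmetic, using the chapter's own patient counts:

```python
# First doubling (16% -> 32%): 51 fewer false positives,
# but 14 readmits now missed (false negatives).
first_doubling_net = 51 - 14
print(first_doubling_net)   # 37 more patients classified correctly (net)

# Second doubling (32% -> 64%): 24 more readmits caught,
# but 40 additional misclassifications elsewhere.
second_doubling_net = 24 - 40
print(second_doubling_net)  # -16: classification is worse by 16 patients (net)
```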

Now select the Speed vs. Accuracy tab in the sub-menu under Models (Leaderboard - Learning Curves - Speed vs. Accuracy - Model Comparison).

This screen addresses an important question related to how rapidly the model will evaluate new cases after being put into production.

An "Efficient Frontier" line has been added to illustrate which models are best. The Efficient Frontier line is drawn between the dots closest to the X- and Y-axes and typically appears when two criteria (speed and accuracy in this case) are negatively related to each other (the best solution for one is usually not the best solution for the other). This line will not show on your screen; it has been manually added for this book to illustrate which models to pay attention to when learning and when using DataRobot in real-world applications. A model must be capable of producing predictions as rapidly as new cases arrive.

Start by calculating the speed of the slowest model (assuming, as is often the case, that it is also the most accurate model) by converting the numbers into cases per second. Then compare this result with the predictive needs of the project at hand. If the slowest model produces results more rapidly than needed, ignore speed as a criterion in model selection. Speed must be evaluated against peak-time prediction needs.

Always look for the efficient frontier line. If time is a factor, follow the efficient frontier to the left until the most efficient model that is still reasonably accurate is found. Do note, however, that not all speed issues are unchangeable. In many cases, one can simply add more prediction servers to increase prediction throughput. Take as an example the prediction of whether a visitor to a website is one who might react positively to an ad. If this can be done in 0.0087 seconds (1 second divided by 115 predictions per second) and still leave time to bid for the right to place that ad in an online auction, then the model can be used. However, if the website receives 230 visitors per second, two prediction servers will be needed to keep up, or a faster model is needed.
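The arithmetic in this example can be checked directly (115 predictions per second and 230 visitors per second are the chapter's numbers):

```python
import math

predictions_per_second = 115  # throughput of the candidate model
seconds_per_prediction = 1 / predictions_per_second
print(round(seconds_per_prediction, 4))   # 0.0087 seconds per prediction

peak_visitors_per_second = 230
servers_needed = math.ceil(peak_visitors_per_second / predictions_per_second)
print(servers_needed)                     # 2 prediction servers at peak load
```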

16.3 Blueprints

ACCESSING BLUEPRINTS:
To examine the blueprints, start by clicking on the name of the XGBoost model currently ranked #5 in the leaderboard. This leads to the screen shown below in Figure 16.4. As with the feature views, it is only possible to keep one of these model views open at a time. Right under the blue line stating the name of the algorithm used to create this particular model is a sub-menu starting with "Blueprint."

The Blueprint pane shows the XGBoost model that did better than all other models except the blender models, which are built on top of this XGBoost model and two to seven other models.

16.3.1 Imputation

Blueprints are the only "secret sauce" component of DataRobot whose inner workings are not available to be examined more closely.

DataRobot may also either “one-hot-encode” categorical
features (as discussed in Chapter 10.1) or convert categorical features to ordinal features (ordered categoricals). Converting categorical features takes less processing than one-hot encoding and tends to perform as well or better. The process also most likely uses the most important information from the three text
features (diag_1_desc, diag_2_desc, and diag_3_desc) after the texts have been processed by the Auto-Tuned Word N-Gram Text Modeler.
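Both transformations can be sketched in plain Python (a simplified illustration of the ideas, not DataRobot's implementation): ordinal encoding maps each category to a single integer column, and a word n-gram modeler turns text such as diag_1_desc into counts of consecutive word sequences. The diagnosis string below is an invented example.

```python
from collections import Counter

def ordinal_encode(values):
    """Map each distinct category to one integer (one column, not many)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

codes, mapping = ordinal_encode(["Low", "High", "Medium", "High"])
print(codes)  # [1, 0, 2, 0] under alphabetical ordering of categories

def word_ngrams(text, n):
    """Count word n-grams: sequences of n consecutive words."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

bigrams = word_ngrams("Diabetes mellitus without mention of complication", 2)
print(bigrams[("diabetes", "mellitus")])  # 1
```

A real auto-tuned text modeler also weights and selects n-grams; this sketch only shows why a single diagnosis description can yield many candidate features.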

Impute missing values with median.

With "Impute missing values…" it is stated that the median value of the feature is used and that, for each feature imputed, another feature called an "indicator" is created.

The indicator feature will simply contain False if that row was not imputed and True if a given row contains a value that was imputed. This indicator variable allows the algorithm to look for predictive value in which patients are missing information for a given feature. For example, it is possible that patients whose age was not entered, perhaps due to their treatment regime or how they arrived at the hospital, provide unique information through this lack of data.
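The mechanism can be sketched in a few lines of plain Python (a simplified stand-in for DataRobot's imputation step; the age values are invented):

```python
from statistics import median

def impute_with_indicator(values):
    """Replace missing values (None) with the median of the observed values,
    and return an indicator column flagging which rows were imputed."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    indicator = [v is None for v in values]
    return imputed, indicator

ages = [54, None, 61, 70, None, 48]
imputed, was_missing = impute_with_indicator(ages)
print(imputed)       # [54, 57.5, 61, 70, 57.5, 48]
print(was_missing)   # [False, True, False, False, True, False]
```

The indicator column is what lets a downstream algorithm treat "age was missing" as a signal in its own right.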

16.3.2 Standardization

Now, click on the Standardize box to see that, after imputing missing values, the numeric features are all standardized.

When standardizing a numeric feature, it becomes clear why this step exists: some algorithms, such as Support Vector Machines and some linear models (including the one used in this blueprint), struggle with features that have different standard deviations. Each feature is therefore "scaled," which means that the mean value of the feature is set to zero and the standard deviation is set to "unit variance," which is a fancy way of saying 1.
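Scaling is simple enough to show directly (a minimal sketch using an invented three-value feature):

```python
from statistics import mean, pstdev

def standardize(values):
    """Scale a feature to mean 0 and standard deviation 1 (unit variance)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

scaled = standardize([10.0, 20.0, 30.0])
print([round(v, 3) for v in scaled])   # [-1.225, 0.0, 1.225]
```

After this transformation, every numeric feature contributes on the same scale, so no single feature dominates merely because its raw units happen to be large.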


16.3.3 One-Hot Encoding

With one-hot encoding, for any categorical feature that fulfills certain requirements, a new feature is created for every category that exists within the original feature. Figure 16.13 shows that if the original feature has only two values (for example, "yes" and "no"), it is turned into a single new feature with True and False values. This is necessary for many algorithms (termed "estimators") to function as designed.
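A minimal sketch of the idea (not DataRobot's implementation), including the special case where a two-category feature collapses to a single True/False column:

```python
def one_hot_encode(values):
    """Create one True/False column per category; a feature with only two
    categories collapses to a single True/False column."""
    categories = sorted(set(values))
    if len(categories) == 2:
        positive = categories[1]  # e.g. "yes" when the values are no/yes
        return {f"is_{positive}": [v == positive for v in values]}
    return {f"is_{c}": [v == c for v in values] for c in categories}

print(one_hot_encode(["yes", "no", "yes"]))  # one column: is_yes
print(one_hot_encode(["A", "B", "C"]))       # three columns: is_A, is_B, is_C
```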