Ch. 16 Understanding the Process
16.1 Learning Curves and Speed
It is important to understand whether more data would improve the models’ predictive ability
There are two forms of additional data: additional features and additional cases
When considering the addition of more data, it is important to calculate the cost, assuming such data is even available
If the available data is from an earlier period than the data currently owned, it is not clear that it will lead to improvements, as data gets “stale” over time
Looking at the cross-validation results, the first doubling of the data, from 16% to 32%, classifies 37 more patients correctly
It is worth noting that classification of human behavior as it relates to a disease as complex as diabetes is very difficult
Only by further studying the success measures and assigning a dollar value to each type of success will it become clear whether gathering more data is worthwhile
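DataRobot displays these learning curves in its interface; as a rough sketch of the same idea outside the platform, the snippet below plots cross-validated accuracy against the amount of training data used, with scikit-learn. The file name, column names, and the choice of a plain logistic regression are hypothetical placeholders, not taken from the book.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Hypothetical dataset; assumes the features are already numeric
# (see the blueprint sketch in 16.3 for the pre-processing steps)
df = pd.read_csv("diabetes_readmission.csv")
X, y = df.drop(columns=["readmitted"]), df["readmitted"]

# Evaluate the model at increasing fractions of the training data,
# including the 16% -> 32% doubling discussed above
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=[0.16, 0.32, 0.64, 1.0], cv=5, scoring="accuracy")

plt.plot(sizes, val_scores.mean(axis=1), marker="o")
plt.xlabel("Training cases used")
plt.ylabel("Cross-validated accuracy")
plt.title("Learning curve: does more data still help?")
plt.show()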
16.2 Accuracy Tradeoffs
The Efficient Frontier is the line drawn between the dots closest to the X- and Y-axes; such a frontier typically arises when two criteria (speed and accuracy in this case) are negatively related to each other
A model must be capable of producing predictions as rapidly as new cases are arriving
An “Efficient Frontier” line has been added to the speed-versus-accuracy plot to illustrate which models are best
If fast responses are not needed, it may be acceptable for the model not to keep up with peaks in prediction demand
If time is a factor, the efficient frontier will need to be followed to the left until the most time-efficient model that is still reasonably accurate is found
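As a minimal illustration of how “which models are best” can be decided programmatically, the sketch below keeps only models that no other model matches or beats on both speed and accuracy. The model names and numbers are invented for illustration and are not results from the book.

# Sketch: find the efficient frontier among (predictions/sec, accuracy) pairs.
# All names and numbers below are made up for illustration.
models = {
    "ENET Blender":           (40,   0.795),
    "Gradient Boosted Trees": (180,  0.790),
    "Single Neural Net":      (150,  0.760),   # dominated: slower and less accurate than GBT
    "Regularized Logistic":   (900,  0.770),
    "Decision Tree":          (1200, 0.730),
}

def efficient_frontier(models):
    """Keep models for which no other model is at least as fast AND at least as accurate."""
    frontier = []
    for name, (speed, acc) in models.items():
        dominated = any(
            s >= speed and a >= acc and (s, a) != (speed, acc)
            for s, a in models.values()
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(efficient_frontier(models))
# ['ENET Blender', 'Gradient Boosted Trees', 'Regularized Logistic', 'Decision Tree']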
If the website receives 230 visitors per second, either two prediction servers or a faster model will be needed to keep up
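A back-of-the-envelope check of that point: only the 230 visitors per second figure comes from the text; the per-server throughput below is an assumed number used purely for illustration.

import math

peak_visitors_per_sec = 230
predictions_per_server_per_sec = 120   # assumption, not a figure from the book

servers_needed = math.ceil(peak_visitors_per_sec / predictions_per_server_per_sec)
print(servers_needed)   # 2: either add a server or switch to a faster model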
16.3 Blueprints
Categorical features are one-hot encoded, and numerical features have their missing values imputed
Now that the model creation and scoring process has been covered, the model blueprints addressed at the start of this chapter can be understood more easily
This is the only “secret sauce” component of DataRobot whose inner workings cannot be examined more closely
The imputation in this model uses the median value for the feature. To see more details, click on the “Missing Values Imputed” box in the blueprint
Each of the models seen earlier employs a set of pre-processing steps unique to that type of model
Each feature is therefore “scaled,” which means that its mean is set to zero and it is rescaled to “unit variance,” a fancy way of saying a standard deviation of 1
By reading the first paragraph, starting with “Impute missing values...,” we learn that the median value of the feature is used and that, for each imputed feature, an additional feature called an “indicator” is created
After age is imputed and standardized, only the imputed and standardized version of the feature is used by the Regularized Logistic Regression algorithm
For any categorical feature that fulfills certain requirements, a new feature is created for every category within the original feature
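As a rough scikit-learn analogue of the steps just described (median imputation with an indicator, standardization of numeric features, one-hot encoding of categorical features, and a regularized logistic regression), the sketch below shows how the same pipeline might be assembled by hand. It illustrates the ideas only and is not DataRobot’s implementation; the column names are hypothetical placeholders.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "num_lab_procedures"]       # hypothetical columns
categorical_features = ["gender", "admission_type"]    # hypothetical columns

numeric_steps = Pipeline([
    # median imputation plus "indicator" columns flagging which values were missing
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    # standardization: mean set to zero, unit variance (standard deviation of 1)
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_features),
    # one new 0/1 feature per category in each qualifying categorical feature
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

blueprint = Pipeline([
    ("preprocess", preprocess),
    # "Regularized Logistic Regression": L2 regularization is scikit-learn's default
    ("model", LogisticRegression(penalty="l2", C=1.0, max_iter=1000)),
])

# Usage: blueprint.fit(X_train, y_train); blueprint.predict_proba(X_new)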
If other algorithms cannot outperform this relatively old and simple approach, this may be indicative of problems with either the data or the advanced models.
16.4 Hyperparameter Optimization
This is the step where AutoML provides one of its most important contributions to machine learning: hyperparameter optimization
Given that such parameter tuning can mean the difference between a successful project and a mediocre one, it is easy to see why companies like Google place such a premium on hiring the best-of-the-best data scientists
All this means is that the algorithm generates 2,500 trees, but in the end, it makes predictions with only 80 of those trees
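To make the “grow many trees, use only some of them” idea concrete, the sketch below grows a large gradient boosted model and then uses validation accuracy to decide how many trees to keep. This illustrates the concept with scikit-learn and is not DataRobot’s tuning procedure; X and y are assumed to be the prepared feature matrix and a binary (0/1) target.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# X, y assumed prepared elsewhere (see the blueprint sketch in 16.3); y is a 0/1 target
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

gbm = GradientBoostingClassifier(n_estimators=2500, learning_rate=0.05)
gbm.fit(X_train, y_train)

# staged_predict_proba yields validation predictions after 1, 2, ..., 2500 trees;
# keep the tree count with the best validation accuracy
val_acc = [np.mean((proba[:, 1] > 0.5) == y_val)
           for proba in gbm.staged_predict_proba(X_val)]
best_trees = int(np.argmax(val_acc)) + 1
print(best_trees)   # often far fewer trees than were grown (e.g., 80 out of 2,500)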