Understanding the Process
learning curves and speed
would more data improve the model's predictive performance?
two types of additional data
additional cases
additional features
rule of diminishing marginal improvement
learning curves screen
validation score on the y-axis, percentage of available data used on the x-axis
models listed from best (upper left) to worst (lower right); each line traces the same model at different percentages of the data
models generally perform better with more data: the best-performing model in the hospital case improved from the 32% sample to the 64% sample, but the gain was smaller than in earlier rounds
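a minimal sketch of how such a learning curve could be computed outside DataRobot, using scikit-learn on a synthetic dataset (the dataset, model, and sample-size fractions here are illustrative, not the hospital case):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# synthetic stand-in dataset (hypothetical, not the hospital data)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# score the same model at increasing fractions of the training data,
# mirroring the sample-size rounds shown on the learning curves screen
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=[0.16, 0.32, 0.64, 1.0],
    cv=5, scoring="accuracy",
)
val_mean = val_scores.mean(axis=1)  # one validation score per sample size
```

plotting `val_mean` against `sizes` gives one line of the learning-curves chart; the gap between consecutive points typically shrinks, which is the diminishing-marginal-improvement rule in the notes above.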
cross validation
as the size of the dataset increases, cross-validation becomes a better indicator of performance than a single validation partition
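the idea behind cross-validation (average the score over k held-out folds instead of trusting one validation split) can be sketched in plain Python; the toy `score_fn` below is a hypothetical stand-in for training and scoring a real model:

```python
import statistics

def kfold_scores(data, k, score_fn):
    """Split data into k folds; score each fold held out against the rest."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        holdout = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(score_fn(train, holdout))
    return scores

# toy scorer: "predict" the training mean, report negative mean absolute error
def score_fn(train, holdout):
    pred = statistics.mean(train)
    return -statistics.mean(abs(y - pred) for y in holdout)

data = list(range(20))
scores = kfold_scores(data, 5, score_fn)
cv_estimate = statistics.mean(scores)  # average over folds, not one lucky split
```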
accuracy tradeoffs
speed vs accuracy tab
how rapidly will the model evaluate new cases after being put into production?
the model must produce predictions at least as fast as new data arrives
y-axis lists the selected models; x-axis shows the time to score 2,000 cases
speed and accuracy are negatively correlated - the most accurate models are also often the slowest
only include speed in the model-selection criteria when the situation actually requires it
always look for the efficient frontier line: the models that no other model beats on both speed and accuracy
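one way to find that frontier: sort models by speed and keep each one that improves on the best accuracy seen so far; the model names and timings below are made up for illustration:

```python
def efficient_frontier(models):
    """models: list of (name, seconds_per_2000_cases, accuracy).
    Returns the models not dominated on both speed and accuracy."""
    frontier = []
    for name, secs, acc in sorted(models, key=lambda m: m[1]):  # fastest first
        # keep only if more accurate than every faster model already kept
        if not frontier or acc > frontier[-1][2]:
            frontier.append((name, secs, acc))
    return frontier

models = [
    ("elastic net",   1.0, 0.70),
    ("random forest", 2.0, 0.75),
    ("extra trees",   3.0, 0.74),  # dominated: slower and less accurate
    ("gbm blender",   4.0, 0.80),
]
frontier = efficient_frontier(models)
```

here "extra trees" falls off the frontier because "random forest" is both faster and more accurate.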
blueprints
datarobot applies some proprietary algorithms
datarobot either one-hot encodes or converts categorical features to ordinal features
shows the backbone of the model blueprint selected
imputation
details about how the model imputes can be found in the missing values imputed box on the blueprint
the median value of the feature fills the gaps; a second feature is created as an indicator
the indicator is true for rows whose values were imputed by datarobot
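this imputation pattern (fill with the median, keep a boolean flag of which rows were filled) is easy to reproduce in pandas; the column names here are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, np.nan, 58, np.nan, 41]})

# record the indicator BEFORE filling, so it marks the originally missing rows
df["age_imputed"] = df["age"].isna()
df["age"] = df["age"].fillna(df["age"].median())  # median of 34, 58, 41 is 41
```

the indicator lets downstream models learn whether "was missing" is itself predictive, which is why DataRobot keeps it as a separate feature.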
one-hot encoding
min_support identifies minimum number of rows needed to one-hot encode a categorical feature
card_min and card_max identify the minimum and maximum number of unique values (cardinality) a categorical feature may have to be encoded
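a rough sketch of min_support-style behavior in pandas; the threshold, column name, and "other" bucket below are assumptions used to illustrate the idea, not DataRobot's actual implementation, which may handle below-threshold levels differently:

```python
import pandas as pd

s = pd.Series(["NY", "NY", "NY", "LA", "LA", "SF"])
min_support = 2  # hypothetical threshold: a level needs >= 2 rows to get a column

counts = s.value_counts()                  # NY: 3, LA: 2, SF: 1
keep = counts[counts >= min_support].index
collapsed = s.where(s.isin(keep), "other")  # rare level SF folded into "other"
dummies = pd.get_dummies(collapsed, prefix="city")
```

collapsing rare levels keeps the encoded matrix from exploding in width on high-cardinality features, which is the same concern card_min/card_max address.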
hyperparameter optimization
where AutoML outperforms even the top data scientists in the world
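the search itself is conceptually simple, even if doing it well at scale is not; a bare-bones random search, with a hypothetical `validate` function standing in for training and scoring a real model:

```python
import random

random.seed(0)

def validate(params):
    # hypothetical scoring function: stands in for fitting a model and
    # measuring validation accuracy; peaks at depth=6, lr=0.1 by construction
    return 1 - abs(params["depth"] - 6) * 0.02 - abs(params["lr"] - 0.1)

best, best_score = None, float("-inf")
for _ in range(50):  # 50 random trials over the hyperparameter space
    params = {
        "depth": random.randint(2, 12),
        "lr": random.choice([0.01, 0.05, 0.1, 0.3]),
    }
    score = validate(params)
    if score > best_score:
        best, best_score = params, score
```

AutoML systems improve on this with smarter search strategies and far more trials than a person would run by hand, which is where the outperformance comes from.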