CHAPTER 16: UNDERSTANDING THE PROCESS
LEARNING CURVES & SPEED
additional data
features
cases
diminishing marginal improvement
the greater the amount of relevant data at the outset of a project, the less likely additional data will improve predictability
Learning Curves
validation scores on the Y-axis and the percent of the available data used on the X-axis
on the Y-axis lower scores are preferable because LogLoss is a ‘loss’ measure
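To illustrate why lower LogLoss is better, here is a minimal sketch of binary LogLoss in plain Python. The function name `log_loss`, the clipping constant `eps`, and the labels and probabilities are all illustrative assumptions, not values from the chapter.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary LogLoss; lower values indicate better predictions."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels and predicted probabilities (illustration only).
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]    # probabilities close to the true labels
hesitant = [0.6, 0.4, 0.55, 0.45]   # probabilities close to 0.5

good_score = log_loss(labels, confident)
weak_score = log_loss(labels, hesitant)
```

Under these assumptions `good_score` is smaller than `weak_score`, which is why the lower points on a learning curve are the preferable ones.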
3 data validation levels
16%
32%
64%
size of dataset increases
cross-validation is a better indicator
models are evaluated against one another on their average cross-validation score: the scores from the five cross-validation folds are summed and divided by five
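The averaging step can be sketched as follows. The model names and per-fold LogLoss values are hypothetical, and `mean_cv_score` is an illustrative helper rather than part of any particular tool.

```python
def mean_cv_score(fold_scores):
    """Average the per-fold scores (sum divided by the number of folds)."""
    return sum(fold_scores) / len(fold_scores)

# Hypothetical LogLoss per fold for two models (five folds each).
cv_scores = {
    "model_a": [0.61, 0.59, 0.60, 0.62, 0.58],
    "model_b": [0.64, 0.57, 0.66, 0.63, 0.65],
}

averaged = {name: mean_cv_score(scores) for name, scores in cv_scores.items()}

# LogLoss is a loss measure, so the model with the lowest average wins.
best_model = min(averaged, key=averaged.get)
```

Note that `model_b` has the single best fold (0.57) yet loses on the average, which is the point of comparing models on all five folds rather than one.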
ACCURACY TRADEOFFS
Efficient Frontier Line
line drawn between the dots closest to the X- and Y-axes; it usually appears when two criteria (speed and accuracy in this case) are negatively related to each other
speed vs. accuracy
evaluates models
slowest model is often most accurate
increase prediction speed
add more prediction servers
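One way to picture the efficient frontier is as the set of models that no other model beats on both speed and accuracy. The sketch below assumes each model is described by a prediction time and a LogLoss score, both of which are better when lower. The model names and numbers are hypothetical.

```python
def efficient_frontier(models):
    """Return names of models not dominated on (time, loss); lower is better for both."""
    frontier = []
    best_loss = float("inf")
    # Walk from fastest to slowest; a model joins the frontier only if it is
    # more accurate (lower loss) than every faster model seen so far.
    for name, time_ms, loss in sorted(models, key=lambda m: (m[1], m[2])):
        if loss < best_loss:
            frontier.append(name)
            best_loss = loss
    return frontier

# Hypothetical (name, prediction time in ms, LogLoss) triples.
models = [
    ("blender", 900, 0.590),
    ("xgboost", 300, 0.600),
    ("random_forest", 450, 0.615),
    ("logistic", 40, 0.640),
    ("knn", 700, 0.650),
]

frontier = efficient_frontier(models)
```

With these made-up numbers the frontier runs from the fastest model to the most accurate one, and the slowest model (the blender) is also the most accurate, matching the tradeoff described above.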
BLUEPRINTS
"shows the XGBoost model that did better than all other models with the exception of the blender models built on top of this XGBoost model and two to seven other models"
categorical features
one-hot encode
convert to ordinal features
IMPUTATION
numerical features have their missing values imputed
uses median value of the feature
FALSE: row was not imputed
TRUE: given row contains a value that was imputed
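The imputation step and its TRUE/FALSE indicator can be sketched like this. Missing values are represented as `None`, and the function name `impute_median` and the `ages` data are illustrative assumptions.

```python
import statistics

def impute_median(values):
    """Fill missing entries (None) with the median and flag which rows were filled."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    imputed = [median if v is None else v for v in values]
    was_imputed = [v is None for v in values]  # True where a value was imputed
    return imputed, was_imputed

# Hypothetical numeric feature with two missing values.
ages = [34, None, 51, 27, None, 45]
filled, flags = impute_median(ages)
```

Keeping the indicator as its own feature lets a model learn from the fact that a value was missing, not only from the value that replaced it.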
STANDARDIZATION
after imputing missing values, the numeric features are all standardized
each feature is "scaled": its mean is set to zero and its standard deviation is set to 1 ("unit variance")
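A minimal sketch of standardization, assuming the population standard deviation is used and the feature is not constant (a constant feature would give a standard deviation of zero). The function name and the sample values are illustrative.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mean) / std for v in values]

# Hypothetical feature values with mean 5 and standard deviation 2.
raw = [2, 4, 4, 4, 5, 5, 7, 9]
scaled = standardize(raw)
```

After scaling, each value expresses how many standard deviations the original sat above or below the feature's mean.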
ONE-HOT ENCODING
for any categorical feature that fulfills certain requirements, a new feature is created for every category that exists within the original feature
If a feature has only "true" and "false" values, it's turned into one new feature with "true/false" values
True - smallest category is removed
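The encoding can be sketched as below. The `drop_smallest` flag is a hypothetical parameter standing in for the "smallest category is removed" setting, and the `is_` column prefix and sample data are likewise assumptions.

```python
from collections import Counter

def one_hot(values, drop_smallest=False):
    """Create one 0/1 feature per category, optionally dropping the rarest one."""
    counts = Counter(values)
    categories = sorted(counts)
    if drop_smallest and len(categories) > 1:
        # Ties are broken alphabetically because `categories` is sorted.
        smallest = min(categories, key=lambda c: counts[c])
        categories.remove(smallest)
    return {f"is_{c}": [int(v == c) for v in values] for c in categories}

# Hypothetical categorical features.
colors = ["red", "blue", "red", "green", "red", "blue"]
full = one_hot(colors)
reduced = one_hot(colors, drop_smallest=True)

flags = ["true", "false", "true", "true"]
binary = one_hot(flags, drop_smallest=True)
```

Dropping one category from a two-valued feature leaves a single column, since the second column would carry no additional information.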
HYPERPARAMETER OPTIMIZATION
use of the algorithm itself to create the model