Chapter 16 - Understanding the Process
learning curves and speed
additional data
additional features
additional cases
work under the rule of diminishing marginal improvement: the greater the amount of relevant data at the outset of a project, the less likely it is that additional data will improve predictive performance
learning curves
shows validation scores on the y-axis and the percent of available data used on the x-axis
y-axis = lower scores preferred because LogLoss is a loss measure (every prediction mistake increases the loss score); see the LogLoss sketch after this list
generally models will benefit from more data
when considering adding data, calculate the cost of acquiring it if it is available
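A minimal sketch of how a binary LogLoss score can be computed; this is a generic illustration of the metric, not DataRobot's internal code, and the labels and probabilities are made up.

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Binary LogLoss: average negative log-likelihood of the true labels.

    y_true: 0/1 labels; y_pred: predicted probabilities of class 1.
    Lower is better; a confident wrong prediction is penalized heavily.
    """
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    y_true = np.asarray(y_true, dtype=float)
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

# A confident mistake (true label 1, predicted 0.05) raises the loss far more
# than a cautious but correct prediction.
print(log_loss([1, 0, 1], [0.9, 0.1, 0.8]))   # small loss
print(log_loss([1, 0, 1], [0.05, 0.1, 0.8]))  # much larger loss
```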
accuracy and tradeoffs
Speed vs Accuracy tab under Models
addresses the important question of how fast the model will score new cases after being put into production
the efficient frontier line is drawn between the dots closest to the x- and y-axes; it usually occurs when two criteria are negatively related to each other (the best solution for one is usually not the best solution for the other); see the frontier sketch after this list
to evaluate the best models, the tab shows the top 10 models by validation score
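A minimal sketch of how an efficient (Pareto) frontier could be identified from per-model speed and accuracy numbers; the model names, prediction times, and LogLoss values are invented for illustration and are not DataRobot output.

```python
# Hypothetical (model_name, prediction_time_ms, validation_logloss) tuples.
models = [
    ("ENET Blender",        120.0, 0.262),
    ("XGBoost",              45.0, 0.268),
    ("Random Forest",        60.0, 0.275),
    ("Light GBM",            20.0, 0.271),
    ("Regularized Logistic",  5.0, 0.290),
    ("Decision Tree",         3.0, 0.310),
]

def efficient_frontier(models):
    """Keep only models not dominated on both criteria.

    A model is dominated if another model is at least as fast AND at least
    as accurate (lower LogLoss), and strictly better on one of the two.
    """
    frontier = []
    for name, time, loss in models:
        dominated = any(
            (t <= time and l <= loss) and (t < time or l < loss)
            for n, t, l in models if n != name
        )
        if not dominated:
            frontier.append((name, time, loss))
    return frontier

# "Random Forest" is dominated by "XGBoost" (slower and less accurate),
# so it does not appear on the frontier.
for name, time, loss in efficient_frontier(models):
    print(f"{name}: {time} ms, LogLoss {loss}")
```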
blueprints
each of the models seen previously employs a different set of pre-processing steps unique to that type of model
DataRobot’s proprietary preprocessing algorithm for tree-based algorithms (version 20, suggesting that a number of different options exist) is applied to the dataset. This is the only “secret sauce” component of DataRobot whose inner workings are not available to be examined more closely. It is fair to assume that it imputes missing values either by replacing them with the column average or, as the DataRobot documentation suggests, by assigning an arbitrarily large negative number (for example, -9999) to all missing values, since tree-based algorithms tend to do better with this information (see the imputation sketch below)
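A minimal pandas sketch of the two imputation strategies described above (column mean vs. an arbitrary sentinel such as -9999); this is an assumption-based illustration, not DataRobot's actual preprocessing code, and the column names are made up.

```python
import pandas as pd

# Illustrative data frame with missing numeric values (not a DataRobot dataset).
df = pd.DataFrame({"age": [34, None, 51, 28],
                   "balance": [1200.0, 430.5, None, 89.9]})

# Strategy 1: mean imputation, common for linear models.
mean_imputed = df.fillna(df.mean(numeric_only=True))

# Strategy 2: sentinel imputation (e.g. -9999), so tree-based models can
# route "missing" into its own branch instead of blending it with real values.
sentinel_imputed = df.fillna(-9999)

print(mean_imputed)
print(sentinel_imputed)
```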
one-hot-encode
or ordinal features
standardization
the Standardize box appears after imputation, showing that once missing values are imputed, the numeric features are all standardized
what does it mean to standardize a numeric feature?
some algorithms, such as linear models, struggle with features that have different standard deviations; a minimal sketch follows below
scale values
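A minimal sketch of z-score standardization (subtract the column mean, divide by the column standard deviation), using scikit-learn's StandardScaler as one common implementation; this is illustrative, not DataRobot's internal step, and the numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values only).
X = np.array([[25,  40_000.0],
              [47, 120_000.0],
              [36,  65_000.0],
              [52,  98_000.0]])

# z-score standardization: (x - mean) / std, computed per column.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column
```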
one hot encoding
for any categorical feature that fulfills certain requirements, a new feature is created for every category that exists within the original feature (see the sketch below)
each new feature holds true/false values
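A minimal sketch of one-hot encoding with pandas, where each category becomes its own true/false column; the feature name and values are made up and this is not DataRobot's internal encoder.

```python
import pandas as pd

# Illustrative categorical feature (values are invented).
df = pd.DataFrame({"job": ["admin", "technician", "admin", "services"]})

# One new column per category; each row is True only for its own category.
one_hot = pd.get_dummies(df, columns=["job"], dtype=bool)
print(one_hot)
# Columns: job_admin, job_services, job_technician
```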
advanced tuning
see all the different parameters available for tuning