Ch. 16 Understanding the Process
Learning Curves and Speed
two forms of additional data
features
cases
could improve the model's predictive ability
rule of diminishing marginal improvement
the greater the amount of relevant data at the beginning of the project, the less likely it is that additional data will improve predictive ability
Learning Curves tab on DataRobot
shows validation scores on the y-axis
a lower score is preferable
LogLoss is a "loss" measure
each mistake the model makes increases the loss
shows the % of available data used on the x-axis
models measured at three levels (% of data)
16%
32%
64%
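The LogLoss measure noted above can be sketched in a few lines; this is a minimal illustration of the standard formula, not DataRobot's implementation, and the example predictions are made up:

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Average negative log-likelihood for binary labels.
    Confident mistakes add heavily to the loss; lower is better."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Mostly-correct, fairly confident predictions score close to 0
print(round(log_loss([1, 0, 1], [0.9, 0.1, 0.8]), 4))  # → 0.1446
```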
consider the cost of acquiring additional data when deciding whether to add it
instead use cross validation
cross validation becomes a better indicator as the size of the data set increases
models are evaluated against one another by summing the scores from the cross validation process and dividing by five
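The averaging step described above (dividing the combined five-fold result by five) amounts to taking the mean of the per-fold scores; the fold scores below are hypothetical:

```python
def cross_validation_score(fold_scores):
    """Combine per-fold validation scores into a single number
    by summing them and dividing by the number of folds."""
    return sum(fold_scores) / len(fold_scores)

# Hypothetical LogLoss scores from the five folds of one model
print(round(cross_validation_score([0.412, 0.398, 0.405, 0.421, 0.409]), 4))  # → 0.409
```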
Accuracy Tradeoffs
the Speed vs. Accuracy tab under "Models"
the Efficient Frontier line helps illustrate which models are best
drawn through the dots closest to the x- and y-axes
tradeoffs occur when speed and accuracy are negatively related
if speed isn't necessary, choose the most accurate model
ex. of a model that needs to be both fast and accurate: self-driving cars
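The efficient-frontier idea can be sketched as a dominance check: a model sits on the frontier if no other model is both faster and more accurate. The model names, speeds, and error scores below are hypothetical, and this is an illustration of the concept rather than DataRobot's charting logic:

```python
# Hypothetical (scoring_time_ms, validation_error) pairs for four models
models = {
    "A": (10, 0.45),
    "B": (25, 0.40),
    "C": (40, 0.41),
    "D": (60, 0.35),
}

def on_frontier(name):
    t, e = models[name]
    # On the frontier if no other model is strictly faster AND more accurate
    return not any(t2 < t and e2 < e for t2, e2 in models.values())

frontier = [m for m in models if on_frontier(m)]
print(frontier)  # → ['A', 'B', 'D'] — C is dominated by B (slower and less accurate)
```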
Blueprints
shows how one model performed compared to others
shows information about what the model did
imputation
missing values are replaced with the mean or median of the feature
standardization
numeric features are "scaled"
the mean value is set to 0 and the standard deviation is set to 1
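The scaling described above can be written out directly: subtract the mean, then divide by the standard deviation. A minimal sketch with made-up numbers:

```python
import statistics

def standardize(values):
    """Scale a numeric feature so its mean becomes 0 and its standard deviation becomes 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]

scaled = standardize([2.0, 4.0, 6.0])
print(scaled)  # middle value maps to exactly 0.0; outer values are symmetric around it
```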
one-hot encoding
dummy variable transformation of categorical features
ex: a feature with "yes" and "no" values becomes separate true/false indicator columns
categories with fewer than 10 rows are grouped into an "All Other" category
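The two steps above can be sketched together: rare categories are first folded into "All Other", then each remaining category gets its own 0/1 indicator column. The threshold parameter and helper name are illustrative, not DataRobot's actual API:

```python
def one_hot(values, min_rows=10):
    """Dummy-encode a categorical column. Categories appearing on fewer
    than min_rows rows are grouped into "All Other" before encoding."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    grouped = [v if counts[v] >= min_rows else "All Other" for v in values]
    # One 0/1 indicator column per remaining category
    return {c: [1 if g == c else 0 for g in grouped]
            for c in sorted(set(grouped))}

cols = one_hot(["yes"] * 12 + ["no"] * 11 + ["maybe"] * 2)
print(sorted(cols))  # → ['All Other', 'no', 'yes'] — "maybe" was too rare to keep
```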