Ch16: Understanding the Process
Will more data improve predictive ability?
Additional Data:
Additional Features or Additional Cases
Diminishing marginal improvement: the greater the amount of data available at the start, the less likely additional data will improve predictive ability
Learning Curves: validation score on Y axis and percent of available data on X axis
lower scores on the Y axis are preferable because LogLoss is a loss measure
calculate the cost of acquiring and using additional data
when adding data, use cross-validation results
performance is assessed at the 16%, 32%, and 64% sample-size levels
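The learning-curve idea above can be sketched in code: train on growing fractions of the data (16%, 32%, 64%, 100%) and track validation LogLoss at each level. The dataset and model here are illustrative stand-ins, not the book's exact setup.

```python
# Sketch of a learning curve: validation LogLoss vs. fraction of
# training data used. Lower LogLoss is better; improvements usually
# shrink as the fraction grows (diminishing marginal improvement).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

scores = {}
for frac in (0.16, 0.32, 0.64, 1.0):
    n = int(len(X_train) * frac)  # use only the first n training rows
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    scores[frac] = log_loss(y_val, model.predict_proba(X_val))
print(scores)
```

Plotting `scores` (fraction on the x-axis, LogLoss on the y-axis) reproduces the learning curve described above.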
Speed vs. Accuracy
how fast will the model evaluate (score) new cases
efficient frontier line: traced when two criteria are negatively related; a model on the line is not beaten on both criteria by any other model
y-axis = LogLoss score, x-axis = time to score more records
Blueprint pane shows the processing steps of the model that did better than the others
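The efficient frontier can be computed directly: keep every model that is not dominated, i.e. no other model is at least as good on both criteria and strictly better on one. The model names and (LogLoss, scoring-time) values below are made up for illustration.

```python
# Sketch: efficient frontier over two negatively related criteria,
# LogLoss and time to score (lower is better for both).
models = {
    "A": (0.45, 120.0),  # (LogLoss, ms per 1k records)
    "B": (0.40, 300.0),
    "C": (0.50, 100.0),
    "D": (0.42, 500.0),  # dominated by B (worse on both criteria)
}

def on_frontier(name):
    ll, t = models[name]
    # Dominated if some other model is <= on both criteria and < on one.
    return not any(
        (ll2 <= ll and t2 <= t) and (ll2 < ll or t2 < t)
        for other, (ll2, t2) in models.items() if other != name
    )

frontier = sorted(n for n in models if on_frontier(n))
print(frontier)  # → ['A', 'B', 'C']
```

Model D falls off the frontier because B is both more accurate and faster; A, B, and C each win on at least one criterion.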
convert categorical features to ordinal encoding rather than one-hot encoding (useful when a feature has many categories)
Imputation
blueprint shows method and missing values imputed
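A minimal sketch of an imputation step like the one a blueprint records, using scikit-learn's `SimpleImputer` with median imputation plus an indicator column marking which values were imputed (the choice of median and the indicator are illustrative assumptions).

```python
# Sketch: median imputation with a missing-value indicator column.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0], [5.0]])
imp = SimpleImputer(strategy="median", add_indicator=True)
X_out = imp.fit_transform(X)
# Column 0: values with NaN replaced by the median (3.0).
# Column 1: 1.0 where the value was imputed, else 0.0.
print(X_out)
```

The indicator column preserves the information that a value was originally missing, which can itself be predictive.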
standardization: set mean to 0 and standard deviation to 1 so all features are on the same scale
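Standardization as described above is one line with scikit-learn's `StandardScaler`, which subtracts each feature's mean and divides by its standard deviation.

```python
# Sketch: standardize a feature to mean 0 and standard deviation 1.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
Xs = StandardScaler().fit_transform(X)
print(Xs.mean(), Xs.std())  # → ~0.0 and 1.0
```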
One-hot encoding: for categorical variables that meet the requirements, a new 0/1 feature is created for every category in the original feature
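One-hot encoding can be sketched with pandas (the `color` column is a made-up example): each category of the original feature becomes its own indicator column.

```python
# Sketch: one-hot encode a categorical column with pandas.
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
encoded = pd.get_dummies(df, columns=["color"])
print(list(encoded.columns))  # → ['color_blue', 'color_green', 'color_red']
```

Three categories in `color` produce three indicator columns; exactly one of them is set per row.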
Advanced Tuning: see the model's parameters and change them
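Outside DataRobot, the same tune-the-parameters idea can be sketched with scikit-learn's `GridSearchCV`; the model and parameter grid below are illustrative choices, not the tool's defaults.

```python
# Sketch: try several hyperparameter settings via cross-validation
# and keep the combination with the best (neg) LogLoss.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="neg_log_loss",  # higher (closer to 0) is better
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

`best_params_` plays the role of the settings you would pick in the Advanced Tuning view after comparing cross-validation results.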