Understanding the process
curves and speed
additional data takes two forms - additional features and additional cases - the greater the amount of relevant data at the outset of a project, the less likely additional data is to improve predictability
learning curves
validation scores and % of available data used
lower scores are preferable - LogLoss is a loss measure, which means mistakes increase the loss score
examine the numbers for each model to understand how its performance is measured
three data levels - 16%, 32%, 64% of the training data
more data = more accurate
more data = higher cost (will the extra data be worth it in the long run?)
use cross-validation to evaluate whether the extra data is worth it (see the sketch below)
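A minimal sketch of these learning-curve notes, assuming scikit-learn and synthetic data rather than any particular modeling platform: learning_curve cross-validates a model at the three data levels and scores it with LogLoss.

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    # Synthetic stand-in data; any binary classification dataset works here.
    X, y = make_classification(n_samples=2000, random_state=0)

    # learning_curve runs cross-validation internally at each sample size,
    # so each point is a cross-validated estimate.
    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=[0.16, 0.32, 0.64],  # the three data levels above
        cv=5,
        scoring="neg_log_loss",  # sklearn negates losses so that higher = better
    )

    # Flip the sign back to LogLoss, where lower scores are preferable.
    val_logloss = -val_scores.mean(axis=1)
    plt.plot(sizes, val_logloss, marker="o")
    plt.xlabel("training cases used")
    plt.ylabel("validation LogLoss (lower is better)")
    plt.show()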
speed vs. accuracy tradeoffs - how rapidly the model will score new cases after being put into production
efficient frontier - when two criteria are traded off, the line connecting the dots closest to the x and y axes illustrates which models are best - no other model beats a frontier model on both criteria (see the sketch below)
calculate which model is most accurate and compare its predictions with reality
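A minimal sketch of the efficient-frontier idea, with made-up (speed, LogLoss) numbers for hypothetical models; a model sits on the frontier when no other model beats it on both criteria at once.

    # Hypothetical (seconds per 1,000 predictions, validation LogLoss) pairs;
    # lower is better on both axes.
    models = {
        "gbm": (0.90, 0.31),
        "random_forest": (0.95, 0.35),        # dominated by gbm: slower AND less accurate
        "logistic_regression": (0.05, 0.42),
        "deep_net": (1.40, 0.30),
        "baseline": (0.10, 0.69),             # dominated by logistic_regression
    }

    def dominates(b, a):
        """True if model b is at least as good as a on both criteria (and is not a itself)."""
        return b[0] <= a[0] and b[1] <= a[1] and b != a

    frontier = [
        name for name, score in models.items()
        if not any(dominates(other, score) for other in models.values())
    ]
    print("efficient frontier:", frontier)
    # -> efficient frontier: ['gbm', 'logistic_regression', 'deep_net']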
blueprints - show the preprocessing and modeling steps behind each model, helping explain why one model did better than the other models (a pipeline sketch follows these notes)
imputes missing values (see the median strategy below) rather than simply replacing them with an average
use one-hot encoding on categoricals or convert categorical features to ordinal features
imputation - uses the median value to replace missing values
standardization - numeric features are all standardized after imputing missing values
one-hot encoding - a new feature is created for every category that exists within the original feature, with true or false values
standardize - features are scaled so that the mean is set to 0 and the standard deviation is set to unit variance (1)
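A minimal sketch of these blueprint steps as a scikit-learn pipeline (not any platform's actual blueprint); the column names are hypothetical placeholders.

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Numeric columns: median imputation first, then standardization (mean 0, std 1).
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])

    # Categorical columns: one new true/false feature per category.
    categorical = OneHotEncoder(handle_unknown="ignore")

    preprocess = ColumnTransformer([
        ("num", numeric, ["age", "income"]),       # hypothetical numeric columns
        ("cat", categorical, ["region", "plan"]),  # hypothetical categorical columns
    ])

    model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
    # model.fit(train_df, train_labels) would then run the whole blueprint end to end.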
hyperparameter optimization
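A minimal sketch of hyperparameter optimization as a cross-validated grid search, again scored with LogLoss; the parameter grid is an illustrative assumption, not a recommended setting.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=1000, random_state=0)

    search = GridSearchCV(
        GradientBoostingClassifier(random_state=0),
        param_grid={"learning_rate": [0.03, 0.1, 0.3], "max_depth": [2, 3, 4]},
        scoring="neg_log_loss",  # cross-validated LogLoss; lower is better
        cv=5,
    )
    search.fit(X, y)
    print(search.best_params_, -search.best_score_)  # best settings and their LogLoss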