Ch 16: Understanding the Process
16.1: Learning curves and speed
deciding whether more data will improve predictive ability
2 kinds of data:
more features
more cases
adding data is subject to diminishing marginal improvement
important to consider cost when adding new data
use cross-validation results to decide (see the sketch below)
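As a minimal illustration of what "cross-validation results" means here, the following scikit-learn sketch uses a toy dataset and model chosen only for illustration (DataRobot computes these scores automatically):

```python
# Minimal sketch: cross-validation scores as the basis for the "more data" decision.
# Assumes scikit-learn; the dataset and model are placeholders, not the book's example.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Five-fold cross-validation: each fold is held out once for validation.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc")
print("Fold AUCs:", scores.round(3), "mean:", scores.mean().round(3))
```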
Learning Curves:
validation scores on the y-axis
percent of available data on x
always look for the efficient frontier line
if training time is a constraint, follow the frontier line toward the left to find the smallest data fraction that still gives acceptable accuracy
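A hedged sketch of plotting such a learning curve with scikit-learn's learning_curve utility (the dataset and model below are assumed placeholders, not DataRobot's implementation):

```python
# Learning-curve sketch: validation score on the y-axis,
# percent of available training data on the x-axis.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, _, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=5, scoring="roc_auc",
    train_sizes=np.linspace(0.1, 1.0, 5),
)

# Plot the mean cross-validated score against the percent of training data used.
plt.plot(train_sizes / train_sizes.max() * 100, val_scores.mean(axis=1), marker="o")
plt.xlabel("Percent of available training data")
plt.ylabel("Cross-validated AUC")
plt.title("Learning curve")
plt.show()
```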
speed issues:
can be addressed by adding more prediction servers
or by choosing a faster model
Blueprints:
if using DataRobot for analysis, you can probably skip this step
a preprocessing algorithm for tree-based algorithms is applied to the dataset
e.g., assigning an arbitrary large number to missing values so the trees can split them into their own branch
may also one-hot encode categorical features (sketched below)
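A rough pandas sketch of the preprocessing just described — a sentinel value for missing numbers plus one-hot encoding. The column names and sentinel value are invented for illustration; this is not DataRobot's actual blueprint code:

```python
# Fill missing numeric values with an arbitrary out-of-range number so tree splits
# can isolate them, then one-hot encode the categorical column.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [52000, np.nan, 61000, 48000],    # numeric with a missing value
    "region": ["east", "west", "west", "south"]  # categorical
})

SENTINEL = -9999  # arbitrary value a tree can split away from real incomes
df["income"] = df["income"].fillna(SENTINEL)

# One-hot encode the categorical feature into 0/1 indicator columns.
df = pd.get_dummies(df, columns=["region"])
print(df)
```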
Standardization
some algorithms struggle when features have very different standard deviations
each feature is therefore scaled, meaning its mean is shifted to zero
and its standard deviation is set to one (unit variance)
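A minimal standardization sketch using scikit-learn's StandardScaler, assumed here as a stand-in for whatever scaler a blueprint actually uses:

```python
# Standardization: center each feature to mean zero and scale to unit variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```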
One Hot Encoding
categorical levels (e.g., yes/no) are turned into separate true/false indicator columns
DataRobot can then provide insight
ex: dropping the last level of each feature, since it is implied by the other levels
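A sketch of one-hot encoding with one redundant level dropped per feature; scikit-learn's OneHotEncoder drops the first level rather than the last, but the idea is the same (illustrative only, with made-up column names, not DataRobot's internals):

```python
# One-hot encode categorical columns, dropping one indicator per feature
# because it is implied by the remaining columns.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"approved": ["yes", "no", "yes"],
                   "channel": ["web", "phone", "web"]})

encoder = OneHotEncoder(drop="first", sparse_output=False)  # "sparse=False" in older scikit-learn
encoded = encoder.fit_transform(df)
print(encoder.get_feature_names_out())
print(encoded)
```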