Intro To Machine Learning Workflow
Big picture
data
model
problem
setting constraints, goals, and boundaries
dataset knowledge
choosing the correct ML model/algorithm for the problem
Framing the problem
do we need ML
who are the key stakeholders
what data can I use
which metrics are important
who is going to be impacted
Metrics
should be easy to understand
adapted to the specific problem
business metrics != ml model metrics
confusion matrix
2x2 grid that displays the counts of:
TP - true positives
TN - true negatives
FP - false positives
FN - false negatives
classification metrics
Precision
TP/(TP + FP)
of the samples predicted as positive, the fraction that are actually positive
Recall
TP/(TP + FN)
of the samples that are actually positive, the fraction the model correctly finds
Accuracy
(TP + TN) / (TP + FN + FP + TN)
the fraction of all predictions that are correct
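A minimal sketch of these three formulas in Python, using made-up confusion-matrix counts:

```python
# Made-up confusion-matrix counts, purely for illustration.
tp, tn, fp, fn = 40, 30, 10, 20

precision = tp / (tp + fp)                    # of predicted positives, how many were right
recall = tp / (tp + fn)                       # of actual positives, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)    # of all predictions, how many were right

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```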
Object detection
green boxes are ground truths
red dashed boxes are the model's predictions
Of the elements classified as a particular class, how many did we get right?
The number of images classified correctly divided by the total number of images.
Only for classification
Uses precision and recall from classification
need to understand where the data came from
Exploratory Data Analysis
looking at the data before working with it (a small sketch follows this section)
object density
environment
weather
light
ML algorithms are sensitive to domain shift
this can happen at different levels
weather/light conditions
sensor
environment
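A minimal EDA sketch along these lines, assuming a hypothetical list of per-image annotations (the field names and values are made up):

```python
from collections import Counter

# Hypothetical per-image annotations; real values would come from the dataset's label files.
annotations = [
    {"image": "img_000.png", "num_objects": 12, "weather": "sunny", "light": "day"},
    {"image": "img_001.png", "num_objects": 3,  "weather": "rain",  "light": "night"},
    {"image": "img_002.png", "num_objects": 25, "weather": "sunny", "light": "day"},
]

# Object density: how many labeled objects appear per image.
counts = [a["num_objects"] for a in annotations]
print("objects per image: min", min(counts), "max", max(counts), "mean", sum(counts) / len(counts))

# Environment conditions: how the images are distributed over weather and light.
print("weather:", Counter(a["weather"] for a in annotations))
print("light:", Counter(a["light"] for a in annotations))
```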
Cross Validation
ensuring an ML algorithm can perform well in any environment
we should evaluate the generalization ability of the ML model
overfitting
when the model does not generalize well: it fits the training data exactly, typically because the chosen model is too complex
bias-variance tradeoff
why it is hard to create a balanced model, i.e. one that is good at predicting and handling new data
cross validation
technique to evaluate how well a model generalizes
Test Error = Variance + Bias² + ε
test error - error rate of the model on the test data set
variance - sensitivity to the training data
bias - quality of the fit
ε - irreducible error
bias decreases as the model becomes more complex
variance increases with complexity
test error follows a parabola: it has a minimum at some optimal complexity
low variance means the model is not overly sensitive to the particular training set, so it generalizes to new data more easily
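A minimal sketch of this trade-off, fitting polynomials of increasing degree to a made-up noisy 1-D dataset (all numbers are illustrative only):

```python
import numpy as np

# Made-up noisy data: y = sin(2*pi*x) + noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, x.shape)

# Simple hold-out split: every other point goes to validation.
train, val = np.arange(0, 60, 2), np.arange(1, 60, 2)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x[train], y[train], degree)  # model of this complexity
    mse_train = np.mean((np.polyval(coeffs, x[train]) - y[train]) ** 2)
    mse_val = np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2)
    # Low degree tends to underfit (high bias: both errors stay high); very high degree
    # tends to overfit (high variance: training error keeps dropping while validation
    # error stops improving).
    print(f"degree={degree} train_mse={mse_train:.3f} val_mse={mse_val:.3f}")
```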
Validation Set
include 80-90% of the data in the training set
include 10-20% of the data in the validation set
the validation set is used for cross-validation, to select the best parameters or to compare models
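As a sketch, assuming a hypothetical dataset of 1000 samples, an 80/20 split could look like this:

```python
import numpy as np

# Hypothetical dataset size; in practice this is the number of labeled samples.
num_samples = 1000

# Shuffle the indices so the split is random, then keep ~80% for training.
indices = np.random.default_rng(42).permutation(num_samples)
split = int(0.8 * num_samples)
train_idx, val_idx = indices[:split], indices[split:]

print(len(train_idx), "training samples,", len(val_idx), "validation samples")
```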
TFRecord
TensorFlow record
not human readable (binary format)
need to use the proto file (schema) to read the fields of a TFRecord
TensorFlow's custom data format
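A minimal sketch of reading one, assuming TensorFlow is installed; the file path and the field names in the feature description are made up and must match how the TFRecord was written:

```python
import tensorflow as tf

# Feature description (the schema from the proto): these field names are
# assumptions, not a fixed standard.
feature_description = {
    "image/encoded": tf.io.FixedLenFeature([], tf.string),
    "image/height": tf.io.FixedLenFeature([], tf.int64),
    "image/width": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    # Decode one serialized tf.train.Example into a dict of tensors.
    return tf.io.parse_single_example(serialized, feature_description)

# "data/sample.tfrecord" is a placeholder path.
dataset = tf.data.TFRecordDataset("data/sample.tfrecord").map(parse_example)
for example in dataset.take(1):
    print(example["image/height"].numpy(), example["image/width"].numpy())
```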