Intro To Machine Learning Workflow

Big picture

data - dataset knowledge

model - choosing the correct ML model/algorithm for the problem

problem - setting constraints, goals, and boundaries

Framing the problem

do we need ML

who are the key stakeholders

what data can I use

which metrics are important

who is going to be impacted

choosing metrics

should be easy to understand

adapted to the specific problem

business metrics != ML model metrics

confusion matrix

2x2 grid that displays the four prediction outcomes:

TP (true positives)

TN (true negatives)

FP (false positives)

FN (false negatives)

classification metrics

Precision

TP/(TP + FP)

of the samples predicted as positive, the fraction that are actually positive

Recall

TP/(TP + FN)

of the samples that are actually positive, the fraction the model correctly identifies

Accuracy

(TP + TN) / (TP + FN + FP + TN)

overall fraction of predictions the algorithm gets right
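
A minimal sketch (not from the notes) computing the three metrics above from raw confusion-matrix counts; the TP/TN/FP/FN values are made-up examples.

    # Hypothetical confusion-matrix counts.
    tp, tn, fp, fn = 90, 50, 10, 30

    precision = tp / (tp + fp)                  # of predicted positives, how many are correct
    recall = tp / (tp + fn)                     # of actual positives, how many were found
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # overall fraction of correct predictions

    print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")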

Object detection

green boxes are ground truths

red dashed boxes are the model's predictions

Of the elements classified as a particular class, how many did we get right? (precision)

The number of correctly classified images divided by the total number of images (accuracy).

Accuracy only applies to classification.

Object detection metrics build on the precision and recall defined above.
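
For detection, predicted boxes are usually matched to ground-truth boxes with an overlap measure before precision and recall are computed. The notes don't spell that step out, so this is an illustrative intersection-over-union (IoU) sketch with an assumed [xmin, ymin, xmax, ymax] box format.

    def iou(box_a, box_b):
        # Intersection rectangle between the two boxes.
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)

        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    # Ground truth (green) vs prediction (red dashed): overlap of 25 over a union of 175.
    print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # ~0.14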

need to understand where the data came from


Exploratory Data Analysis

looking at data before working with it

object density

environment

weather

light
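
A minimal EDA sketch along these lines, assuming a hypothetical annotations table with one labelled object per row; the file name and column names are made up for illustration.

    import pandas as pd

    annotations = pd.read_csv("annotations.csv")  # hypothetical annotation export

    # Object density: how many labelled objects per image?
    print(annotations.groupby("image_id").size().describe())

    # Distribution of capture conditions.
    print(annotations["weather"].value_counts())
    print(annotations["light"].value_counts())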

ML algorithms are sensitive to domain shift

this can happen at different levels

weather/light conditions

sensor

environment

Cross Validation

ensuring an ML algorithm can perform well in any environment

we should evaluate the generalization ability of the ML model

overfitting

bias-variance-tradeoff

cross validation

when the model does not generalize well

why it is hard to create a balanced model

technique to evaluate how well a model generalizes

fits the training data exactly

good at predicting on and handling new data

chosen model is too complex

Test Error = Variance + Bias² + epsilon (irreducible error)

error rate of the model on the test data set (see the expanded decomposition below)

variance - sensitivity to the training data; increases with model complexity

bias - quality of the fit; decreases as the model becomes more complex

test error - error rate on the test data; plotted against complexity it follows a parabola-like (U-shaped) curve, so there is an optimal complexity

low variance means the model generalizes to new data more reliably
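
Written out in full (a standard statement of the bias-variance decomposition; the epsilon term above is the irreducible noise sigma^2):

    E\big[(y - \hat{f}(x))^2\big] = \mathrm{Var}\big(\hat{f}(x)\big) + \mathrm{Bias}\big(\hat{f}(x)\big)^2 + \sigma^2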

Validation Set

include 80-90% of data in the training set

include 10-20% of data in the validation set

used for cross-validation to select the best parameters or compare models
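
A sketch of this split using scikit-learn (assumed to be available; the dataset here is synthetic): hold out 20% for validation and run k-fold cross-validation on the rest to compare settings.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)  # stand-in dataset

    # 80/20 train/validation split.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    # 5-fold cross-validation on the training portion to estimate generalization.
    model = LogisticRegression(max_iter=1000)
    print(cross_val_score(model, X_train, y_train, cv=5).mean())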

TFRecord

TensorFlow record

not human readable

need to use a proto file (protocol buffer definition) to read the fields of a TFRecord

TensorFlow's custom binary data format
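
A minimal TFRecord round trip with TensorFlow to illustrate the point above: the records are opaque bytes until parsed against a feature description (the role the proto/feature spec plays). The feature names here ("image_raw", "label") are made up for illustration.

    import tensorflow as tf

    # Write one serialized tf.train.Example into a TFRecord file.
    example = tf.train.Example(features=tf.train.Features(feature={
        "image_raw": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"\x00\x01"])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
    }))
    with tf.io.TFRecordWriter("sample.tfrecord") as writer:
        writer.write(example.SerializeToString())

    # Read it back by parsing each record against a feature description.
    feature_description = {
        "image_raw": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    for record in tf.data.TFRecordDataset("sample.tfrecord"):
        parsed = tf.io.parse_single_example(record, feature_description)
        print(parsed["label"].numpy())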
