Machine Learning
Process & Metric

CRISP-DM

1. Business Understanding

  • Business objectives
  • Assess situation
  • Data mining goals
  • Project plan

2. Data Understanding

  • Collect data
  • Describe data
  • Explore data
  • Verify data quality

3. Data Preparation

  • Select data
  • Clean data
  • Construct data
  • Integrate data
  • Format data

4. Modeling

  • Select modeling technique
  • Design the test
  • Build model
  • Assess model

5. Evaluation

  • Evaluate results
  • Review process
  • Determine next steps

6. Deployment

  • Plan deployment
  • Plan monitoring
  • Plan maintenance
  • Final report
  • Review project

Metrics

Training
Data

Accuracy/metric estimates on the training data are
not a good indicator of performance on future data

Measure the degree of overfitting/underfitting

Independent
Test Data

Used when we have plenty of data

Natural way of forming training & test data

Hold-Out
Method

Splits the data into training
data & test data

Build a classifier using the train data
and test it using the test data
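The hold-out split above can be sketched in plain Python (the function name `holdout_split`, the 30% test fraction, and the fixed seed are illustrative assumptions, not from the notes):

```python
import random

def holdout_split(data, test_fraction=0.3, seed=0):
    """Shuffle the data, then hold out a fraction of it as the test set."""
    rng = random.Random(seed)           # fixed seed for a reproducible split
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # remainder is used to build the classifier; held-out part to test it
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(list(range(10)), test_fraction=0.3)
```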

Stratification

Process of dividing members of population
into homogeneous subgroups before sampling

Strata should be mutually exclusive

  • Every element in the population must be
    assigned to only one stratum

Strata should be collectively exhaustive

  • No population element can be excluded
  • Simple random sampling/Systematic sampling is applied
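A minimal sketch of stratified sampling with simple random sampling inside each stratum (the helper name `stratified_sample` and the per-stratum rounding rule are assumptions for illustration):

```python
import random
from collections import defaultdict

def stratified_sample(labeled_data, fraction, seed=0):
    """Sample `fraction` of each label stratum via simple random sampling."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item, label in labeled_data:
        # mutually exclusive & collectively exhaustive: every element
        # lands in exactly one stratum, none are excluded
        strata[label].append((item, label))
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

data = [(i, "a") for i in range(10)] + [(i, "b") for i in range(20)]
sample = stratified_sample(data, fraction=0.5)
```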

k-Fold
Cross Validation

Avoids overlapping test sets: the data is
equally split into k subsets

Each subset in turn is used for testing,
the remainder for training
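The k-fold scheme can be sketched as follows (the generator name `k_fold_splits` and the round-robin fold assignment are illustrative assumptions):

```python
def k_fold_splits(data, k):
    """Yield (train, test) pairs: k disjoint test folds that together
    cover the whole data set, so test sets never overlap."""
    folds = [data[i::k] for i in range(k)]   # split data into k equal subsets
    for i in range(k):
        test = folds[i]                      # one subset for testing
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test                    # remainder for training

splits = list(k_fold_splits(list(range(6)), k=3))
```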

Leave-One-Out

Set number of folds to number of training instances

N-1 training instances, 1 test instance
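Leave-one-out is the extreme case above; a minimal sketch (function name assumed):

```python
def leave_one_out(data):
    """Each instance is the test instance exactly once;
    the remaining N-1 instances form the training set."""
    for i in range(len(data)):
        yield data[:i] + data[i + 1:], data[i]

splits = list(leave_one_out([10, 20, 30]))
```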

Bootstrap

Uses sampling with replacement to form the training set

Type of resampling
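Bootstrap resampling can be sketched like this (the function name and the use of out-of-bag items as a test set are illustrative assumptions):

```python
import random

def bootstrap_sample(data, seed=0):
    """Draw N items with replacement to form the training set;
    items never drawn (out-of-bag) can serve as a test set."""
    rng = random.Random(seed)
    n = len(data)
    train = [data[rng.randrange(n)] for _ in range(n)]  # with replacement
    oob = [x for x in data if x not in train]           # left-out items
    return train, oob

train, oob = bootstrap_sample(list(range(20)))
```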

Measurement

Confusion
Matrix

Precision = TP / (TP+FP)

Recall = TP / (TP+FN)

Error = (FP+FN) / (P+N)

Accuracy = (TP+TN) / (P+N)

FP Rate = FP/N
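The formulas above can be computed directly from confusion-matrix counts (the function name and the example counts are illustrative assumptions; P = TP+FN, N = FP+TN):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the standard metrics from confusion-matrix counts."""
    p, n = tp + fn, fp + tn           # actual positives / negatives
    return {
        "precision": tp / (tp + fp),
        "recall":    tp / (tp + fn),
        "error":     (fp + fn) / (p + n),
        "accuracy":  (tp + tn) / (p + n),
        "fp_rate":   fp / n,
    }

m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```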

F1-Score

F1 = 2 × Precision × Recall / (Precision + Recall)

Harmonic mean of precision and recall
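The harmonic mean can be sketched directly (function name assumed):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall:
    2 * P * R / (P + R)."""
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.5, 0.5)
```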