Machine Learning
Process & Metrics
CRISP-DM
1. Business Understanding
- Business objectives
- Assess situation
- Data mining goals
- Project plan
2. Data Understanding
- Collect data
- Describe data
- Explore data
- Verify data quality
3. Data Preparation
- Select data
- Clean data
- Construct data
- Integrate data
- Format data
4. Modeling
- Select modeling technique
- Design the test
- Build model
- Assess model
5. Evaluation
- Evaluate results
- Review process
- Determine next steps
6. Deployment
- Plan deployment
- Plan monitoring
- Plan maintenance
- Final report
- Review project
Metrics
Training Data
Accuracy/metric estimates on the training data are not
a good indicator of performance on future data
They measure the degree of overfitting/underfitting
Independent Test Data
Used when we have plenty of data
Natural way of forming training & test data
Hold-Out Method
Splits the data into training data & test data
Build a classifier using the training data
and test it using the test data
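The hold-out split above can be sketched in plain Python (the 80/20 split ratio, the toy dataset, and the `holdout_split` helper are assumptions for illustration):

```python
import random

def holdout_split(data, test_fraction=0.2, seed=42):
    """Shuffle the data, then hold out a fraction of it as the test set."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # First n_test items become the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(list(range(100)))
print(len(train), len(test))  # 80 20
```

The classifier would then be fit on `train` only and scored on `test`, which it never saw during training.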
Stratification
Process of dividing members of population
into homogeneous subgroups before sampling
Strata should be mutually exclusive
- Every element in the population must be
assigned to only one stratum
Strata should be collectively exhaustive
- No population element can be excluded
- Simple random sampling/Systematic sampling is applied
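A minimal sketch of stratified sampling, assuming a toy population of labelled items and a hypothetical `stratified_sample` helper; strata are mutually exclusive (each item keyed into exactly one group) and collectively exhaustive (no item skipped), with simple random sampling applied within each stratum:

```python
import random
from collections import defaultdict

def stratified_sample(items, key, fraction, seed=0):
    """Divide items into strata by `key`, then draw the same fraction
    from each stratum via simple random sampling."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)  # each item lands in exactly one stratum
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# 80 'a' items and 20 'b' items; a 10% stratified sample keeps the 4:1 ratio
population = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
sample = stratified_sample(population, key=lambda x: x[0], fraction=0.1)
```

Because sampling happens per stratum, the class proportions of the sample match the population, which plain random sampling only achieves in expectation.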
k-Fold Cross Validation
Avoids overlapping test sets: the data is
split into k equal subsets
Each subset is used in turn for testing,
the remainder for training
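The k-fold scheme can be sketched as an index generator (the `kfold_indices` name and the strided fold assignment are illustrative assumptions):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k non-overlapping folds.
    Each fold serves once as the test set; the rest form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]  # fold i: i, i+k, i+2k, ...
    for test_idx in folds:
        test_set = set(test_idx)
        train_idx = [j for j in range(n) if j not in test_set]
        yield train_idx, test_idx

# 5-fold CV over 10 data points: every point is tested exactly once
for train_idx, test_idx in kfold_indices(10, 5):
    pass  # fit on train_idx, evaluate on test_idx, average the k scores
```

Setting k equal to n turns this into leave-one-out cross validation.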
Leave-One-Out
Sets the number of folds to the number of training instances
Each fold: N-1 training instances, 1 test instance
Bootstrap
Uses sampling with replacement to form the training set
Type of resampling
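Bootstrap resampling can be sketched as follows (the `bootstrap_sample` helper and the use of out-of-bag items as a test set are illustrative assumptions):

```python
import random

def bootstrap_sample(data, seed=0):
    """Draw len(data) items with replacement to form the training set.
    Items never drawn ("out-of-bag") can serve as the test set."""
    rng = random.Random(seed)
    n = len(data)
    train = [data[rng.randrange(n)] for _ in range(n)]
    drawn = set(train)
    oob = [x for x in data if x not in drawn]
    return train, oob

train, oob = bootstrap_sample(list(range(1000)))
# On average ~63.2% of distinct items appear in train; ~36.8% are out-of-bag
```

Because sampling is with replacement, some items appear several times in `train` while others not at all, which is what distinguishes the bootstrap from the hold-out and k-fold schemes.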
Measurement
Confusion Matrix
Precision = TP / (TP+FP)
Recall = TP / (TP+FN)
Error = (FP+FN) / (P+N)
Accuracy = (TP+TN) / (P+N)
FP Rate = FP/N
F1-Score
Harmonic mean of precision and recall
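The formulas above can be collected into one small function (the `confusion_metrics` name and the example counts are assumptions for illustration; P = TP+FN and N = FP+TN are the actual positive and negative counts):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the standard metrics from confusion-matrix counts."""
    p, n = tp + fn, fp + tn          # actual positives, actual negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (p + n)
    error = (fp + fn) / (p + n)
    fp_rate = fp / n
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "accuracy": accuracy,
            "error": error, "fp_rate": fp_rate, "f1": f1}

# Example: 40 true positives, 10 false positives, 10 false negatives, 40 true negatives
m = confusion_metrics(tp=40, fp=10, fn=10, tn=40)
```

Note that accuracy = 1 - error, and that F1 balances precision against recall: a classifier that maximizes one at the expense of the other is penalized by the harmonic mean.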