Provost Ch. 7 - What is a Good Model?

Confusion Matrix

False Positive = negative instances classified as positive

False Negative = positive instances classified as negative
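
The four cells of a binary confusion matrix can be counted directly from paired labels. The sketch below is illustrative only: the function name `confusion_counts`, the 0/1 label encoding, and the sample data are assumptions, not taken from the chapter.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count the four confusion-matrix cells for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # positive instance classified as positive
        elif p == positive:
            fp += 1  # negative instance classified as positive
        elif a == positive:
            fn += 1  # positive instance classified as negative
        else:
            tn += 1  # negative instance classified as negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Made-up labels, 1 = positive and 0 = negative.
actual = [1, 0, 1, 1, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 0, 1, 1]
counts = confusion_counts(actual, predicted)
print(counts)
```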

Unbalanced Classes

Because the unusual class is rare among the general population, the class distribution is unbalanced or skewed.

As the class distribution becomes more skewed, evaluation based on accuracy breaks down.

Accuracy is the wrong thing to measure

Accuracy makes no distinction between false positive and false negative errors
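
A small sketch of why this matters: two hypothetical models (the counts below are invented for illustration) score the same accuracy on the same population even though one makes only false positive errors and the other only false negative errors.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all instances that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Same population in both cases: 40 positives, 60 negatives.
model_a = {"tp": 40, "fp": 10, "fn": 0, "tn": 50}   # errors are all false positives
model_b = {"tp": 30, "fp": 0, "fn": 10, "tn": 60}   # errors are all false negatives

acc_a = accuracy(**model_a)
acc_b = accuracy(**model_b)
print(acc_a, acc_b)  # identical, although the kinds of error differ
```

If a false negative costs far more than a false positive (or the reverse), these two models are not equally good, yet accuracy cannot tell them apart.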

Expected Value

The expected value framework decomposes data-analytic thinking into:

  1. the structure of the problem
  1. the elements of the analysis that can be extracted from the data
  1. the elements of the analysis that need to be acquired from other sources

Expected value is the weighted average of the values of the different possible outcomes, where the weight given to each value is its probability of occurrence
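
As a minimal sketch of that weighted average, the helper below sums probability times value over all outcomes. The name `expected_value`, the (probability, value) pair layout, and the offer numbers are assumptions made up for this example.

```python
def expected_value(outcomes):
    """Weighted average of outcome values, weighted by their probabilities.

    `outcomes` is a list of (probability, value) pairs covering every
    possible outcome, so the probabilities must sum to 1.
    """
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * v for p, v in outcomes)


# Hypothetical targeted offer: 5% of customers respond (worth 99),
# the other 95% do not (costing 1 for the mailing).
ev = expected_value([(0.05, 99.0), (0.95, -1.0)])
print(round(ev, 2))
```

A positive result suggests the offer is worth making on average; a negative one suggests it is not.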

Expected value can be used to compare models and determine which will work best

Cost and Benefit Matrix

Specifies the cost or benefit of making a decision for each (predicted class, actual class) pair

Costs and benefits cannot be estimated from the data - depend on external information

Measure expected value instead of accuracy
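
One way to combine the two matrices, sketched under stated assumptions: convert the confusion-matrix counts to rates and weight each rate by the matching cost-benefit entry. The counts and dollar values below are invented; in practice the cost-benefit entries come from business knowledge, not from the data.

```python
def expected_profit(counts, cost_benefit):
    """Expected profit per instance.

    counts:       confusion-matrix counts keyed by "TP", "FP", "FN", "TN"
    cost_benefit: value of each outcome under the same keys
                  (benefits positive, costs negative)
    """
    total = sum(counts.values())
    return sum(counts[cell] / total * cost_benefit[cell] for cell in counts)


# Hypothetical test-set results for one model.
counts = {"TP": 50, "FP": 10, "FN": 5, "TN": 35}
# Hypothetical values supplied from outside the data.
cost_benefit = {"TP": 99.0, "FP": -1.0, "FN": 0.0, "TN": 0.0}

profit = expected_profit(counts, cost_benefit)
print(round(profit, 2))
```

Running the same calculation for each candidate model, with the same cost-benefit matrix, gives a direct basis for comparison.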

It is important to consider carefully what would be a reasonable baseline against which to compare model performance

Classification tasks: good baseline = majority classifier (always predicts the majority class of the training dataset)
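
A majority classifier takes only a few lines. This is a sketch with made-up labels and a hypothetical helper name, `fit_majority`; it also shows how high baseline accuracy can be on skewed data.

```python
from collections import Counter


def fit_majority(train_labels):
    """Return the most frequent class in the training labels."""
    return Counter(train_labels).most_common(1)[0][0]


# Made-up, heavily skewed data: 95% negative in training.
train_labels = ["neg"] * 95 + ["pos"] * 5
test_labels = ["neg"] * 19 + ["pos"] * 1

majority = fit_majority(train_labels)
predictions = [majority] * len(test_labels)  # same answer for every instance

correct = sum(p == a for p, a in zip(predictions, test_labels))
baseline_accuracy = correct / len(test_labels)
print(majority, baseline_accuracy)
```

Any real model should beat this figure before its accuracy is considered meaningful, and here the baseline is already very high without identifying a single positive instance.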

Maximizing prediction accuracy is not always appropriate

Regression tasks: good baseline = predict the average value over the population
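
Reading the line above as the regression counterpart of the majority classifier (an assumption about the notes), a mean baseline can be sketched as follows. The target values and the choice of mean absolute error as the metric are illustrative only.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up numeric targets.
train_targets = [10.0, 20.0, 30.0, 40.0]
test_targets = [15.0, 35.0]

# The baseline ignores all features and always predicts the training mean.
baseline_prediction = sum(train_targets) / len(train_targets)
predictions = [baseline_prediction] * len(test_targets)

baseline_mae = mean_absolute_error(test_targets, predictions)
print(baseline_prediction, baseline_mae)
```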

There may be multiple simple averages that one might want to combine