Chapter 7
Classifiers: the true class is unknown, so estimate it
Positives are bad outcomes
Worthy of attention or alert
Rarer than negatives
Negatives are good outcomes
Uninteresting or benign
Accuracy is general measure of classifier performance
Accuracy = (Number of correct decisions made) / (Total number of decisions made)
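A minimal sketch of this calculation in Python; the labels below are hypothetical, not from the chapter:

```python
# Accuracy = correct decisions / total decisions (hypothetical labels)
y_true = ["p", "n", "n", "p", "n", "n", "n", "p", "n", "n"]
y_pred = ["p", "n", "n", "n", "n", "n", "p", "p", "n", "n"]

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 0.8 -> 8 of 10 decisions were correct
```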
Confusion Matrix
Separates decisions made by classifier
True classes are p and n
Predicted classes are Y and N
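A sketch of tallying the four cells, assuming true classes labeled p/n and predicted classes labeled Y/N as above (the example labels are made up):

```python
# Count true/false positives and negatives from paired labels
def confusion_matrix(y_true, y_pred):
    cells = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for true, pred in zip(y_true, y_pred):
        if pred == "Y":
            cells["TP" if true == "p" else "FP"] += 1
        else:
            cells["FN" if true == "p" else "TN"] += 1
    return cells

print(confusion_matrix(["p", "n", "p", "n"], ["Y", "Y", "N", "N"]))
# {'TP': 1, 'FP': 1, 'FN': 1, 'TN': 1}
```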
Unbalanced Classes
Look for rare positives
Class distribution is skewed
The more skewed the distribution, the more accuracy breaks down
Accuracy is the wrong thing to measure
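A made-up illustration of the breakdown: with 1 positive in 1,000 cases, a classifier that never says Y still looks excellent by accuracy:

```python
# Hypothetical skewed population: 1 rare positive among 1,000 cases
n_positives, n_negatives = 1, 999

# A classifier that always predicts N gets every negative right
# and misses the only positive, yet its accuracy looks excellent.
accuracy = n_negatives / (n_positives + n_negatives)
print(accuracy)  # 0.999
```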
Problems with accuracy for classification
Makes no distinction between false positives and false negatives
Assumes all errors are equally costly
Example: a false negative is a patient told they do not have cancer when they do; a false positive is a patient told they have cancer when they do not
The false negative is more serious
Generalizing beyond Classification
Data scientists estimate the number of stars for an unseen movie
The process uses root-mean-squared error (RMSE)
But root-mean-squared error of what?
Is it meaningful?
Is there a better metric?
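For concreteness, a sketch of root-mean-squared error on invented star ratings (the numbers are not from the book):

```python
import math

# Hypothetical true vs. predicted star ratings
true_stars = [4.0, 2.0, 5.0, 3.0]
pred_stars = [3.5, 2.5, 4.0, 3.0]

# Square each error, average, then take the square root
squared_errors = [(t - p) ** 2 for t, p in zip(true_stars, pred_stars)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(round(rmse, 3))  # 0.612
```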
Expected Value
Provides a framework useful for organizing thinking about data-analytic problems
Decomposes data-analytic thinking into:
The structure of the problem
The elements of the analysis that can be extracted from the data
The elements of the analysis that need to be acquired from other sources
Weights each possible outcome's value by its probability and sums them: EV = p(o1)*v(o1) + p(o2)*v(o2) + ... + p(on)*v(on)
p(o) is the probability of outcome o
v(o) is the value of outcome o
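A minimal sketch of the formula, using an invented two-outcome example (a responder worth $100 vs. a non-responder costing $1):

```python
# Expected value: weight each outcome's value by its probability and sum
# (outcome names, probabilities, and values are hypothetical)
outcomes = {
    "respond":     {"p": 0.05, "v": 100.0},
    "not_respond": {"p": 0.95, "v": -1.0},
}

expected_value = sum(o["p"] * o["v"] for o in outcomes.values())
print(round(expected_value, 2))  # 0.05*100 + 0.95*(-1) = 4.05
```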
Expected Value to Frame classifier evaluation
Confusion matrix can give probabilities (p(o))
Divide each cell count by the total number of instances
Next you need cost-benefit values
A matrix with the same structure as the confusion matrix, but holding costs and benefits instead of counts
Costs and benefits come from external sources
Multiply each cell's cost or benefit by its probability and sum the results to get the expected profit
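A sketch combining the two matrices into an expected profit; the cell counts and the cost-benefit values are invented for illustration:

```python
# Confusion-matrix counts (hypothetical)
counts = {"TP": 56, "FP": 7, "FN": 5, "TN": 42}
total = sum(counts.values())

# Cost-benefit matrix, same shape, values supplied from outside the data (hypothetical)
values = {"TP": 99.0, "FP": -1.0, "FN": 0.0, "TN": 0.0}

# Expected profit = sum over cells of p(cell) * value(cell)
expected_profit = sum((counts[cell] / total) * values[cell] for cell in counts)
print(round(expected_profit, 2))  # about 50.34 per instance
```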
Class priors specify the likelihood of seeing positive and negative instances
Be careful about double counting
Counting a benefit and a negative cost for the same thing
It is important to pick a baseline
This gives data scientists something to compare against
Can use an alternative model that is simple but not simplistic
One baseline could be a majority classifier
A classifier that always predicts the majority class in the training data set
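A minimal sketch of such a baseline (the training labels and the helper name fit_majority are made up here):

```python
from collections import Counter

# Majority classifier: always predict the most common class in the training labels
def fit_majority(train_labels):
    majority_class, _ = Counter(train_labels).most_common(1)[0]
    return lambda example: majority_class  # ignores the input entirely

train_labels = ["n", "n", "n", "p", "n", "n"]   # hypothetical, mostly negative
predict = fit_majority(train_labels)
print(predict({"any": "features"}))             # 'n' every time
```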
Maximizing prediction accuracy is not the goal