Chapter 17. Evaluate Model Performance
One of the AutoML criteria is understanding & learning, meaning that the AutoML should improve a user’s understanding of the problem by providing
visualization of the interactions between the features and the target.
FVE Binomial provides a sense of how much of the variance in the dataset has been explained and is equivalent to an R² value. In simpler terms, this metric states how far, percent-wise, the model has come toward fully explaining who will be readmitted (to turn an R² value into a percent, multiply it by 100).
17.2 A Sample Algorithm and Model
Many tree-based algorithms build on the logic of the decision-tree classifier, which is to repeatedly find the most predictive feature at that instance and split it into two groups that are as internally homogeneous as possible.
The many types of tree-based algorithms are often combinations of hundreds or thousands of decision trees.
DataRobot allows for the examination of models but does not allow for the examination of the algorithms that created them.
However, DataRobot does share the origin of algorithms and the parameters used therein.
To demonstrate how a decision tree classifier works with a minimum of complexity, a reduced version of the diabetes dataset has been extracted containing only three of the most important features in most of our other models: discharge_disposition_id, number_diagnoses, and number_inpatient.
The decision tree classifier works through the following steps to create this tree:
1. Find the most predictive feature (the one that best explains the target) and place it at the root of the tree.
2. Split the feature into two groups at the point where the two groups are as homogeneous as possible.
3. Repeat step 2 for each new branch (box).
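The split-finding step above can be sketched in a few lines of Python. This is a minimal illustration, not DataRobot's implementation: it assumes Gini impurity as the homogeneity measure (the text does not name one), and the rows, labels, and the helper names gini and best_split are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(rows, labels, feature_names):
    """Return (feature, cut, weighted_impurity) for the most homogeneous two-way split."""
    best = None
    n = len(rows)
    for j, name in enumerate(feature_names):
        values = sorted({row[j] for row in rows})
        # Candidate rule: feature < cut goes left, feature >= cut goes right.
        for cut in values[1:]:
            left = [y for row, y in zip(rows, labels) if row[j] < cut]
            right = [y for row, y in zip(rows, labels) if row[j] >= cut]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (name, cut, score)
    return best

# Hypothetical toy data: (number_diagnoses, number_inpatient) per patient.
feature_names = ["number_diagnoses", "number_inpatient"]
rows = [(5, 0), (9, 0), (7, 1), (9, 2), (4, 3), (8, 4)]
labels = [0, 0, 0, 1, 1, 1]  # 1 = readmitted, 0 = not readmitted

split = best_split(rows, labels, feature_names)
print(split)
```

In this toy data, splitting number_inpatient at 2 separates the two classes completely, so it would become the root of the tree. Step 3 would then call best_split again on each of the two resulting groups.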
17.3 ROC Curve
The Receiver Operating Characteristic (ROC) Curve screen, so named because of the ROC curve in the bottom left corner, is where several central measures of model success exist beyond the original optimization metric, LogLoss.
Generally, validation and cross validation scores should not differ wildly, given that DataRobot makes important decisions based on validation early in the AutoML process.
The prediction threshold is the probability value at which DataRobot changes a prediction from negative (no readmit) to positive (readmit). The threshold is shown with a score and a vertical line identifying the best cutoff to separate the two mountains.
DataRobot has determined a threshold that maximizes a common measure, the F1-score, which denotes model success in predicting the positive values (readmission). We return to the F1-score shortly, after clarifying a few further points.
With the threshold at its default position, cases with probabilities at or above .3048 will be classified as readmits, and those below .3048 will be classified as non-readmits.
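As a small illustration of how a threshold turns probabilities into class labels, consider the sketch below. The cutoff .3048 comes from the text; the probabilities and the function name classify are hypothetical.

```python
THRESHOLD = 0.3048  # default threshold reported in the text

def classify(probability, threshold=THRESHOLD):
    """Label a case as a readmit when its probability is at or above the threshold."""
    return "readmit" if probability >= threshold else "no readmit"

# Hypothetical predicted probabilities for four patients.
probabilities = [0.12, 0.3048, 0.29, 0.71]
predictions = [classify(p) for p in probabilities]
print(predictions)
```

Moving the threshold up or down changes which cases fall on each side, which is why every measure derived from the confusion matrix depends on where the threshold sits.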
Next, another key measure will be covered, positive predictive value (PPV). This measure is derived from the two rightmost quadrants and is more often called precision.
It is calculated as TP / (TP + FP), that is, the number of cases in the bottom-right quadrant divided by the number of cases in the two right quadrants.
The second most important measure is called True Positive Rate (TPR), or more commonly in medicine, Sensitivity.
It is possible to calculate the F1-score by taking the harmonic mean of Positive Predictive Value and True Positive Rate. The harmonic mean is calculated as 2TP / (2TP + FP + FN), but it can be hard to conceptualize.
Negative Predictive Value (NPV) is the mirror image of Positive Predictive Value.
All these measures are calculated based on the confusion matrix numbers as they exist at a single threshold, and as such must be considered static, or at least incomplete due to the fact that they do not provide any information about the model performance at multiple thresholds.
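To make the arithmetic concrete, the sketch below computes these measures from the four cells of a confusion matrix. The counts are hypothetical and are not taken from the readmission model. The formula for NPV, TN / (TN + FN), is the standard definition rather than one stated in the text.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute single-threshold measures from confusion matrix counts."""
    ppv = tp / (tp + fp)              # Positive Predictive Value (precision)
    tpr = tp / (tp + fn)              # True Positive Rate (sensitivity)
    npv = tn / (tn + fn)              # Negative Predictive Value
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of PPV and TPR
    return {"PPV": ppv, "TPR": tpr, "NPV": npv, "F1": f1}

# Hypothetical counts at one threshold.
metrics = confusion_metrics(tp=40, fp=60, tn=80, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# The F1-score equals the harmonic mean of PPV and TPR.
harmonic_mean = 2 * metrics["PPV"] * metrics["TPR"] / (metrics["PPV"] + metrics["TPR"])
```

With these counts, PPV is 0.4 and TPR is about 0.667, and both routes to the F1-score give 0.5, closer to the smaller of the two values than a simple average would be.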
ROC curves are considered dynamic because they evaluate the model performance at several prediction distribution thresholds.
A good model will tend to curve up toward the upper-left corner. If a model goes straight up to the top-left corner and then to the right at a right angle, it is often a sign of target leakage or that the problem being addressed is simply not very sophisticated.
A simple way to think about a good AUC score is having a low FPR and a high TPR at any given threshold.
The TPR is the proportion of true positives found, specifically, how many of the patients who actually show up at the hospital again within a month the model correctly predicted would return.
The False Positive Rate is how many of the negative cases are classified as positives, that is, the proportion of patients who are not going to return within a month but that we think will return.
The measure used to communicate a single number that represents model quality according to the ROC curve is the Area Under the Curve (AUC) value.
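The relationship between thresholds, the ROC curve, and AUC can be sketched as follows. This is an illustrative calculation under simple assumptions (one threshold per distinct score, area by the trapezoidal rule), not a description of how DataRobot computes the value. The labels, scores, and function names are hypothetical.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold over every distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical actual outcomes (1 = readmitted) and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(points)
print(round(auc, 4))
```

Each point on the curve is one confusion matrix, so the curve summarizes the static measures across all thresholds at once. A model that ranks every readmitted patient above every non-readmitted patient would reach an AUC of 1.0.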
17.4 Using the Lift Chart for Business Decisions
The lift chart is constructed by sorting all validation cases by their probability of readmission (or in this case, cross validation, which means every case that is not part of the holdout sample).
After being sorted, the cases are split into 10% bins, that is, bins of 800 cases.
The ideal scenario is having blue and orange lines that are fully overlapping, indicating a strong model.
This chart also shows which parts of the model are struggling the most.
Start by sorting by partition and removing any rows that indicate that they are in the holdout sample.
Then sort the remaining rows by the Cross Validation Prediction column with the highest scores on top.
To better understand the results in the lift chart, apply the Average formula on the first 800 cases in the Cross Validation Prediction column to find that the average is 0.67.
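The same sort-and-bin procedure can be sketched in Python. The example below uses 20 hypothetical cases and bins of 2 instead of the 8,000 cases and bins of 800 described above. The function name lift_table is an assumption for illustration, and the code assumes the number of cases divides evenly into the bins.

```python
def lift_table(predictions, actuals, n_bins=10):
    """Sort cases by prediction (highest first) and average each equal-sized bin."""
    pairs = sorted(zip(predictions, actuals), key=lambda pair: pair[0], reverse=True)
    size = len(pairs) // n_bins  # assumes the case count divides evenly
    table = []
    for b in range(n_bins):
        chunk = pairs[b * size:(b + 1) * size]
        avg_predicted = sum(p for p, _ in chunk) / size
        avg_actual = sum(a for _, a in chunk) / size
        table.append((avg_predicted, avg_actual))
    return table

# Hypothetical cross validation predictions and actual outcomes (1 = readmitted).
predictions = [i / 20 for i in range(20)]
actuals = [0] * 10 + [0, 1, 0, 1] + [1] * 6

table = lift_table(predictions, actuals)
for avg_predicted, avg_actual in table:
    print(f"predicted {avg_predicted:.3f}  actual {avg_actual:.3f}")
```

Each row corresponds to one point on the lift chart: the average prediction is one line and the average actual outcome is the other. Bins where the two values diverge show where the model is struggling.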