Week 5 Classification
Predictive analytics
Definition
AKA supervised learning
Trying to predict a well-defined outcome using existing data
Must specify one dependent variable and one or more independent variables
Dependent variable: outcome of interest, Y
Independent variables: attributes/features, X
Types
Classification: if the outcome to predict is binary or categorical
objective: identify the "class" to which a new observation belongs, based on data containing observations whose class variable is already known
Steps
Model construction (training): construct a classification model based on training data
1.1 randomly split the data into training, validation, and test sets (a split sketch follows step 1.2)
1.2 choose a classification model and train it on the training data
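A minimal sketch of step 1.1, assuming scikit-learn's train_test_split is available (the notes don't name a tool); two chained calls carve out the three sets, and the 60/20/20 proportions are illustrative.

import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative feature matrix X (10 observations, 2 attributes) and labels y
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# First split off a 20% test set, then split the remainder 75/25
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% = 20% of the original data, i.e. a 60/20/20 split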
k-NN (k-nearest neighbors): for a given record, identify the "nearest" (most similar) observations
use distance metrics seen in clustering (e.g. Euclidean) to determine whether an observation is "near" another
k-NN classifies the observation according to the predominant class (outcome variable) among the K identified nearest neighbors
must decide k: how many neighbors we want to consider
Procedure
pick K, how many neighbors we want to consider
pick a distance measure; determine whether you need to normalize the data
after standardization, compute the pairwise Euclidean distance between the new observation and the existing ones
pick the K nearest neighbors (for K = 1, the point with the shortest distance) and assign the predominant class among them (see the sketch after the trade-offs)
trade-offs
pros: easy to use, does not require statistical assumptions
cons: known as the 'lazy' classifier because there is no real model building (the labeled data in our dataset serves as the model); performs poorly on high-dimensional datasets; not well suited to real-time prediction
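A from-scratch sketch of the k-NN procedure above (standardize, compute Euclidean distances, take the predominant class among the K nearest); the toy data and K = 3 are made up for illustration.

import numpy as np
from collections import Counter

# Toy training data: two attributes on very different scales, two classes
X_train = np.array([[1.0, 20.0], [1.2, 22.0], [3.0, 80.0], [3.2, 78.0]])
y_train = np.array(["A", "A", "B", "B"])
x_new = np.array([2.9, 75.0])

# Standardize so both attributes contribute comparably to the distance
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
Xs, xs = (X_train - mu) / sigma, (x_new - mu) / sigma

# Euclidean distance from the new observation to every training point
dists = np.sqrt(((Xs - xs) ** 2).sum(axis=1))

# Predominant class among the K nearest neighbors
K = 3
nearest = np.argsort(dists)[:K]
print(Counter(y_train[nearest]).most_common(1)[0][0])  # -> "B"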
Decision tree: combines numeric and binary data; classification starts from the top node and goes down to the leaf nodes
terminology
Root node/root: top of the tree
internal/decision nodes: nodes that come after the root node but are not the last nodes
branches: arrows connecting nodes showing direction of tree flow
leaf nodes/leaves: the last nodes; each contains a class label
purity: measured through the concept of entropy, entropy = -sum(p_i * log2(p_i)) over the class proportions p_i; the lower the entropy, the purer the data partition, i.e., the more the observations in that partition share the same value of the outcome variable
information gain: decides how to partition the data and which attribute should be the root; information gain = entropy(parent node) - weighted entropy(child nodes); a worked sketch follows the trade-offs
stopping criteria: when to stop splitting nodes and create a leaf node
when all data points in a node are from the same class, or there are no remaining attributes to further split the data
trade-offs
pros: good for variable selection; does not require assumptions about the data; robust to outliers
cons: unstable, since slight changes in the data can produce a very different tree; can miss interesting relationships between predictors
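A worked sketch of entropy and information gain for one candidate split; the parent/child label counts are invented for illustration.

import numpy as np

def entropy(labels):
    # Shannon entropy: -sum(p_i * log2(p_i)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

parent = ["yes"] * 5 + ["no"] * 5   # entropy = 1.0, maximally impure
left   = ["yes"] * 4 + ["no"] * 1   # purer partition after the split
right  = ["yes"] * 1 + ["no"] * 4   # purer partition after the split

# information gain = entropy(parent node) - weighted entropy(child nodes)
n = len(parent)
gain = entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
print(round(gain, 3))  # 0.278: the split reduces impurity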
Model validation: refine classification model on validation data (sometimes skipped)
Model testing: measure accuracy of final model using test data
3.1 assess "how accurate" the model is by testing it on the test data
evaluate the model, and detect an unbalanced dataset, through the confusion matrix and the metrics derived from it
error rate: how many did the model get wrong?
accuracy: how many did the model get right?
recall: for each actual class, how many were recovered?
sensitivity: recall for the positive class
specificity: recall for the negative class (true negative rate)
precision: for each predicted class, how many did the model get right?
F1-score: use if we care about both recall and precision; the harmonic mean of precision and recall, F1 = 2 * (precision * recall) / (precision + recall); preferred over accuracy when the data is unbalanced
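A sketch computing the metrics above from hypothetical confusion-matrix counts; the numbers are invented, and deliberately unbalanced so that accuracy looks good while F1 does not.

# Hypothetical counts for the positive class: 50 positives vs 950 negatives
TP, FP, FN, TN = 20, 10, 30, 940

total = TP + FP + FN + TN
accuracy = (TP + TN) / total                         # how many it got right
error_rate = 1 - accuracy                            # how many it got wrong
recall = TP / (TP + FN)                              # sensitivity
specificity = TN / (TN + FP)                         # recall of the negatives
precision = TP / (TP + FP)                           # correct among predicted positives
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
# accuracy=0.96 but f1=0.50: on unbalanced data, F1 is the more honest score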
3.2 apply the model to classify new data if it is "good enough"
Numerical prediction/regression: if the outcome to predict is numerical (continuous)
methods
k-NN
Linear regression
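A minimal sketch of both numerical-prediction methods on made-up data, assuming scikit-learn (the notes don't name a library).

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Toy data: one attribute x, continuous outcome y
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

lin = LinearRegression().fit(X, y)
knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)

x_new = np.array([[3.5]])
print(lin.predict(x_new))  # ~7.0, read off the fitted line
print(knn.predict(x_new))  # 7.15, the mean outcome of the 2 nearest neighbors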