Week 5 Classification
Predictive analytics
Definition
AKA supervised learning
Trying to predict a well-defined outcome using existing data
Must specify one dependent variable and one or more independent variables
Dependent variable: outcome of interest, Y
Independent variables: attributes/features, X
Types
Classification: if the outcome to predict is binary or categorical
objective: identify the "class" to which a new observation belongs, based on data containing observations whose class variable is already known
Steps
Model construction (training): construct a classification model based on training data
1.1 randomly split the data into training, validation, and test sets (a split sketch follows step 1.2)
1.2 choose a classification model and train it on the training data
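A minimal sketch of step 1.1, assuming scikit-learn's train_test_split is available (the notes don't name a tool); two chained calls carve out the three sets, and the 60/20/20 proportions are illustrative.

import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative feature matrix X (10 observations, 2 attributes) and labels y
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# First split off a 20% test set, then split the remainder 75/25
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% = 20% of the original data, i.e. a 60/20/20 split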
k-NN (k-nearest neighbors): for a given record, identify the "nearest" (most similar) observations
use distance metrics seen in clustering (e.g. Euclidean) to determine whether an observation is "near" another
k-NN classifies the observation according to the predominant class (outcome variable) among the K identified nearest neighbors
must decide k: how many neighbors we want to consider
Procedure
pick K, how many neighbors we want to consider
pick a distance measure; determine whether you need to normalize the data
after standardization, compute the pairwise Euclidean distance between the new observation and the existing ones
pick the K nearest neighbors (for K = 1, the point with the shortest distance) and assign the predominant class among them (see the sketch after the trade-offs)
trade-offs
pros: easy to use, does not require statistical assumptions
cons: known as the 'lazy' classifier because there is no real model building (the labeled data in our dataset serves as the model); performs poorly on high-dimensional datasets; not well suited to real-time prediction
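A from-scratch sketch of the k-NN procedure above (standardize, compute Euclidean distances, take the predominant class among the K nearest); the toy data and K = 3 are made up for illustration.

import numpy as np
from collections import Counter

# Toy training data: two attributes on very different scales, two classes
X_train = np.array([[1.0, 20.0], [1.2, 22.0], [3.0, 80.0], [3.2, 78.0]])
y_train = np.array(["A", "A", "B", "B"])
x_new = np.array([2.9, 75.0])

# Standardize so both attributes contribute comparably to the distance
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
Xs, xs = (X_train - mu) / sigma, (x_new - mu) / sigma

# Euclidean distance from the new observation to every training point
dists = np.sqrt(((Xs - xs) ** 2).sum(axis=1))

# Predominant class among the K nearest neighbors
K = 3
nearest = np.argsort(dists)[:K]
print(Counter(y_train[nearest]).most_common(1)[0][0])  # -> "B"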
Decision tree: combines numeric and binary data; classification starts from the top node and goes down to the leaf nodes
terminology
Root node/root: top of the tree
internal/decision nodes: nodes that come after the root node but are not the last nodes
branches: arrows connecting nodes showing direction of tree flow
leaf nodes/leaves: the last nodes; each contains a class label
purity: measured through the concept of entropy, entropy = -sum(p_i * log2(p_i)) over the class proportions p_i; the lower the entropy, the purer the data partition, i.e., the more the observations in that partition share the same value of the outcome variable
information gain: decides how to partition the data and which attribute should be the root; information gain = entropy(parent node) - weighted entropy(child nodes); a worked sketch follows the trade-offs
stopping criteria: when to stop splitting nodes and create a leaf node
when all data points in a node are from the same class, or there are no remaining attributes to further split the data
trade-offs
pros: good for variable selection; does not require assumptions about the data; robust to outliers
cons: unstable, since slight changes in the data can produce a very different tree; can miss interesting relationships between predictors
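A worked sketch of entropy and information gain for one candidate split; the parent/child label counts are invented for illustration.

import numpy as np

def entropy(labels):
    # Shannon entropy: -sum(p_i * log2(p_i)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

parent = ["yes"] * 5 + ["no"] * 5   # entropy = 1.0, maximally impure
left   = ["yes"] * 4 + ["no"] * 1   # purer partition after the split
right  = ["yes"] * 1 + ["no"] * 4   # purer partition after the split

# information gain = entropy(parent node) - weighted entropy(child nodes)
n = len(parent)
gain = entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
print(round(gain, 3))  # 0.278: the split reduces impurity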
Model validation: refine classification model on validation data (sometimes skipped)
Model testing: measure accuracy of final model using test data
3.1 assess "how accurate" the model is by testing it on the test data
evaluate the model, and detect an unbalanced dataset, through the confusion matrix and the metrics derived from it
error rate: how many did the model get wrong?
accuracy: how many did the model get right?
recall: for each actual class, how many were recovered?
sensitivity: recall for the positive class
specificity: recall for the negative class (true negative rate)
precision: for each predicted class, how many did the model get right?
F1-score: use if we care about both recall and precision; the harmonic mean of precision and recall, F1 = 2 * (precision * recall) / (precision + recall); preferred over accuracy when the data is unbalanced
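A sketch computing the metrics above from hypothetical confusion-matrix counts; the numbers are invented, and deliberately unbalanced so that accuracy looks good while F1 does not.

# Hypothetical counts for the positive class: 50 positives vs 950 negatives
TP, FP, FN, TN = 20, 10, 30, 940

total = TP + FP + FN + TN
accuracy = (TP + TN) / total                         # how many it got right
error_rate = 1 - accuracy                            # how many it got wrong
recall = TP / (TP + FN)                              # sensitivity
specificity = TN / (TN + FP)                         # recall of the negatives
precision = TP / (TP + FP)                           # correct among predicted positives
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
# accuracy=0.96 but f1=0.50: on unbalanced data, F1 is the more honest score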
3.2 apply the model to classify new data if it is "good enough"
Numerical prediction/regression: if the outcome to predict is numerical (continuous)
methods
k-NN
Linear regression
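A minimal sketch of both numerical-prediction methods on made-up data, assuming scikit-learn (the notes don't name a library).

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Toy data: one attribute x, continuous outcome y
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

lin = LinearRegression().fit(X, y)
knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)

x_new = np.array([[3.5]])
print(lin.predict(x_new))  # ~7.0, read off the fitted line
print(knn.predict(x_new))  # 7.15, the mean outcome of the 2 nearest neighbors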