5 Classification: Alternative Techniques
5.1 Rule-Based Classifier
Concepts
rule set:
\( R = (r_1 \lor r_2 \lor \dots \lor r_k) \)
rule antecedent (or precondition) -> rule consequent:
\( r_i: (\text{Condition}_i) \rightarrow y_i \)
\( \text{Condition}_i = (A_1 \ op \ v_1) \land (A_2 \ op \ v_2) \land \dots \land (A_k \ op \ v_k) \)
op is a logical operator (=, !=, <, >, <=, >=)
conjunct: \( (A_i \ op \ v_i) \)
Coverage(r) = |A| / |D|
Accuracy(r) = |A ∩ y| / |A|
|A|: the number of records that satisfy the rule antecedent (i.e. are triggered by r);
|D|: the total number of records;
|A ∩ y|: the number of records triggered by r whose class label equals the consequent y;
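A minimal sketch of these two measures; the record layout, rule encoding, and all names below are illustrative, not from the text:
```python
# Minimal sketch: coverage and accuracy of a single rule.
# A rule is modelled as (antecedent, consequent), where antecedent is a
# predicate over a record (dict) and consequent is the predicted class label.

def coverage(rule, records):
    antecedent, _ = rule
    covered = [r for r in records if antecedent(r)]               # the set A
    return len(covered) / len(records)                            # |A| / |D|

def accuracy(rule, records, label_key="class"):
    antecedent, consequent = rule
    covered = [r for r in records if antecedent(r)]               # the set A
    correct = [r for r in covered if r[label_key] == consequent]  # A ∩ y
    return len(correct) / len(covered) if covered else 0.0        # |A ∩ y| / |A|

# Toy example: rule "refund = No AND status = Married -> No".
data = [
    {"refund": "No",  "status": "Married", "class": "No"},
    {"refund": "No",  "status": "Single",  "class": "Yes"},
    {"refund": "Yes", "status": "Married", "class": "No"},
]
rule = (lambda r: r["refund"] == "No" and r["status"] == "Married", "No")
print(coverage(rule, data), accuracy(rule, data))   # 0.33..., 1.0
```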
Two Key Properties
Exhaustive Rules
Every record should be covered by (trigger) at least one rule.
default rule
A default rule, with an empty precondition, assigns any record that fails all other rules to a default class.
Mutually Exclusive Rules
Each record triggers at most one rule.
Ordered Rules
Rule-Based Ordering Scheme
Class-Based Ordering Scheme
commonly used.
Unordered Rules
5.1.3 How to Build a Rule-Based Classifier
Direct Methods for Rule Extraction
from data
Sequential Covering Algorithm
Rule-Growing Strategy
general-to-specific
specific-to-general
Rule Evaluation
metrics should consider both accuracy and coverage;
Likelihood ratio statistic: prunes rules with poor coverage;
Laplace, m-estimate: take coverage into account in the evaluation;
FOIL's information gain: takes the support count into account;
Rule Pruning
to reduce the generalization error.
Rationale for Sequential Covering
After a rule is selected during rule growing, removing the records it covers may negatively affect the selection of subsequent rules.
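A minimal sketch of the sequential-covering loop, with a greedy general-to-specific Learn-One-Rule step scored by accuracy; all helper names are illustrative:
```python
# Minimal sketch of sequential covering: learn rules for one class at a time,
# removing the records covered by each accepted rule before learning the next.

def learn_one_rule(records, target, label_key="class"):
    """Greedy general-to-specific growth: keep adding the equality conjunct
    that most improves accuracy until no conjunct helps."""
    conjuncts = {}
    while True:
        best = None
        for attr in records[0]:
            if attr == label_key or attr in conjuncts:
                continue
            for value in {r[attr] for r in records}:
                cand = dict(conjuncts, **{attr: value})
                covered = [r for r in records
                           if all(r[a] == v for a, v in cand.items())]
                if not covered:
                    continue
                acc = sum(r[label_key] == target for r in covered) / len(covered)
                if best is None or acc > best[0]:
                    best = (acc, cand)
        if best is None or (conjuncts and best[0] <= current_acc):
            break                                   # no further improvement
        current_acc, conjuncts = best
        if current_acc == 1.0:
            break
    return conjuncts

def sequential_covering(records, target, label_key="class"):
    rules, remaining = [], list(records)
    while any(r[label_key] == target for r in remaining):
        rule = learn_one_rule(remaining, target, label_key)
        if not rule:
            break
        rules.append((rule, target))
        # remove the records covered by the accepted rule
        remaining = [r for r in remaining
                     if not all(r[a] == v for a, v in rule.items())]
    return rules
```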
RIPPER Algorithm
suited for imbalanced class distributions;
works well with noisy data sets;
general-to-specific rule growing;
uses FOIL's information gain to choose the best conjunct;
pruning metric: \( (p-n)/(p+n) \), computed on a validation set, where p / n are the positive / negative validation records covered by the rule;
stopping conditions: the minimum description length (MDL) principle, or the new rule's error rate on the validation set exceeding 50%;
additional optimization steps to refine the rules;
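A tiny sketch of the pruning metric, assuming p and n are counted on a separate validation set (names illustrative):
```python
# RIPPER-style pruning metric: (p - n) / (p + n) on the validation set,
# where p / n are the positive / negative validation records covered by the rule.
def pruning_metric(p: int, n: int) -> float:
    return (p - n) / (p + n) if (p + n) > 0 else 0.0

# e.g. a rule covering 8 positives and 2 negatives scores (8 - 2) / (8 + 2) = 0.6
```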
Indirect Methods for Rule Extraction
from other models
C4.5rules Algorithm
uses a decision tree model
5.1.6 Characteristics of Rule-Based Classifiers
Its expressiveness is almost equivalent to that of a decision tree classifier;
It is generally used to build descriptive models, with performance comparable to a decision tree classifier;
The class-based ordering approach makes it well suited for imbalanced class distributions;
5.2 Nearest-Neighbor Classifiers
Eager learners: decision tree, rule-based;
Lazy learners: rote classifier, nearest-neighbor;
Characteristics of Nearest-Neighbor Classifiers
a type of instance-based learning: predictions are based on the distance or similarity to the training data;
classification is expensive because similarities must be computed at prediction time, whereas eager learners spend their effort on model building;
susceptible to noise because the prediction is based on local information;
produces arbitrarily shaped decision boundaries;
an appropriate proximity measure and data preprocessing are required.
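A minimal nearest-neighbor sketch with Euclidean distance and majority voting over the k closest training records (k and the proximity measure are choices, not prescribed here):
```python
import math
from collections import Counter

# Minimal k-nearest-neighbor classifier: no model is built ("lazy");
# all work happens at prediction time by scanning the training set.
def knn_predict(train, test_point, k=3):
    """train: list of (feature_vector, label); test_point: feature vector."""
    dists = sorted(
        (math.dist(x, test_point), y) for x, y in train   # Euclidean distance
    )
    votes = Counter(label for _, label in dists[:k])       # k closest neighbors
    return votes.most_common(1)[0][0]                      # majority class

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))   # -> "A"
```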
5.3 Bayesian Classifiers
Bayes theorem
\( P( Y|X) = \frac {P(X|Y)*P(Y)} {P(X)} \)
P(Y|X): posterior probability;
P(X|Y): class-conditional probability;
P(Y): prior probability;
P(X): evidence
Naive Bayes Classifier (NBC)
Assumption: attributes are conditionally independent, given the class label y.
\( P(X|Y = y) = \displaystyle \prod_{i=1}^{d} P(X_i|Y = y) \)
where \( X = \{ X_1,X_2,...,X_d \} \)
\( P(Y|X) = \frac{P(Y) \prod_{i=1}^{d} P(X_i|Y)} {P(X)} \)
this reduces computing the posterior probability to estimating the class-conditional probabilities \( P(X_i|Y) \).
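A minimal Naive Bayes sketch for categorical attributes, estimating P(Y) and each P(X_i|Y) as simple fractions of the training data (no smoothing; see the m-estimate below); all names are illustrative:
```python
from collections import Counter, defaultdict

# Minimal Naive Bayes for categorical attributes.
# P(Y) and P(X_i | Y) are estimated as fractions of the training records.
def train_nb(records, label_key="class"):
    prior = Counter(r[label_key] for r in records)        # class counts
    cond = defaultdict(Counter)                           # (class, attr) -> value counts
    for r in records:
        for attr, value in r.items():
            if attr != label_key:
                cond[(r[label_key], attr)][value] += 1
    return prior, cond, len(records)

def predict_nb(model, x):
    prior, cond, n = model
    best_class, best_score = None, -1.0
    for y, ny in prior.items():
        score = ny / n                                    # P(Y = y)
        for attr, value in x.items():
            # P(X_i = x_i | Y = y); an unseen value gives 0 (hence the m-estimate)
            score *= cond[(y, attr)][value] / ny
        if score > best_score:
            best_class, best_score = y, score
    # P(X) is the same for every class, so comparing P(Y) * prod P(X_i|Y) suffices
    return best_class
```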
for Categorical Attributes
use the fraction of training instances of the class with the given attribute value;
for Continuous Attributes
-- Discretization: use the fraction of class instances falling in each interval;
-- Assume a distribution, for example a Gaussian distribution:
\( P(X_i = x_i|Y = y_j) = \frac{1}{\sqrt{2 \pi}\, \sigma_{ij}} \exp \left( - \frac{(x_i - \mu_{ij})^2}{2 \sigma_{ij}^2} \right) \)
where \( \mu_{ij} \) is the mean and \( \sigma_{ij}^2 \) the variance of attribute \( X_i \) for class \( y_j \).
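A small helper for the Gaussian case, with the mean and variance estimated per attribute and class from the training data (a sketch; the distributional assumption is a modelling choice):
```python
import math

# Gaussian class-conditional density for a continuous attribute:
# P(X_i = x_i | Y = y_j) with class/attribute-specific mean and std. deviation.
def gaussian_cond(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def estimate_mu_sigma(values):
    """Sample mean and standard deviation of one attribute within one class."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)
    return mu, math.sqrt(var)
```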
M-estimate of Conditional Probability
to avoid the problem of \( P(X_i|Y) = 0 \);
more robust when the number of training examples is small.
\( P(x_i|y_j) = \frac {n_c + mp} {n+m} \)
n: total number of instances from class \( y_j \)
\( n_c \): the number of instances with \( X_i = x_i \) and \( Y = y_j \)
m: the equivalent sample size
p: user-specified prior estimate of the probability;
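A one-function sketch of the m-estimate (the numbers in the example are illustrative):
```python
# m-estimate of P(x_i | y_j): shrinks the raw fraction n_c / n toward the
# prior p, with m controlling how strongly the prior is trusted.
def m_estimate(n_c, n, m, p):
    return (n_c + m * p) / (n + m)

# e.g. n_c = 0, n = 3, m = 3, p = 1/3  ->  (0 + 1) / 6 ≈ 0.167 instead of 0
```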
Characteristics
robust to isolated noise points;
can handle missing values by ignoring them;
robust to irrelevant attributes;
Correlated attributes can degrade the performance;
5.3.4 Bayes Error Rate
example 1 attribute and 2 class labels
\(Error = \int_0^ \hat{x} P(Y_1|X)dX + \int_\hat{x}^\infty P(Y_2|X)dX \)
\( \hat{x} \) (Decision Boundary):
\( P(X= \hat {x}|Y_1) = P(X= \hat{x}| Y_2) \)
\( \hat{x} = \frac { \mu_{y_1} + \mu_{y_2}} {2} \) (when both class-conditional densities are Gaussian with equal variance and the priors are equal)
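A short check under those assumptions: setting the two Gaussian class-conditional densities equal at \( \hat{x} \) cancels the common factor \( \frac{1}{\sqrt{2\pi}\sigma} \), giving \( (\hat{x}-\mu_{y_1})^2 = (\hat{x}-\mu_{y_2})^2 \), hence \( \hat{x} = \frac{\mu_{y_1}+\mu_{y_2}}{2} \).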
5.3.5 Bayesian Belief Networks (BBN)
Concept
relaxes the conditional-independence assumption of NBC;
uses a directed acyclic graph (DAG), with a conditional probability table associated with each node;
-- if node X has no parents: the table contains P(X);
-- if X has one parent Y: the table contains P(X|Y);
-- if X has multiple parents \( \{Y_1, Y_2,...,Y_k\} \): the table contains \( P(X|Y_1,Y_2,...,Y_k) \);
Property 1 (Conditional Independence)
A node in a Bayesian network is conditionally independent of its non-descendants if its parents are known.
Characteristics
prior knowledge, graphical model;
constructing the network is time consuming;
is well suited to incomplete data;
robust to model overfitting;
Joint Probability
\( P( X,Y) = P(Y|X)*P(X) = P(X|Y)*P(Y)\)
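A tiny sketch of how a BBN factorizes a joint probability, using a hypothetical chain network A -> B -> C (all probability tables are invented for illustration):
```python
# Hypothetical chain network A -> B -> C: each node stores P(node | parent).
# The joint probability factorizes as P(A, B, C) = P(A) * P(B|A) * P(C|B).
P_A = {True: 0.3, False: 0.7}
P_B_given_A = {(True, True): 0.8, (True, False): 0.2,    # (b, a) -> P(B=b | A=a)
               (False, True): 0.2, (False, False): 0.8}
P_C_given_B = {(True, True): 0.9, (True, False): 0.1,    # (c, b) -> P(C=c | B=b)
               (False, True): 0.1, (False, False): 0.9}

def joint(a, b, c):
    return P_A[a] * P_B_given_A[(b, a)] * P_C_given_B[(c, b)]

print(joint(True, True, True))   # 0.3 * 0.8 * 0.9 = 0.216
```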
5.4 Artificial Neural Network (ANN)
Concept
neuron(axon->synapse) -> neuron(dendrite)
5.4.1 Perceptron
\( \hat{y} = sign[w_dx_d+...+w_1x_1+w_0x_0] = sign(w \cdot x) \)
where \( w_0 = -t, x_0 = 1 \)
Learning Perceptron Model
the key is the weight update formula:
\(w_j^{(k+1)} = w_j^{(k)} + \lambda(y_i - \hat{y}_i^{(k)}) x_{ij} \)
where \( \lambda \) is the learning rate
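A minimal perceptron training sketch using the update rule above (learning rate, epoch count, and the toy data are illustrative):
```python
# Minimal perceptron training: w_j <- w_j + lambda * (y - y_hat) * x_j,
# with the bias folded in as w_0 (x_0 = 1).
def sign(v):
    return 1 if v >= 0 else -1

def train_perceptron(data, lam=0.1, epochs=20):
    """data: list of (x, y) with x a feature tuple and y in {-1, +1}."""
    d = len(data[0][0])
    w = [0.0] * (d + 1)                        # w[0] is the bias weight (x_0 = 1)
    for _ in range(epochs):
        for x, y in data:
            xs = (1.0,) + tuple(x)             # prepend x_0 = 1
            y_hat = sign(sum(wi * xi for wi, xi in zip(w, xs)))
            w = [wi + lam * (y - y_hat) * xi for wi, xi in zip(w, xs)]
    return w

# Learns a linearly separable concept such as logical AND (with -1/+1 labels):
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(data)
```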
5.4.2 Multilayer Artificial Neural Network
hidden layers & hidden nodes;
feed-forward neural network;
activation functions: Linear, Sigmoid, Tanh, Sign;
Learning the ANN Model
to determine the weights w that minimize:
\( E(w) = \displaystyle \frac{1} {2} \sum_{i=1}^N(y_i-\hat{y_i})^2 \)
weight update formula
\( w_j \leftarrow w_j - \lambda \frac{\partial E(w)}{\partial w_j} \)
5.4.3 Characteristics of ANN
Multilayer neural networks with at least one hidden layer are universal approximators;
ANN can handle redundant features.
Sensitive to noise in the training set (use a validation set to estimate generalization error, or decrease the weights by some factor at each iteration);
The gradient descent method often converges to a local minimum (adding a momentum term can help);
Training ANN is time consuming, but classification is rapid.
5.5 Support Vector Machine (SVM)
5.5.1 Maximum Margin Hyperplanes
5.5.2 Linear SVM: Separable Case
(maximal margin classifier)
Linear Decision Boundary
\( y =
\begin{cases}
1, & if \ w \cdot z + b > 0 \\
-1, & if \ w \cdot z + b < 0
\end{cases}\)
Margin of a Linear Classifier
\( d = \frac{2}{\parallel w \parallel} \)
Definition 5.1 (Linear SVM: Separable Case)
The learning task in SVM can be formalized as the following constrained optimization problem:
\(\displaystyle \min_w \frac{ {\parallel w \parallel}^2 } {2} \)
subject to \(y_i(w \cdot x_i + b) \geq 1, i=1,2,...,N. \)
Lagrange multiplier method
\( L_p = \frac{1}{2} {\parallel w \parallel}^2 - \displaystyle \sum_{i=1}^N \lambda_i \Big( y_i(w \cdot x_i + b) -1 \Big) \)
decision boundary
\( \Big( \displaystyle \sum_{i=1}^N \lambda_iy_ix_i \cdot x \Big) + b =0 \)
5.5.3 Linear SVM: Nonseparable Case
the soft margin approach introduces slack variables (\( \xi \)):
\( \begin{cases}
w \cdot x_i + b \geq 1- \xi_i, if \ y_i =1, \\
w \cdot x_i + b \leq -1 + \xi_i, if \ y_i = -1,
\end{cases}\\
where \ \forall_i: \xi_i > 0 \)
5.5.4 Nonlinear SVM
transform the original attribute space x into a new space \( \Phi (x) \), then apply a linear SVM;
Definition 5.2 (Nonlinear SVM)
The learning task in SVM can be formalized as the following constrained optimization problem:
\(\displaystyle \min_w \frac{ {\parallel w \parallel}^2 } {2} \)
subject to \(y_i(w \cdot \Phi (x_i )+ b) \geq 1, \ i=1,2,...,N. \)
Kernel Trick
to avoid the curse of dimensionality: the similarity in the transformed space is computed directly from the original attributes;
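A small sketch of two common kernel functions; each returns \( \Phi(u) \cdot \Phi(v) \) for some implicit mapping without ever computing \( \Phi \) (hyperparameter values are illustrative):
```python
import math

# Kernel trick: compute similarity in the transformed space directly from the
# original attributes, without materializing Phi(x).
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, degree=2, c=1.0):
    """K(u, v) = (u . v + c)^degree"""
    return (dot(u, v) + c) ** degree

def rbf_kernel(u, v, gamma=0.5):
    """K(u, v) = exp(-gamma * ||u - v||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)
```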
5.6 Ensemble Methods
(Classifier Combination Method)
5.6.2 Methods for Constructing an Ensemble Classifier
By manipulating the
training set
5.6.4 Bagging
bootstrap aggregating;
uniform probability distribution;
each bootstrap sample has the same size as the original data;
on average, a bootstrap sample contains about 63% (1 - 1/e ≈ 0.632) of the original records;
improves the generalization error by reducing the variance of the base classifier;
in low-variance, high-bias cases it can degrade performance, because each base classifier effectively sees only ~63% of the data;
less susceptible to overfitting on noisy data, because every example has an equal chance of being sampled.
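A minimal bagging sketch; the base learner is left abstract (train_base and predict are hypothetical stand-ins, not a specific library API):
```python
import random
from collections import Counter

# Minimal bagging sketch: each base classifier is trained on a bootstrap sample
# (drawn uniformly with replacement, same size as the original data), and the
# ensemble predicts by majority vote.
def bagging(records, train_base, num_classifiers=10, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(num_classifiers):
        sample = [rng.choice(records) for _ in range(len(records))]  # bootstrap
        models.append(train_base(sample))
    return models

def bagged_predict(models, predict, x):
    votes = Counter(predict(m, x) for m in models)
    return votes.most_common(1)[0][0]
```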
5.6.5 Boosting
like bagging, builds an ensemble from resampled data, but assigns a weight to each training example and adaptively changes it after each round;
the weights can be used:
-- as the sampling distribution for drawing the bootstrap sample;
-- by the base classifier to learn a model biased toward higher-weight examples;
boosting algorithms differ in:
-- how the example weights are updated at the end of each round;
-- how the predictions of the base classifiers are combined;
AdaBoost
Update weight:
weight of example i at the (j+1)-th iteration:
\( w_i^{(j+1)} = \frac {w_i^{(j)}} {Z_j} \times
\begin{cases}
e^{- \alpha_j} & \quad \text{if } C_j(x_i) = y_i \\
e^{\alpha_j} & \quad \text{if } C_j(x_i) \neq y_i \\
\end{cases}
\)
\( \text{where } Z_j \text{ is a normalization factor ensuring that } \sum_i w_i^{(j+1)} = 1 \)
importance of a classifier \( C_i \)
\( \alpha_i = \frac{1}{2} \ln \Big( \frac{1- \epsilon_i }{\epsilon_i} \Big) \)
\( \epsilon_i: \text{ error rate of } C_i \)
\( \epsilon_i = \frac{1}{N} \Big[\displaystyle \sum_{j=1}^N I \Big( C_i(x_j) \neq y_j \Big) \Big] \\ \text{where: } I(p) = 1 \text{ if p is true, 0 otherwise} \)
\( \text{N: number of training examples} \)
Combine
\( \text{the prediction of } C_j \text{ is weighted according to its importance } \alpha_j. \)
training error
\( e_{ensemble} \leq \prod_i \Big[ \sqrt {\epsilon_i(1-\epsilon_i) } \Big] \)
AdaBoost is susceptible to overfitting because it focuses on training examples that are wrongly classified.
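A minimal AdaBoost sketch following the weight update and \( \alpha \) formulas above; train_stump is a hypothetical stand-in for any weak learner that accepts example weights:
```python
import math

# Minimal AdaBoost sketch over examples X (feature tuples) and labels y in {-1, +1}.
# train_stump(X, y, w) must return a model whose predict(x) is in {-1, +1};
# it is a hypothetical weak learner that respects the example weights w.
def adaboost(X, y, train_stump, rounds=10):
    n = len(X)
    w = [1.0 / n] * n                                   # initial example weights
    ensemble = []                                       # list of (alpha_j, model)
    for _ in range(rounds):
        model = train_stump(X, y, w)
        pred = [model.predict(x) for x in X]
        eps = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)  # weighted error
        if eps == 0 or eps >= 0.5:                      # weak-learner assumption violated
            break
        alpha = 0.5 * math.log((1 - eps) / eps)         # importance of this classifier
        # decrease weights of correctly classified examples, increase wrong ones
        w = [wi * math.exp(-alpha if p == yi else alpha)
             for wi, p, yi in zip(w, pred, y)]
        z = sum(w)                                      # normalization factor Z_j
        w = [wi / z for wi in w]
        ensemble.append((alpha, model))
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(alpha * model.predict(x) for alpha, model in ensemble)
    return 1 if score >= 0 else -1                      # sign of the weighted vote
```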
By manipulating the
input features
Random forest
designed specifically for decision tree classifiers;
each base decision tree is built using an independent set of random vectors;
the random vectors are generated from a fixed probability distribution;
\( \text{ Generalization error } \leq \frac{ \bar{\rho} (1-s^2)} {s^2}, \\
\text{where } \bar{\rho} \text { : average correlation among the trees; } \\
\text{s: strength of the tree classifiers. }\)
s can be measured in terms of the classifier's margin:
\( \text{margin, } M(X,Y) = P( \hat{Y_\theta} = Y) - \displaystyle \max_{Z \neq Y} P( \hat{Y_\theta} = Z) \)
Forest-RI
: randomly select F input features to split at each node;
Forest-RC
: increases the feature space by creating linear combinations of the input features;
Step 1: Create random vectors;
Step 2: Use the random vectors to build multiple decision trees;
Step 3: Combine the decision trees;
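A minimal Forest-RI-style sketch, assuming X and y are numpy arrays and using scikit-learn's DecisionTreeClassifier as the base tree (its max_features argument restricts each split to F randomly chosen features):
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Forest-RI-style sketch: each tree is grown on a bootstrap sample and
# considers only F randomly selected features at each split (max_features=F).
def random_forest_fit(X, y, num_trees=10, F=2, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(num_trees):
        idx = rng.integers(0, len(X), size=len(X))       # bootstrap sample
        tree = DecisionTreeClassifier(max_features=F,
                                      random_state=int(rng.integers(1_000_000)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def random_forest_predict(trees, x):
    votes = [t.predict([x])[0] for t in trees]           # combine by majority vote
    return max(set(votes), key=votes.count)
```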
By manipulating the
class labels
error-correcting output coding
By manipulating the
learning algorithm
5.6.1 Rationale
Conditions
base classifiers should be independent of each other;
base classifiers should do better than random guessing.
General Approach
step 1: create multiple data sets;
step 2: build multiple classifiers;
step 3: combine classifiers
5.6.3 Bias-Variance Decomposition
a formal method for analyzing the
prediction error
.
\( d_{f,\theta} (y,t) = Bias_\theta + Variance_f + Noise_t \)
where f: the classifier; \( \theta \): the parameters of f; y: the predicted class; t: the true class.
5.7 Class Imbalance Problem
5.7.1 Alternative Metrics
TP, FN, FP, TN
TPR (sensitivity or recall), TNR (specificity), FPR, FNR
TPR = TP / (TP + FN): the fraction of positive examples which are correctly predicted;
TNR = TN / (TN + FP): the fraction of negative examples which are correctly predicted;
FPR = FP / (FP + TN): the fraction of negative examples which are wrongly predicted;
FNR = FN / (FN+TP): the fraction of positive examples which are wrongly predicted;
\( \text{Precision, } p = \frac{TP}{TP+FP} \)
\( \text{Recall, } r = \frac{TP}{TP+FN} \)
\( \text{harmonic mean of r, p: } F_1 = \frac{2rp}{r+p} = \frac{2}{\frac{1}{r}+\frac{1}{p}} \)
\( F_\beta = \frac{(\beta^2 + 1)rp}{r+\beta^2p} \)
\( \text{Weighted accuracy} = \frac{w_1TP + w_4 TN}{w_1TP+w_2FN+w_3FP+w_4TN} \)
\( \begin{array}{l|cccc}
\text{Measure} & w_1 \,(TP) & w_2 \,(FN) & w_3 \,(FP) & w_4 \,(TN) \\ \hline
\text{Recall} & 1 & 1 & 0 & 0 \\
\text{Precision} & 1 & 0 & 1 & 0 \\
F_\beta & \beta^2+1 & \beta^2 & 1 & 0 \\
\text{Accuracy} & 1 & 1 & 1 & 1
\end{array} \)
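A small sketch computing these measures from the four confusion-matrix counts (the example numbers are illustrative):
```python
# Metrics derived from the confusion-matrix counts of a binary classifier.
def metrics(tp, fn, fp, tn, beta=1.0):
    recall = tp / (tp + fn)                     # TPR / sensitivity
    precision = tp / (tp + fp)
    f_beta = ((beta**2 + 1) * recall * precision) / (recall + beta**2 * precision)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {"recall": recall, "precision": precision,
            "F_beta": f_beta, "accuracy": accuracy}

# e.g. TP=40, FN=10, FP=20, TN=30 -> recall 0.8, precision ~0.667, F_1 ~0.727
print(metrics(40, 10, 20, 30))
```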
5.7.2 The Receiver Operating Characteristic (ROC) Curve
x axis: FPR ; y axis: TPR;
closer to upper-left (0,1), the better;
used to compare models;
5.7.3 Cost-Sensitive Learning
\( C_t(M) =TP \times C(+,+) + FP \times C(-,+) \\ + FN \times C(+,-) + TN \times C(-,-) \)
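A one-function sketch of the total cost; the cost-matrix values in the example are invented for illustration:
```python
# Total cost of a model given confusion-matrix counts and a cost matrix C,
# where C[(actual, predicted)] is the cost of that outcome.
def total_cost(tp, fp, fn, tn, C):
    return (tp * C[("+", "+")] + fp * C[("-", "+")]
            + fn * C[("+", "-")] + tn * C[("-", "-")])

# e.g. false negatives 10x more costly than false positives:
C = {("+", "+"): 0, ("-", "+"): 1, ("+", "-"): 10, ("-", "-"): 0}
print(total_cost(tp=40, fp=20, fn=10, tn=30, C=C))   # 20*1 + 10*10 = 120
```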
5.7.4 Sampling-Based Approaches
undersampling;
oversampling;
hybrid approach;
5.8 Multiclass Problem
1-r: one-against-rest approach
1-1: one-against-one approach
Error-Correcting Output Coding
to improve robustness to errors of the binary classifiers;
encode the class labels as binary codewords;
the key design issue is how to choose the codewords;
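A small decoding sketch: one binary classifier predicts each codeword bit, and the record is assigned to the class whose codeword is nearest in Hamming distance (the codewords here are only an illustration):
```python
# Error-correcting output coding (ECOC) decoding sketch.
# Each class has a binary codeword; one binary classifier predicts each bit.
codewords = {                      # illustrative 7-bit codewords for 4 classes
    "y1": (1, 1, 1, 1, 1, 1, 1),
    "y2": (0, 0, 0, 0, 1, 1, 1),
    "y3": (0, 0, 1, 1, 0, 0, 1),
    "y4": (0, 1, 0, 1, 0, 1, 0),
}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def ecoc_decode(predicted_bits):
    """Assign the class whose codeword is nearest to the predicted bit string."""
    return min(codewords, key=lambda c: hamming(codewords[c], predicted_bits))

# If the 7 binary classifiers output (0, 1, 1, 1, 1, 1, 1), the distances are
# 1 to y1 and 3 to each of the others, so the record is assigned to "y1".
print(ecoc_decode((0, 1, 1, 1, 1, 1, 1)))
```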