5 Classification: Alternative Techniques
5.1 Rule-Based Classifier
Concepts
rule set:
\( R = (r_1 \lor r_2 \lor \dots \lor r_k) \)
rule antecedent (or precondition) -> rule consequent:
\( r_i: (\text{Condition}_i) \rightarrow y_i \)
\( \text{Condition}_i = (A_1 \ op \ v_1) \land (A_2 \ op \ v_2) \land \dots \land (A_k \ op \ v_k) \)
op is a logical operator (=, !=, <, >, <=, >=)
conjunct: \( (A_i \ op \ v_i) \)
Coverage(r) = |A| / |D|
Accuracy(r) = |A ∩ y| / |A|
|A|: the number of records that satisfy the rule antecedent (i.e. are triggered by r);
|D|: the total number of records;
|A ∩ y|: the number of records triggered by r whose class label equals the consequent y;
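A minimal sketch of these two measures; the record layout, rule encoding, and all names below are illustrative, not from the text:
```python
# Minimal sketch: coverage and accuracy of a single rule.
# A rule is modelled as (antecedent, consequent), where antecedent is a
# predicate over a record (dict) and consequent is the predicted class label.

def coverage(rule, records):
    antecedent, _ = rule
    covered = [r for r in records if antecedent(r)]               # the set A
    return len(covered) / len(records)                            # |A| / |D|

def accuracy(rule, records, label_key="class"):
    antecedent, consequent = rule
    covered = [r for r in records if antecedent(r)]               # the set A
    correct = [r for r in covered if r[label_key] == consequent]  # A ∩ y
    return len(correct) / len(covered) if covered else 0.0        # |A ∩ y| / |A|

# Toy example: rule "refund = No AND status = Married -> No".
data = [
    {"refund": "No",  "status": "Married", "class": "No"},
    {"refund": "No",  "status": "Single",  "class": "Yes"},
    {"refund": "Yes", "status": "Married", "class": "No"},
]
rule = (lambda r: r["refund"] == "No" and r["status"] == "Married", "No")
print(coverage(rule, data), accuracy(rule, data))   # 0.33..., 1.0
```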
Two Key Properties
Exhaustive Rules
Every record should be covered by (trigger) at least one rule.
default rule
A default rule, with an empty precondition, assigns any record that fails all other rules to a default class.
Mutually Exclusive Rules
Each record triggers at most one rule.
Ordered Rules
Rule-Based Ordering Scheme
Class-Based Ordering Scheme
commonly used.
Unordered Rules
5.1.3 How to Build a Rule-Based Classifier
Direct Methods for Rule Extraction
from data
Sequential Covering Algorithm
Rule-Growing Strategy
general-to-specific
specific-to-general
Rule Evaluation
metrics should consider both accuracy and coverage;
Likelihood ratio statistic: prunes rules with poor coverage;
Laplace, m-estimate: take coverage into account in the evaluation;
FOIL's information gain: takes the support count into account;
Rule Pruning
to reduce the generalization error.
Rationale for Sequential Covering
After a rule is selected during rule growing, removing the records it covers may negatively affect the selection of subsequent rules.
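A minimal sketch of the sequential-covering loop, with a greedy general-to-specific Learn-One-Rule step scored by accuracy; all helper names are illustrative:
```python
# Minimal sketch of sequential covering: learn rules for one class at a time,
# removing the records covered by each accepted rule before learning the next.

def learn_one_rule(records, target, label_key="class"):
    """Greedy general-to-specific growth: keep adding the equality conjunct
    that most improves accuracy until no conjunct helps."""
    conjuncts = {}
    while True:
        best = None
        for attr in records[0]:
            if attr == label_key or attr in conjuncts:
                continue
            for value in {r[attr] for r in records}:
                cand = dict(conjuncts, **{attr: value})
                covered = [r for r in records
                           if all(r[a] == v for a, v in cand.items())]
                if not covered:
                    continue
                acc = sum(r[label_key] == target for r in covered) / len(covered)
                if best is None or acc > best[0]:
                    best = (acc, cand)
        if best is None or (conjuncts and best[0] <= current_acc):
            break                                   # no further improvement
        current_acc, conjuncts = best
        if current_acc == 1.0:
            break
    return conjuncts

def sequential_covering(records, target, label_key="class"):
    rules, remaining = [], list(records)
    while any(r[label_key] == target for r in remaining):
        rule = learn_one_rule(remaining, target, label_key)
        if not rule:
            break
        rules.append((rule, target))
        # remove the records covered by the accepted rule
        remaining = [r for r in remaining
                     if not all(r[a] == v for a, v in rule.items())]
    return rules
```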
RIPPER Algorithm
suited for imbalanced class distributions;
works well with noisy data sets;
general-to-specific rule growing;
uses FOIL's information gain to choose the best conjunct;
pruning metric: \( (p-n)/(p+n) \), computed on a validation set, where p / n are the positive / negative validation records covered by the rule;
stopping conditions: the minimum description length (MDL) principle, or the new rule's error rate on the validation set exceeding 50%;
additional optimization steps to refine the rules;
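A tiny sketch of the pruning metric, assuming p and n are counted on a separate validation set (names illustrative):
```python
# RIPPER-style pruning metric: (p - n) / (p + n) on the validation set,
# where p / n are the positive / negative validation records covered by the rule.
def pruning_metric(p: int, n: int) -> float:
    return (p - n) / (p + n) if (p + n) > 0 else 0.0

# e.g. a rule covering 8 positives and 2 negatives scores (8 - 2) / (8 + 2) = 0.6
```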
Indirect Methods for Rule Extraction
from other models
C4.5rules Algorithm
uses a decision tree model
5.1.6 Characteristics of Rule-Based Classifiers
Its expressiveness is almost equivalent to that of a decision tree classifier;
It is generally used to build descriptive models, with performance comparable to a decision tree classifier;
The class-based ordering approach makes it well suited for imbalanced class distributions;
5.2 Nearest-Neighbor Classifiers
Eager learners: decision tree, rule-based;
Lazy learners: rote classifier, nearest-neighbor;
Characteristics of Nearest-Neighbor Classifiers
a type of instance-based learning: predictions are based on the distance or similarity to the training data;
classification is expensive because similarities must be computed at prediction time, whereas eager learners spend their effort on model building;
susceptible to noise because the prediction is based on local information;
produces arbitrarily shaped decision boundaries;
an appropriate proximity measure and data preprocessing are required.
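A minimal nearest-neighbor sketch with Euclidean distance and majority voting over the k closest training records (k and the proximity measure are choices, not prescribed here):
```python
import math
from collections import Counter

# Minimal k-nearest-neighbor classifier: no model is built ("lazy");
# all work happens at prediction time by scanning the training set.
def knn_predict(train, test_point, k=3):
    """train: list of (feature_vector, label); test_point: feature vector."""
    dists = sorted(
        (math.dist(x, test_point), y) for x, y in train   # Euclidean distance
    )
    votes = Counter(label for _, label in dists[:k])       # k closest neighbors
    return votes.most_common(1)[0][0]                      # majority class

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))   # -> "A"
```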
5.3 Bayesian Classifiers
Bayes theorem
\( P( Y|X) = \frac {P(X|Y)*P(Y)} {P(X)} \)
P(Y|X): posterior probability;
P(X|Y): class-conditional probability;
P(Y): prior probability;
P(X): evidence
Naive Bayes Classifier (NBC)
Assumption: attributes are conditionally independent, given the class label y.
\( P(X|Y = y) = \displaystyle \prod_{i=1}^{d} P(X_i|Y = y) \)
where \( X = \{ X_1,X_2,...,X_d \} \)
\( P(Y|X) = \frac{P(Y) \prod_{i=1}^{d} P(X_i|Y)} {P(X)} \)
this reduces computing the posterior probability to estimating the class-conditional probabilities \( P(X_i|Y) \).
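A minimal Naive Bayes sketch for categorical attributes, estimating P(Y) and each P(X_i|Y) as simple fractions of the training data (no smoothing; see the m-estimate below); all names are illustrative:
```python
from collections import Counter, defaultdict

# Minimal Naive Bayes for categorical attributes.
# P(Y) and P(X_i | Y) are estimated as fractions of the training records.
def train_nb(records, label_key="class"):
    prior = Counter(r[label_key] for r in records)        # class counts
    cond = defaultdict(Counter)                           # (class, attr) -> value counts
    for r in records:
        for attr, value in r.items():
            if attr != label_key:
                cond[(r[label_key], attr)][value] += 1
    return prior, cond, len(records)

def predict_nb(model, x):
    prior, cond, n = model
    best_class, best_score = None, -1.0
    for y, ny in prior.items():
        score = ny / n                                    # P(Y = y)
        for attr, value in x.items():
            # P(X_i = x_i | Y = y); an unseen value gives 0 (hence the m-estimate)
            score *= cond[(y, attr)][value] / ny
        if score > best_score:
            best_class, best_score = y, score
    # P(X) is the same for every class, so comparing P(Y) * prod P(X_i|Y) suffices
    return best_class
```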
for Categorical Attributes
use the fraction of training instances of the class with the given attribute value;
for Continuous Attributes
-- Discretization: use the fraction of class instances falling in each interval;
-- Assume a distribution, for example a Gaussian distribution:
\( P(X_i = x_i|Y = y_j) = \frac{1}{\sqrt{2 \pi}\, \sigma_{ij}} \exp \left( - \frac{(x_i - \mu_{ij})^2}{2 \sigma_{ij}^2} \right) \)
where \( \mu_{ij} \) is the mean and \( \sigma_{ij}^2 \) the variance of attribute \( X_i \) for class \( y_j \).
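A small helper for the Gaussian case, with the mean and variance estimated per attribute and class from the training data (a sketch; the distributional assumption is a modelling choice):
```python
import math

# Gaussian class-conditional density for a continuous attribute:
# P(X_i = x_i | Y = y_j) with class/attribute-specific mean and std. deviation.
def gaussian_cond(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def estimate_mu_sigma(values):
    """Sample mean and standard deviation of one attribute within one class."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)
    return mu, math.sqrt(var)
```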
M-estimate of Conditional Probability
to avoid the problem of \( P(X_i|Y) = 0 \);
more robust when the number of training examples is small.
\( P(x_i|y_j) = \frac {n_c + mp} {n+m} \)
n: total number of instances from class \( y_j \)
\( n_c \): the number of instances with \( X_i = x_i \) and \( Y = y_j \)
m: the equivalent sample size
p: user-specified prior estimate of the probability;
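A one-function sketch of the m-estimate (the numbers in the example are illustrative):
```python
# m-estimate of P(x_i | y_j): shrinks the raw fraction n_c / n toward the
# prior p, with m controlling how strongly the prior is trusted.
def m_estimate(n_c, n, m, p):
    return (n_c + m * p) / (n + m)

# e.g. n_c = 0, n = 3, m = 3, p = 1/3  ->  (0 + 1) / 6 ≈ 0.167 instead of 0
```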
Characteristics
robust to isolated noise points;
can handle missing values by ignoring them;
robust to irrelevant attributes;
Correlated attributes can degrade the performance;
5.3.4 Bayes Error Rate
example 1 attribute and 2 class labels
\(Error = \int_0^ \hat{x} P(Y_1|X)dX + \int_\hat{x}^\infty P(Y_2|X)dX \)
\( \hat{x} \) (Decision Boundary):
\( P(X= \hat {x}|Y_1) = P(X= \hat{x}| Y_2) \)
\( \hat{x} = \frac { \mu_{y_1} + \mu_{y_2}} {2} \) (when both class-conditional densities are Gaussian with equal variance and the priors are equal)
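A short check under those assumptions: setting the two Gaussian class-conditional densities equal at \( \hat{x} \) cancels the common factor \( \frac{1}{\sqrt{2\pi}\sigma} \), giving \( (\hat{x}-\mu_{y_1})^2 = (\hat{x}-\mu_{y_2})^2 \), hence \( \hat{x} = \frac{\mu_{y_1}+\mu_{y_2}}{2} \).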
5.3.5 Bayesian Belief Networks (BBN)
Concept
relaxes the conditional-independence assumption of NBC;
uses a directed acyclic graph (DAG), with a conditional probability table associated with each node;
-- if node X has no parents: the table contains P(X);
-- if X has one parent Y: the table contains P(X|Y);
-- if X has multiple parents \( \{Y_1, Y_2,...,Y_k\} \): the table contains \( P(X|Y_1,Y_2,...,Y_k) \);
Property 1 (Conditional Independence)
A node in a Bayesian network is conditionally independent of its non-descendants if its parents are known.
Characteristics
prior knowledge, graphical model;
constructing the network is time consuming;
is well suited to incomplete data;
robust to model overfitting;
Joint Probability
\( P( X,Y) = P(Y|X)*P(X) = P(X|Y)*P(Y)\)
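A tiny sketch of how a BBN factorizes a joint probability, using a hypothetical chain network A -> B -> C (all probability tables are invented for illustration):
```python
# Hypothetical chain network A -> B -> C: each node stores P(node | parent).
# The joint probability factorizes as P(A, B, C) = P(A) * P(B|A) * P(C|B).
P_A = {True: 0.3, False: 0.7}
P_B_given_A = {(True, True): 0.8, (True, False): 0.2,    # (b, a) -> P(B=b | A=a)
               (False, True): 0.2, (False, False): 0.8}
P_C_given_B = {(True, True): 0.9, (True, False): 0.1,    # (c, b) -> P(C=c | B=b)
               (False, True): 0.1, (False, False): 0.9}

def joint(a, b, c):
    return P_A[a] * P_B_given_A[(b, a)] * P_C_given_B[(c, b)]

print(joint(True, True, True))   # 0.3 * 0.8 * 0.9 = 0.216
```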
5.4 Artificial Neural Network (ANN)
Concept
neuron(axon->synapse) -> neuron(dendrite)
5.4.1 Perceptron
\( \hat{y} = sign[w_dx_d+...+w_1x_1+w_0x_0] = sign(w \cdot x) \)
where \( w_0 = -t, x_0 = 1 \)
Learning Perceptron Model
the key is the weight update formula:
\(w_j^{(k+1)} = w_j^{(k)} + \lambda(y_i - \hat{y}_i^{(k)}) x_{ij} \)
where \( \lambda \) is the learning rate
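A minimal perceptron training sketch using the update rule above (learning rate, epoch count, and the toy data are illustrative):
```python
# Minimal perceptron training: w_j <- w_j + lambda * (y - y_hat) * x_j,
# with the bias folded in as w_0 (x_0 = 1).
def sign(v):
    return 1 if v >= 0 else -1

def train_perceptron(data, lam=0.1, epochs=20):
    """data: list of (x, y) with x a feature tuple and y in {-1, +1}."""
    d = len(data[0][0])
    w = [0.0] * (d + 1)                        # w[0] is the bias weight (x_0 = 1)
    for _ in range(epochs):
        for x, y in data:
            xs = (1.0,) + tuple(x)             # prepend x_0 = 1
            y_hat = sign(sum(wi * xi for wi, xi in zip(w, xs)))
            w = [wi + lam * (y - y_hat) * xi for wi, xi in zip(w, xs)]
    return w

# Learns a linearly separable concept such as logical AND (with -1/+1 labels):
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(data)
```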
5.4.2 Multilayer Artificial Neural Network
hidden layers & hidden nodes;
feed-forward neural network;
activation functions: Linear, Sigmoid, Tanh, Sign;
Learning the ANN Model
to determine the weights w that minimize:
\( E(w) = \displaystyle \frac{1} {2} \sum_{i=1}^N(y_i-\hat{y_i})^2 \)
weight update formula
\( w_j \leftarrow w_j - \lambda \frac{\partial E(w)}{\partial w_j} \)
5.4.3 Characteristics of ANN
Multilayer neural networks with at least one hidden layer are universal approximators;
ANN can handle redundant features.
Sensitive to noise in the training set (use a validation set to estimate generalization error, or decrease the weights by some factor at each iteration);
The gradient descent method often converges to a local minimum (adding a momentum term can help);
Training ANN is time consuming, but classification is rapid.
5.5 Support Vector Machine (SVM)
5.5.1 Maximum Margin Hyperplanes
5.5.2 Linear SVM: Separable Case
(maximal margin classifier)
Linear Decision Boundary
\( y =
\begin{cases}
1, & if \ w \cdot z + b > 0 \\
-1, & if \ w \cdot z + b < 0
\end{cases}\)
Margin of a Linear Classifier
\( d = \frac{2}{\parallel w \parallel} \)
Definition 5.1 (Linear SVM: Separable Case)
The learning task in SVM can be formalized as the following constrained optimization problem:
\(\displaystyle \min_w \frac{ {\parallel w \parallel}^2 } {2} \)
subject to \(y_i(w \cdot x_i + b) \geq 1, i=1,2,...,N. \)
Lagrange multiplier method
\( L_p = \frac{1}{2} {\parallel w \parallel}^2 - \displaystyle \sum_{i=1}^N \lambda_i \Big( y_i(w \cdot x_i + b) -1 \Big) \)
decision boundary
\( \Big( \displaystyle \sum_{i=1}^N \lambda_iy_ix_i \cdot x \Big) + b =0 \)
5.5.3 Linear SVM: Nonseparable Case
the soft margin approach introduces slack variables (\( \xi \)):
\( \begin{cases}
w \cdot x_i + b \geq 1- \xi_i, if \ y_i =1, \\
w \cdot x_i + b \leq -1 + \xi_i, if \ y_i = -1,
\end{cases}\\
where \ \forall_i: \xi_i > 0 \)
5.5.4 Nonlinear SVM
transform the original attribute space x into a new space \( \Phi (x) \), then apply a linear SVM;
Definition 5.2 (Nonlinear SVM)
The learning task in SVM can be formalized as the following constrained optimization problem:
\(\displaystyle \min_w \frac{ {\parallel w \parallel}^2 } {2} \)
subject to \(y_i(w \cdot \Phi (x_i )+ b) \geq 1, \ i=1,2,...,N. \)
Kernel Trick
to avoid the curse of dimensionality: the similarity in the transformed space is computed directly from the original attributes;
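A small sketch of two common kernel functions; each returns \( \Phi(u) \cdot \Phi(v) \) for some implicit mapping without ever computing \( \Phi \) (hyperparameter values are illustrative):
```python
import math

# Kernel trick: compute similarity in the transformed space directly from the
# original attributes, without materializing Phi(x).
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, degree=2, c=1.0):
    """K(u, v) = (u . v + c)^degree"""
    return (dot(u, v) + c) ** degree

def rbf_kernel(u, v, gamma=0.5):
    """K(u, v) = exp(-gamma * ||u - v||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)
```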
5.6 Ensemble Methods
(Classifier Combination Method)
5.6.2 Methods for Constructing an Ensemble Classifier
By manipulating the
training set
5.6.4 Bagging
bootstrap aggregating;
uniform probability distribution;
each bootstrap sample has the same size as the original data;
on average, a bootstrap sample contains about 63% (1 - 1/e ≈ 0.632) of the original records;
improves the generalization error by reducing the variance of the base classifier;
in low-variance, high-bias cases it can degrade performance, because each base classifier effectively sees only ~63% of the data;
less susceptible to overfitting on noisy data, because every example has an equal chance of being sampled.
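A minimal bagging sketch; the base learner is left abstract (train_base and predict are hypothetical stand-ins, not a specific library API):
```python
import random
from collections import Counter

# Minimal bagging sketch: each base classifier is trained on a bootstrap sample
# (drawn uniformly with replacement, same size as the original data), and the
# ensemble predicts by majority vote.
def bagging(records, train_base, num_classifiers=10, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(num_classifiers):
        sample = [rng.choice(records) for _ in range(len(records))]  # bootstrap
        models.append(train_base(sample))
    return models

def bagged_predict(models, predict, x):
    votes = Counter(predict(m, x) for m in models)
    return votes.most_common(1)[0][0]
```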
5.6.5 Boosting
like bagging, builds an ensemble from resampled data, but assigns a weight to each training example and adaptively changes it after each round;
the weights can be used:
-- as the sampling distribution for drawing the bootstrap sample;
-- by the base classifier to learn a model biased toward higher-weight examples;
boosting algorithms differ in:
-- how the example weights are updated at the end of each round;
-- how the predictions of the base classifiers are combined;
AdaBoost
Update weight:
weight of example i at the (j+1)-th iteration:
\( w_i^{(j+1)} = \frac {w_i^{(j)}} {Z_j} \times
\begin{cases}
e^{- \alpha_j} & \quad \text{if } C_j(x_i) = y_i \\
e^{\alpha_j} & \quad \text{if } C_j(x_i) \neq y_i \\
\end{cases}
\)
\( \text{where } Z_j \text{ is a normalization factor ensuring that } \sum_i w_i^{(j+1)} = 1 \)
importance of a classifier \( C_i \)
\( \alpha_i = \frac{1}{2} \ln \Big( \frac{1- \epsilon_i }{\epsilon_i} \Big) \)
\( \epsilon_i: \text{ error rate of } C_i \)
\( \epsilon_i = \frac{1}{N} \Big[\displaystyle \sum_{j=1}^N I \Big( C_i(x_j) \neq y_j \Big) \Big] \\ \text{where: } I(p) = 1 \text{ if p is true, 0 otherwise} \)
\( \text{N: number of training examples} \)
Combine
\( \text{the prediction of } C_j \text{ is weighted according to its importance } \alpha_j. \)
training error
\( e_{ensemble} \leq \prod_i \Big[ \sqrt {\epsilon_i(1-\epsilon_i) } \Big] \)
AdaBoost is susceptible to overfitting because it focuses on training examples that are wrongly classified.
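A minimal AdaBoost sketch following the weight update and \( \alpha \) formulas above; train_stump is a hypothetical stand-in for any weak learner that accepts example weights:
```python
import math

# Minimal AdaBoost sketch over examples X (feature tuples) and labels y in {-1, +1}.
# train_stump(X, y, w) must return a model whose predict(x) is in {-1, +1};
# it is a hypothetical weak learner that respects the example weights w.
def adaboost(X, y, train_stump, rounds=10):
    n = len(X)
    w = [1.0 / n] * n                                   # initial example weights
    ensemble = []                                       # list of (alpha_j, model)
    for _ in range(rounds):
        model = train_stump(X, y, w)
        pred = [model.predict(x) for x in X]
        eps = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)  # weighted error
        if eps == 0 or eps >= 0.5:                      # weak-learner assumption violated
            break
        alpha = 0.5 * math.log((1 - eps) / eps)         # importance of this classifier
        # decrease weights of correctly classified examples, increase wrong ones
        w = [wi * math.exp(-alpha if p == yi else alpha)
             for wi, p, yi in zip(w, pred, y)]
        z = sum(w)                                      # normalization factor Z_j
        w = [wi / z for wi in w]
        ensemble.append((alpha, model))
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(alpha * model.predict(x) for alpha, model in ensemble)
    return 1 if score >= 0 else -1                      # sign of the weighted vote
```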
By manipulating the
input features
Random forest
designed specifically for decision tree classifiers;
each base decision tree is built using an independent set of random vectors;
the random vectors are generated from a fixed probability distribution;
\( \text{ Generalization error } \leq \frac{ \bar{\rho} (1-s^2)} {s^2}, \\
\text{where } \bar{\rho} \text { : average correlation among the trees; } \\
\text{s: strength of the tree classifiers. }\)
s can be measured in terms of the classifier's margin:
\( \text{margin, } M(X,Y) = P( \hat{Y_\theta} = Y) - \displaystyle \max_{Z \neq Y} P( \hat{Y_\theta} = Z) \)
Forest-RI
: randomly select F input features to split at each node;
Forest-RC
: increases the feature space by creating linear combinations of the input features;
Step 1: Create random vectors;
Step 2: Use the random vectors to build multiple decision trees;
Step 3: Combine the decision trees;
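A minimal Forest-RI-style sketch, assuming X and y are numpy arrays and using scikit-learn's DecisionTreeClassifier as the base tree (its max_features argument restricts each split to F randomly chosen features):
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Forest-RI-style sketch: each tree is grown on a bootstrap sample and
# considers only F randomly selected features at each split (max_features=F).
def random_forest_fit(X, y, num_trees=10, F=2, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(num_trees):
        idx = rng.integers(0, len(X), size=len(X))       # bootstrap sample
        tree = DecisionTreeClassifier(max_features=F,
                                      random_state=int(rng.integers(1_000_000)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def random_forest_predict(trees, x):
    votes = [t.predict([x])[0] for t in trees]           # combine by majority vote
    return max(set(votes), key=votes.count)
```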
By manipulating the
class labels
error-correcting output coding
By manipulating the
learning algorithm
5.6.1 Rationale
Conditions
base classifiers should be independent of each other;
base classifiers should do better than random guessing.
General Approach
step 1: create multiple data sets;
step 2: build multiple classifiers;
step 3: combine classifiers
5.6.3 Bias-Variance Decomposition
a formal method for analyzing the
prediction error
.
\( d_{f,\theta} (y,t) = Bias_\theta + Variance_f + Noise_t \)
where f: the classifier; \( \theta \): the parameters of f; y: the predicted class; t: the true class.
5.7 Class Imbalance Problem
5.7.1 Alternative Metrics
TP, FN, FP, TN
TPR (sensitivity or recall), TNR (specificity), FPR, FNR
TPR = TP / (TP + FN): the fraction of positive examples which are correctly predicted;
TNR = TN / (TN + FP): the fraction of negative examples which are correctly predicted;
FPR = FP / (FP + TN): the fraction of negative examples which are wrongly predicted;
FNR = FN / (FN+TP): the fraction of positive examples which are wrongly predicted;
\( \text{Precision, } p = \frac{TP}{TP+FP} \)
\( \text{Recall, } r = \frac{TP}{TP+FN} \)
\( \text{harmonic mean of r, p: } F_1 = \frac{2rp}{r+p} = \frac{2}{\frac{1}{r}+\frac{1}{p}} \)
\( F_\beta = \frac{(\beta^2 + 1)rp}{r+\beta^2p} \)
\( \text{Weighted accuracy} = \frac{w_1TP + w_4 TN}{w_1TP+w_2FN+w_3FP+w_4TN} \)
\( \begin{array}{l|cccc}
\text{Measure} & w_1 \,(TP) & w_2 \,(FN) & w_3 \,(FP) & w_4 \,(TN) \\ \hline
\text{Recall} & 1 & 1 & 0 & 0 \\
\text{Precision} & 1 & 0 & 1 & 0 \\
F_\beta & \beta^2+1 & \beta^2 & 1 & 0 \\
\text{Accuracy} & 1 & 1 & 1 & 1
\end{array} \)
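A small sketch computing these measures from the four confusion-matrix counts (the example numbers are illustrative):
```python
# Metrics derived from the confusion-matrix counts of a binary classifier.
def metrics(tp, fn, fp, tn, beta=1.0):
    recall = tp / (tp + fn)                     # TPR / sensitivity
    precision = tp / (tp + fp)
    f_beta = ((beta**2 + 1) * recall * precision) / (recall + beta**2 * precision)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {"recall": recall, "precision": precision,
            "F_beta": f_beta, "accuracy": accuracy}

# e.g. TP=40, FN=10, FP=20, TN=30 -> recall 0.8, precision ~0.667, F_1 ~0.727
print(metrics(40, 10, 20, 30))
```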
5.7.2 The Receiver Operating Characteristic (ROC) Curve
x axis: FPR ; y axis: TPR;
closer to upper-left (0,1), the better;
used to compare models;
5.7.3 Cost-Sensitive Learning
\( C_t(M) =TP \times C(+,+) + FP \times C(-,+) \\ + FN \times C(+,-) + TN \times C(-,-) \)
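A one-function sketch of the total cost; the cost-matrix values in the example are invented for illustration:
```python
# Total cost of a model given confusion-matrix counts and a cost matrix C,
# where C[(actual, predicted)] is the cost of that outcome.
def total_cost(tp, fp, fn, tn, C):
    return (tp * C[("+", "+")] + fp * C[("-", "+")]
            + fn * C[("+", "-")] + tn * C[("-", "-")])

# e.g. false negatives 10x more costly than false positives:
C = {("+", "+"): 0, ("-", "+"): 1, ("+", "-"): 10, ("-", "-"): 0}
print(total_cost(tp=40, fp=20, fn=10, tn=30, C=C))   # 20*1 + 10*10 = 120
```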
5.7.4 Sampling-Based Approaches
undersampling;
oversampling;
hybrid approach;
5.8 Multiclass Problem
1-r: one-against-rest approach
1-1: one-against-one approach
Error-Correcting Output Coding
to improve robustness to errors of the binary classifiers;
encode the class labels as binary codewords;
the key design issue is how to choose the codewords;
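A small decoding sketch: one binary classifier predicts each codeword bit, and the record is assigned to the class whose codeword is nearest in Hamming distance (the codewords here are only an illustration):
```python
# Error-correcting output coding (ECOC) decoding sketch.
# Each class has a binary codeword; one binary classifier predicts each bit.
codewords = {                      # illustrative 7-bit codewords for 4 classes
    "y1": (1, 1, 1, 1, 1, 1, 1),
    "y2": (0, 0, 0, 0, 1, 1, 1),
    "y3": (0, 0, 1, 1, 0, 0, 1),
    "y4": (0, 1, 0, 1, 0, 1, 0),
}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def ecoc_decode(predicted_bits):
    """Assign the class whose codeword is nearest to the predicted bit string."""
    return min(codewords, key=lambda c: hamming(codewords[c], predicted_bits))

# If the 7 binary classifiers output (0, 1, 1, 1, 1, 1, 1), the distances are
# 1 to y1 and 3 to each of the others, so the record is assigned to "y1".
print(ecoc_decode((0, 1, 1, 1, 1, 1, 1)))
```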