COMP551 Applied Machine Learning
KNN
Lazy Learner (no training process)
HyperParam
different measures of distance
K
larger: underfit, high bias, smoother boundary
smaller: overfit, high variance, rough boundary
Predict by majority vote among the K nearest (most similar) samples
Improvement
Weight each neighbour's vote by inverse distance (instead of uniform weights)
Scale more important features / normalize features with different scales
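A minimal sketch of the voting rule above with optional inverse-distance weighting, assuming NumPy arrays X_train, y_train and a query point x_query (hypothetical names) and Euclidean distance:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5, weighted=True):
    """Predict a label for x_query by a (weighted) vote of the k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                     # indices of the k closest samples
    weights = 1.0 / (dists[nearest] + 1e-12) if weighted else np.ones(k)
    votes = {}
    for idx, w in zip(nearest, weights):
        votes[y_train[idx]] = votes.get(y_train[idx], 0.0) + w
    return max(votes, key=votes.get)                    # label with the largest (weighted) vote
```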
Decision Tree
Cost
misclassification rate
entropy cost = \( H(y) = -\sum_{c=1}^{C} p(y=c) \log p(y=c) \)
Gini index cost = \( \sum_{c=1}^{C} p(y=c)(1-p(y=c)) \)
HyperParam
cost function
minimum impurity decrease
Divide the input space into regions R1, R2,..., Rk using a tree structure. Assign a prediction label to each region.
Easily overfits: use tree pruning
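A minimal sketch of the two impurity costs above, assuming y is a 1-D NumPy array of integer class labels 0..C-1 (hypothetical name):

```python
import numpy as np

def entropy(y):
    """Entropy cost H(y) = -sum_c p(y=c) log p(y=c)."""
    p = np.bincount(y) / len(y)   # empirical class probabilities (assumes integer labels 0..C-1)
    p = p[p > 0]                  # drop empty classes, since 0 log 0 = 0
    return -np.sum(p * np.log(p))

def gini(y):
    """Gini index cost = sum_c p(y=c) (1 - p(y=c))."""
    p = np.bincount(y) / len(y)
    return np.sum(p * (1 - p))
```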
Linear Regression
Square error loss (L2 loss): \( L(y, \hat{y}) = \frac{1}{2} (y-\hat{y}) ^2 \)
Equivalent to maximizing the likelihood under a Gaussian noise model
Learning process: find the weight vector that minimizes the loss. Closed form: \( w^{*} = (X^{T}X)^{-1}X^{T}y \), or gradient descent
Fit the data with a linear function of the features
Improvement: feature engineering (e.g. use Gaussian bases as nonlinear features)
Prediction: \( \hat{y} = w^{T} x^{(n)}\)
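A minimal sketch of the closed-form solution above, assuming NumPy arrays X (N x D design matrix) and y (hypothetical names); a least-squares solver is used instead of explicitly inverting \( X^T X \) for numerical stability:

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve w* = (X^T X)^{-1} X^T y via least squares (more stable than an explicit inverse)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a bias column of ones
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(w, X):
    """Prediction y_hat = w^T x for each row of X."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return Xb @ w
```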
Logistic Regression
Logistic function \( \sigma(z) = \frac{1}{1+e^{-z}} \) squashes the linear score into (0, 1), pushing points on either side of the decision boundary toward 0 or 1
Cross-entropy (log) loss: \( L(y, \hat{y}) = -y\log(\hat{y}) - (1-y)\log(1-\hat{y}) \)
Equivalent to maximizing the likelihood of a Bernoulli distribution
Solves linear binary classification
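A minimal sketch of training by batch gradient descent, assuming NumPy arrays X (with a bias column already added) and binary labels y in {0, 1} (hypothetical names); the gradient matches \( \sum_{n} x^{(n)} (\hat{y}^{(n)} - y^{(n)}) \) from the Gradient Descent branch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_steps=1000):
    """Minimize the cross-entropy loss with (batch) gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        y_hat = sigmoid(X @ w)        # predicted probabilities
        grad = X.T @ (y_hat - y)      # gradient of the cross-entropy loss
        w -= lr * grad / len(y)       # averaged gradient step
    return w
```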
Perceptron & SVM
SVM
Hard margin
\( y^{(n)}(w^T x^{(n)} + w_0) \geq 1\)
\( \arg\min_{w_0, w} ||w||^2 \)
Soft margin: L2 regularized hinge loss minimization
Margin: the distance of the closest point to the decision boundary
Perceptron: \( J(w) = -y^{(n)} (w^T x^{(n)} + w_0) \), applied to misclassified points
oscillates if the data are not linearly separable
the learned boundary may not be optimal (unlike the SVM, it does not maximize the margin)
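A minimal sketch of the perceptron update rule implied by the loss above, assuming NumPy arrays X and labels y in {-1, +1} (hypothetical names):

```python
import numpy as np

def fit_perceptron(X, y, n_epochs=100):
    """Update w, w0 on each misclassified point; converges only if the data are linearly separable."""
    w, w0 = np.zeros(X.shape[1]), 0.0
    for _ in range(n_epochs):
        for x_n, y_n in zip(X, y):
            if y_n * (w @ x_n + w0) <= 0:   # misclassified (or exactly on the boundary)
                w += y_n * x_n              # gradient step on -y (w^T x + w0)
                w0 += y_n
    return w, w0
```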
MLP (multi-layer perceptron)
Activation Function
Sigmoid, Tanh: lead to vanishing gradients (the derivative is very small for inputs far from 0)
ReLU: fixes vanishing gradients and helps train deep networks (often better with leaky ReLU)
Universal approximation power
Handle non-linear decision boundaries
Feature engineering: add non-linear features
Multi-layer structure: learn nonlinear features adaptively
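A minimal sketch of a one-hidden-layer MLP forward pass with ReLU, assuming NumPy weight matrices W1, W2 and bias vectors b1, b2 (hypothetical names); the hidden layer plays the role of adaptively learned non-linear features:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer MLP: non-linear hidden features, then a linear output layer."""
    h = relu(W1 @ x + b1)   # hidden layer = learned non-linear features
    return W2 @ h + b2      # output scores (logits)
```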
Regularization
Noise Robustness (use SGD noise to jump out of sharp minima into flat minima)
Early Stopping (stop when the validation loss stops decreasing)
Data augmentation (add transformed copies of instances to the dataset)
Dropout (randomly remove a subset of neurons during training)
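A minimal sketch of (inverted) dropout at training time, assuming h is a NumPy array of hidden activations (hypothetical name):

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True):
    """Randomly zero a fraction p_drop of the activations; rescale so the expected value is unchanged."""
    if not training:
        return h                                     # no dropout at test time
    mask = (np.random.rand(*h.shape) >= p_drop)      # keep each unit with probability 1 - p_drop
    return h * mask / (1.0 - p_drop)                 # inverted-dropout rescaling
```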
Differentiation
symbolic differentiation
automatic differentiation
backpropagation
numerical differentiation
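A minimal sketch contrasting numerical differentiation with an analytic gradient, using a toy cost \( J(w) = \frac{1}{2}||w||^2 \) as an assumed example (its analytic gradient is simply w):

```python
import numpy as np

def J(w):
    return 0.5 * np.sum(w ** 2)   # toy cost; analytic gradient is w

def numerical_grad(f, w, eps=1e-6):
    """Central finite differences: (f(w + eps e_i) - f(w - eps e_i)) / (2 eps) per coordinate."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

w = np.array([1.0, -2.0, 3.0])
print(np.allclose(numerical_grad(J, w), w))   # should print True: numerical matches analytic
```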
Improve optimization
Residual Connection
Batch normalization
Use ReLU
CNN
Parameter Sharing
Convolution
Idea: an MLP knows nothing about the image structure; convolutions exploit it with local, shared filters
Padding
Reduces the output size and gives invariance to small translations
Pooling
Strided Convolution
Channels: learn more complex features, increase expressiveness
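A minimal sketch of a single-channel 2D convolution (implemented as cross-correlation, as in most deep-learning libraries) plus max pooling, assuming NumPy arrays image and kernel (hypothetical names); the same kernel weights are reused at every location, which is the parameter sharing noted above:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid (no padding) 2D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)   # same weights shared across all locations
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling: keep the largest value in each size x size block."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))
```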
Concepts & Techniques
Train-validation-test split
Confusion Matrix
Cross Validation
L-fold (leave-one-out when L equals the number of samples)
Training and validation data come from the same set: split it into L equally sized pieces and hold each piece out for validation once
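A minimal sketch of L-fold cross-validation, assuming NumPy arrays X, y and a fit/predict pair of functions (all hypothetical names):

```python
import numpy as np

def cross_validation_score(X, y, fit, predict, L=5):
    """Average validation accuracy over L folds; each fold is held out exactly once."""
    idx = np.random.permutation(len(y))
    folds = np.array_split(idx, L)
    scores = []
    for i in range(L):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(L) if j != i])
        model = fit(X[train], y[train])
        scores.append(np.mean(predict(model, X[val]) == y[val]))
    return np.mean(scores)
```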
Curse of dimensionality (need exponentially more data; pairwise distances become similar) ---> manifold learning
No free lunch ---> the same algorithm can't perform well on all problems
Gradient Descent
HyperParam
Batch-size B
Large: accurate (low-noise) gradient estimate; converges smoothly to a minimum (the global one only if the loss is convex)
Small: each update is faster to compute, but the gradient estimate is noisier
Momentum: add a moving average of past gradients to reduce oscillation
Learning rate \( \alpha \)
Adagrad: adaptive gradient (updates different parameters with different learning rates)
\( w^{(t+1)} = w^{(t)} - \alpha \nabla J(w^{(t)}) \)
\( \nabla J(w) = \sum_{n} x^{(n)} (\hat{y}^{(n)} - y^{(n)}) \)
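A minimal sketch of the update rule above with the momentum term written as a moving average of past gradients, assuming a gradient function grad_J (hypothetical name):

```python
import numpy as np

def gradient_descent(grad_J, w0, lr=0.1, beta=0.9, n_steps=100):
    """w <- w - lr * v, where v is a moving average of gradients that damps oscillation."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(n_steps):
        v = beta * v + (1 - beta) * grad_J(w)   # moving average of the gradient
        w = w - lr * v
    return w

# usage on the toy cost J(w) = 0.5 ||w||^2, whose gradient is w
print(gradient_descent(lambda w: w, w0=[5.0, -3.0]))   # approaches the minimum at the origin
```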
Regularization
L1: Laplace prior, \( J_{reg}(w) = J(w) + \lambda ||w||_1 \)
L2: Gaussian prior, \( J_{reg}(w) = J(w) + \frac{\lambda}{2} ||w||_2^2 \)
Idea: prevent weights from becoming too large, which reduces overfitting in very complex models
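A minimal sketch of how the two penalties above change the gradient used in the update rule, assuming a gradient function grad_J of the unregularized cost (hypothetical name):

```python
import numpy as np

def l2_regularized_grad(grad_J, w, lam=0.01):
    """Gradient of J(w) + (lambda/2) ||w||_2^2: the penalty adds a lambda * w term."""
    return grad_J(w) + lam * w

def l1_subgradient(grad_J, w, lam=0.01):
    """Subgradient of J(w) + lambda ||w||_1: the penalty adds lambda * sign(w)."""
    return grad_J(w) + lam * np.sign(w)
```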
Bias-Variance tradeoff
High bias = simple model = large training error = underfit
High variance = complex model = small training error, large test error = overfit
Bootstrap Aggregation: reduces variance, keeps the same bias
Random Forest
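A minimal sketch of bootstrap aggregation (bagging), assuming a fit/predict pair of functions and NumPy arrays X_train, y_train, X_test (all hypothetical names):

```python
import numpy as np

def bagging_predict(X_train, y_train, X_test, fit, predict, n_models=25):
    """Train each model on a bootstrap sample (drawn with replacement) and average their predictions."""
    preds = []
    for _ in range(n_models):
        idx = np.random.randint(0, len(y_train), size=len(y_train))   # bootstrap sample
        model = fit(X_train[idx], y_train[idx])
        preds.append(predict(model, X_test))
    return np.mean(preds, axis=0)   # averaging reduces variance; each model keeps the same bias
```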