Machine Learning OVERVIEW
Supervised
(TASK-driven)
Deep learning
Classification
Predicts a "
CATEGORICAL
target value"
(=
label
or
class
or
category
)
BINARY classification =
2 labels to predict
(LINEAR Model)
Binary Logistic Regression
y = qualitative & binary: 1 (success) / 0 (failure)
Training set: D = (X, y)
X (n x d) and y (n x 1), a vector of qualitative values
Formal objective:
Determine the best link h between X and the probabilities π = P(y = 1) or 1 − π = P(y = 0)
=> We want to explain the membership of an instance i in a class (i.e., y = 1, success)
Hypothesis function π(x):
π(x) = P(y = 1 | X = x), where x = (1, x1, ..., xd)^T (T = transposed)
π(x) ∈ [0, 1]
=> the sigmoid (inverse logit) function maps the linear combination Ѳ^T x into [0, 1], which brings us back to a parametric method
Interpretation
:
The probability of observing success yi for an instance i, having observed the vector (x1, ..., xd)^T, is π(xi) (computed with the hypothesis function π(x) above)
For the failure case it is 1 − π(xi)
Objective function:
find the best (Ѳ1, Ѳ2, ..., Ѳd) with the Likelihood Maximization process (IRLS)
Discrimination model
WARNING:
Not good with high dimensional data
Cannot handle too much non-numerical, categorical data
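A minimal sketch of binary logistic regression, assuming scikit-learn's LogisticRegression and a small synthetic dataset (the data and feature weights below are purely illustrative):

```python
# Sketch: binary logistic regression with scikit-learn (synthetic data, illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                            # X (n x d): 100 instances, 3 features
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(int)     # y (n x 1): binary labels 1/0

model = LogisticRegression()                             # fits Ѳ by likelihood maximization
model.fit(X, y)
print(model.intercept_, model.coef_)                     # estimated parameters Ѳ
print(model.predict_proba(X[:5])[:, 1])                  # π(x) = P(y = 1 | x) for 5 instances
```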
MULTICLASS classification =
more than 2 labels to predict
(LINEAR Model)
Linear Discriminant Analysis (LDA)
Generative model
Discriminant Analysis principle:
Based on the Training set:
the prior probability of each class (πk) is known
for the distribution of the xi in each class k, compute:
=> the empirical mean and
=> the covariance matrix
We obtain P(X|Y).
We deduce P(Y|X)
=> Linear Discriminant Function that separates the classes
Assumptions:
identical variance-covariance matrices from one group to another (geometric interpretation: the point clouds have the same shape (and volume) in the representation space)
A Gaussian distribution of the xi is assumed for each class
PROs:
Good for multiclass classification
CONs:
poor results when the class boundaries are non-linear
Vocabulary:
prior probability = initial probability of an event, before any condition comes into play in its context
posterior probability = probability of an event after observing data
MAP approach = Maximum A Posteriori
Quadratic Discriminant Analysis
Assumptions:
a different covariance matrix for each class
a Gaussian distribution of the xi is assumed for each class
PROs: more accurate classification (quadratic boundaries)
CONs: more parameters to be calculated
Evaluation metrics on the test sets:
F1, F_beta, Recall, Precision
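A brief sketch contrasting LDA and QDA, assuming scikit-learn and synthetic Gaussian classes (the data and class means are illustrative):

```python
# Sketch: LDA vs. QDA with scikit-learn (synthetic Gaussian data, illustrative only).
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(1)
# Three Gaussian classes with different means (multiclass setting).
X = np.vstack([rng.normal(loc=m, size=(50, 2)) for m in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 50)

lda = LinearDiscriminantAnalysis().fit(X, y)       # shared covariance => linear boundaries
qda = QuadraticDiscriminantAnalysis().fit(X, y)    # per-class covariance => quadratic boundaries
print(lda.predict_proba(X[:3]))                    # posterior P(Y|X) from the generative model
print(qda.score(X, y))                             # training accuracy (for illustration)
```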
Linear SVM
2 models
Discrimination model
:
Goal = Estimate directly P(y|x)
What is learnt? = Decision boundary
models: Regressions / SVM
Generative model
:
Goal = Estimate P(x|y) and infer P(y|x)
What is learnt? = data probability distribution
models: GDA, Naive Bayes
Regression
Predicts a "
NUMERIC
target value"
Univariate
linear regression
(only 1 explanatory variable)
(LINEAR Model)
Training set: D = (X,y)
X (n x d) and y (n x 1)
h =
hypothesis
function
h ∈ H, hypothesis space
Ѳ0 = intercept and Ѳ1 = slope
Cost (or Error or Loss) Function / objective function:
quadratic loss function (most commonly used)
to be minimized
Minimizing the Cost Function J(Ѳ0, Ѳ1) thanks to
Gradient Descent
=> define the
HYPERPARAMETER α
=
learning rate
Formal objective:
Determine the best line h(X) (i.e. the best (Ѳ0, Ѳ1)) modeling the relationship between X and y, meaning y ≈ h(x) = Ѳ0 + Ѳ1·x
α
=
learning rate
:
Determines the size of each descent step
The lower the learning rate, the slower the convergence
Too high a learning rate might cause the algorithm to diverge
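A minimal sketch of gradient descent for univariate linear regression with a quadratic loss, using plain NumPy; the data, the learning rate α = 0.01 and the iteration count are illustrative assumptions:

```python
# Sketch: gradient descent on J(Ѳ0, Ѳ1) for univariate linear regression.
# Synthetic data and hyperparameters (alpha, number of iterations) are illustrative only.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 4.0 + 2.5 * x + rng.normal(scale=1.0, size=100)   # "true" line + noise

theta0, theta1 = 0.0, 0.0
alpha = 0.01                                          # learning rate (hyperparameter)
for _ in range(2000):
    residual = (theta0 + theta1 * x) - y              # h(x) - y
    grad0 = residual.mean()                           # dJ/dѲ0 for the quadratic loss
    grad1 = (residual * x).mean()                     # dJ/dѲ1
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)   # should approach the intercept 4.0 and slope 2.5
```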
Discrimination model
Multivariate
linear regression
(d explanatory variables)
(LINEAR Model)
Training set: D = (X,y)
X (n x d) and y (n x 1)
Formal objective:
Determine the best hyperplane h(X) (i.e. the best Ѳ = (Ѳ0, Ѳ1, ..., Ѳd)) modeling the relationship between X and y, meaning y ≈ h(x) = Ѳ0 + Ѳ1·x1 + ... + Ѳd·xd
h =
hypothesis
function
cost function J(Θ)
to be minimized:
Discrimination model
Evaluation metrics on the test sets:
MSE, RMSE, MAE, R²
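A short sketch of multivariate linear regression together with the test-set metrics listed above, assuming scikit-learn (the synthetic data and the train/test split are illustrative):

```python
# Sketch: multivariate linear regression and its evaluation metrics (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))                           # d = 4 explanatory variables
y = X @ np.array([1.0, -3.0, 2.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
h = LinearRegression().fit(X_train, y_train)            # best hyperplane h(X)

y_pred = h.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_test, y_pred))
print("R²  :", r2_score(y_test, y_pred))
```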
Generalized Linear Model (GLM)
Principle:
Training data:
Labelled
instances
Target
(or dependent) values (desired output signal)
Predictive
Model (h)
New data
Needs Training set D (X,y):
Matrix X (n x d) = input values (explanatory variables)
vector y (n x 1, or n x k for multiple target values) = target values
Objective function:
Find the best Ѳ (parameters of the model) such that y = h(X, Ѳ)
(best = minimization of the risk/cost function included in the objective function)
Optimization method (e.g. stochastic gradient descent)
Class of Hypothesis:
Parametric method (h(x)):
X = characteristics matrix (feature extractor) / explanatory variables
Ѳ = parameters / weights of the model
Non-parametric method
Regression and / or Classification
(NON LINEAR models)
CART
Classification And Regression Trees
(Decision tree)
Vocabulary:
A partition has a simple description like X2 ≤ s1, with s1 ∈ R
s1, s2, s3, s4, s5 are the five splits
Node: characterized by a split and a Region
Region: Ri
Root = first node
From a node, there are one left child and one right child
Leaves = final regions (Rk), each with a predicted value γk
Feature space = R1 ∪ R2 ∪ ... ∪ Rn
Principle:
decision tree predictor f̂(x), obtained by minimizing a given risk function or impurity measure (main optimization problem): R(f) := P(y ≠ f(x))
x ∈ X, and γk is the corresponding predicted value or class for the dependent variable (also called output variable)
Several decision tree models are possible: as many as there are input-space partitions and sets of γk
A Decision tree (DT) is a piece-wise constant estimate and a finite linear combination of hyper-rectangular indicator functions
Non-parametric model
This algorithm covers both classification and regression trees by adapting the optimization problem to the task.
PROs of CART:
powerful machine learning tool that does not rely on structural assumptions about the underlying space
interpretable and easy-to-explain predictors
can handle categorical values (by binarizing them)
can handle missing input values
CONs of CART:
greedy construction process
DTs have a high variance => unstable
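A brief sketch of a CART classifier, assuming scikit-learn's DecisionTreeClassifier on a toy dataset (the depth limit and dataset are illustrative choices):

```python
# Sketch: CART decision tree (classification) with scikit-learn, on a toy dataset.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # greedy, impurity-based splits
tree.fit(X, y)

# The learnt partition: each internal node is a split "Xj <= s", each leaf a region Rk
# with a predicted class γk.
print(export_text(tree))
print(tree.predict(X[:5]))
```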
K-NN (Nearest Neighbor)
Classification:
the majority class label (majority voting) is assigned;
Regression:
the average response is assigned
Non-parametric model
Each time a prediction is requested, the training data are processed => they need to be stored.
k = number of nearest neighbors
to look for
=> parameter to be defined
k is often data-dependent and heuristic-based
k can be defined thanks to cross-validation
Distance function
=> to be defined according to the type of the features.
Real-valued features:
Euclidean distance,
L1-norm
Binary-valued features:
Hamming distance,
Cosine distance
Feature normalization
:
zero mean
unit variance
k-NN is asymptotically consistent
(a theoretical property):
infinite training data and
large enough k,
=> k-NN approaches the best possible classifier (Bayes optimal).
Not good with high dimensions
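A compact sketch of k-NN classification with standardized features, assuming scikit-learn; k = 5 and the toy dataset are illustrative choices:

```python
# Sketch: k-NN classification with standardized features (k is a user-chosen parameter).
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Normalize to zero mean / unit variance, then vote among the k nearest neighbours
# (Euclidean distance by default). k could also be chosen by cross-validation.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(knn, X, y, cv=5).mean())
```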
Unsupervised
(DATA-driven)
Clustering
Objective of the task
=
partitioning data into homogeneous subgroups (called clusters) that share common characteristics
Why Clustering data?
Exploratory analysis of unlabelled data
Identify:
instances that have similar behaviors (ex: Text Document corpus annotation => groups documents by subject)
communities on a social network
recurring patterns in financial transactions
pixels of the same object in an image (image segmentation)
Data visualization by looking at a representative example per cluster
Can allow a cluster instance properties to be transferred to all
instances of the same cluster (useful when labeling is expensive)
Anomaly / novelty detection
Visualization
(Association rules)
Principle:
Training data:
Unlabelled
instances
Hidden patterns
are discovered (interpretation left to the user)
Needs
:
Matrix X
a similarity measure
Objective
: Grouping similar instances
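A minimal sketch of clustering with k-means, assuming scikit-learn; k-means is only one possible clustering algorithm, and the blob data and number of clusters are illustrative:

```python
# Sketch: grouping similar instances with k-means (one possible clustering algorithm).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Unlabelled matrix X: three blobs of points (no y vector).
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0.0, 4.0, 8.0)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])        # cluster assigned to each instance
print(kmeans.cluster_centers_)    # one representative point per cluster
```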
Dimensionality reduction
Generalities
:
High dimension: n<<d
"Dimension" refers to d
High-dimensional dataset = many more features than instances/observations
Issues with High dimensions:
Classical approaches do not work
Overfitting
Feature selection
(Feature engineering):
To reduce the number of variables, i.e. "d"
(quantum computing may help)
Does not modify the original Database
Dimensionality reduction:
From X(n x d) to X(n x M)
with M<<d
Principle
:
find a low-dimensional representation with as
much variation as possible
PCA:
Principal Component Analysis
(linear combination)
Principal Components
:
linear combinations
of the original features X1, . . . , Xd and denoted Zm
ϕm^T = (ϕ1m, ..., ϕdm) = free parameters (the loadings), with m ∈ [1, ..., M] and M << d
∀m, the Zm are mutually uncorrelated => orthogonal to each other
∀m, the Zm have maximal variance
The Zm are of decreasing importance => V(Z1) ≥ V(Z2) ≥ ... ≥ V(ZM)
The Zm are defined iteratively
PCA concept:
Analysis of the structure of the variance-covariance
matrix (i.e. variability, dispersion of data)
=> Goal: describe a maximum of this variability using M < d components
PCA results:
reduction of the data to M new descriptors
visualization of the data in 2 or 3 dimensions (if M = 2 or 3)
interpretation of the data: inter-variable links
PCA added value for data pre-processing:
to get information about data quality
to identify the most informative features and the non-informative ones
to identify outliers or a group of atypical observations
PCA replaces the d original features Xj (having different variances and correlations with each other)
by M new components Zm (M << d), mutually orthogonal, i.e. Cov(Zm, Zm′) = 0 ∀m ≠ m′, and having the highest variance
PVE = Proportion of Variance Explained
:
∀m, PVE(Zm) > 0
The PVEs of all components sum to 1
if the original variables are strongly correlated with each other,
a reduced number of princ. comp. can explain 80% to 90% of the variance
Interpretation:
check the situation in (O,Z1,Z2), (O,Z2,Z3), and (O,Z1,Z3) before drawing conclusions
The Zm highlight linear relationships between the original features
Only works with QUANTITATIVE data
Modifies the database
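A short PCA sketch showing the components and their proportion of variance explained (PVE), assuming scikit-learn and standardized toy data (illustrative only):

```python
# Sketch: PCA on standardized quantitative data, with the PVE of each component.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)              # quantitative features only
X_std = StandardScaler().fit_transform(X)      # PCA is sensitive to feature scales

pca = PCA(n_components=3)                      # M = 3 << d
Z = pca.fit_transform(X_std)                   # new components Z1, Z2, Z3 (mutually orthogonal)
print(pca.explained_variance_ratio_)           # PVE(Z1) ≥ PVE(Z2) ≥ PVE(Z3)
print(pca.explained_variance_ratio_.sum())     # share of the total variability kept
```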
Without linear combination hypothesis:
T-SNE
PLS = supervised alternative to PCA
FCA / MCA = manage categorical data
Others
Semi-supervised
Classification
Clustering
Training data:
Partly-labelled
instances
Known labels: used to label new instance
ex: photo-hosting services such as Google Photos, Instagram
Reinforcement
(MISTAKE-driven)
Game AI
Real-time decisions
Robot navigation
Principle:
Learning system (= Agent) observes the environment
The agent takes decisions based on a certain policy
Good
decision is
rewarded
Bad
decision is
penalized
The agent updates the policy (learns) and iterates until the optimal policy is found.
Online training
The algorithm learns incrementally by being fed instances sequentially
When new instances come in, the algorithm is trained only on them
=>
need to be constantly monitored
(to manage incorrectly labelled instance)
Ideal when data comes in as a continuous flow
Recommended when the training set requires a lot of computing resources.
Data
Numeric
Data => often STRUCTURED
Discrete
Continuous
Categorical
data => often UNSTRUCTURED
Ordinal
(can be ordered)
Cardinal / Nominal
(cannot be ordered)
SEMI-STRUCTURED data
Data REPRESENTATION
X =
dataset
:
set of
n instances
(xi)
Matrix
Instance:
set of
d
variables
, also called
features
or
attributes
row of the matrix (xi)
Variable
or
feature
:
column of the matrix (Xj)
Xi,j = jth variable of the ith instance
Data PRE-PROCESSING
Handling
Missing Values
MCAR = Missing Completely At Random
P(Xj missing|X) = P(Xj missing)
=> "missing because of bad luck"
MAR = Missing At Random
P(Xj missing|X) = P(Xj missing|Xobserved)
=> the reason why the data is missing is contained in the observed data.
MNAR = Missing Not At Random
P(Xj missing|X) = P(Xj missing|Xmissing)
=> the reason why the data is missing is linked with the missing data of other features or even a missing feature (eg. Time)
Partial Non-Response
: missing values for some features of an instance
Complete Non-Response
: missing values for all the features of an instance
Analysis
of the Missing Values:
General rules
:
Without evidence to the contrary, it is fair to assume MAR
MNAR missing data are the worst-case scenario.
What is the percentage of missing values per variable? By group of population?
Is there any predictor of missing values?
Strategy
to manage Missing Values
Listwise deletion (= ignoring)
ONLY for MCAR
ONLY if a handful of instances are concerned
Imputing
Univariate feature imputation
:
Missing values of a feature are determined by using the observed values of the
feature itself
Mean
Median
Most Frequent Value
Constant
Hot deck =
random value taken among the concerned feature's observed values
Cold deck =
random value taken among the concerned feature's observed values in ANOTHER DATASET
Nearest Neighbour
Multivariate feature imputation
:
Missing values of a feature are determined by using the observed values of
different features
.
Regression imputation
:
regression model which predicts missing values using the other features (incl. the target values)
Assumes MAR missingness
=> DANGEROUS method: Strengthens existing correlation
Stochastic Regression Imputation
:
Regression model + Noise added to the predictions.
Assumes MAR missingness
=> GOOD method: preserves the existing correlations among features
Single Imputation
Multiple Imputation
The dataset must not contain Missing Values
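A small sketch of univariate (mean) versus multivariate (regression-based) imputation, assuming scikit-learn's SimpleImputer and IterativeImputer on a toy matrix with NaN values:

```python
# Sketch: univariate vs. multivariate imputation of missing values (NaN) with scikit-learn.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import SimpleImputer, IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

# Univariate: each missing value is replaced using the observed values of the same feature.
print(SimpleImputer(strategy="mean").fit_transform(X))

# Multivariate: each feature with missing values is modelled from the other features
# (round-robin regression), close in spirit to regression imputation.
print(IterativeImputer(random_state=0).fit_transform(X))
```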
Handling
Outliers
(Extreme value of the distribution of a variable)
ERROR
=> handle like a Missing Value
EXTRAORDINARY SITUATION
=> must be included in the analysis
=> values may have to be adjusted
Outlier detection:
Inter Quartile Range (IQR) method
PCA
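A minimal sketch of the IQR method for flagging outliers, using NumPy; the 1.5 × IQR fences are the usual convention and the sample values are made up:

```python
# Sketch: outlier detection with the Inter Quartile Range (IQR) method.
import numpy as np

x = np.array([2.1, 2.3, 2.2, 2.5, 2.4, 2.6, 2.3, 9.8])   # 9.8 is an extreme value

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr             # usual 1.5 * IQR fences

outliers = x[(x < lower) | (x > upper)]
print(outliers)                                           # -> [9.8]
```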
Handling
Categorical data
Categorical data are often not managed by ML algorithms
=> to be turned into numerical data
ORDINAL features:
Map to NUMERICAL values preserving the information of their order (e.g. sizes)
Threshold encoding with binary values (0/1) (e.g. for > or <).
NOMINAL features
One-Hot Encoding
(e.g. colors):
Creation of dummy features with values 1/0
n−1 dummy features for a NOMINAL feature with n values (to avoid multicollinearity)
Issue: large number of features => degrades the performance of the ML algorithm
Feature hashing
:
Use a hash function to turn each value of a
nominal feature into an index of a lower-dimensional vector.
Representation learning
:
Replace each value of a nominal feature
with a low-dimensional vector called an embedding.
=> widely used in DEEP LEARNING
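A short sketch of one-hot encoding for a nominal feature, assuming pandas and keeping n−1 dummy columns (the colour values are illustrative):

```python
# Sketch: one-hot encoding of a nominal feature, keeping n-1 dummy columns.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# drop_first=True keeps n-1 dummies (here 2 for 3 colours) to avoid multicollinearity.
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)
print(dummies)
```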
Feature
scaling
Different scales => poor behaviour of cost-function optimization (incl. gradient descent), so poor performance of the ML algorithm
Not necessary for:
random forests
decision trees
2 methods
Min-max scaling
:
Rescale the numeric features to the range [0, 1]
xmin,j is the minimum value of Xj
xmax,j is the maximum value of Xj
For each value xi,j of Xj: x′i,j = (xi,j − xmin,j) / (xmax,j − xmin,j)
+: bounds the value to a specific range
-: sensitive to outliers
Standardization
:
Rescale the numeric features so that values have a zero mean and unit variance
µj is the mean value of Xj.
σj is the standard deviation of Xj
For each value xi,j of Xj: x′i,j = (xi,j − µj) / σj
+: less sensitive to outliers than min-max scaling
-: does NOT bound the values to a specific range
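A final sketch comparing the two scaling methods on a toy feature, assuming scikit-learn's MinMaxScaler and StandardScaler:

```python
# Sketch: min-max scaling vs. standardization of a numeric feature (toy values).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])       # one feature Xj with an extreme value

print(MinMaxScaler().fit_transform(X).ravel())    # bounded in [0, 1], pulled by the outlier
print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance, unbounded
```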