Machine Learning OVERVIEW
Supervised
(TASK-driven)
Deep learning
Classification
Predicts a "
CATEGORICAL
target value"
(=
label
or
class
or
category
)
BINARY classification =
2 labels to predict
(LINEAR Model)
Binary Logistic Regression
y = qualitative & binary: 1 (success) / 0 (failure)
Training set: D = (X, y)
X (n x d) and y (n x 1), a vector of qualitative values
Formal objective:
Determine the best link h between X and the probabilities π = P(y = 1) or 1 − π = P(y = 0)
=> We want to explain the membership of an instance i in a class (i.e., y = 1, success)
Hypothesis function π(x):
π(x) = P(y = 1 | X = x), where x = (1, x1, ..., xd)^T (T = transposed)
π(x) ∈ [0, 1]
=> the sigmoid (inverse logit) function maps the linear combination Ѳ^T x into [0, 1], which brings us back to a parametric method
Interpretation
:
The probability of observing success yi for an instance i, having observed the vector (x1, ..., xd)^T, is π(xi) (computed with the hypothesis function π(x) above)
For the failure case it is 1 − π(xi)
Objective function:
find the best (Ѳ1, Ѳ2, ..., Ѳd) with the Likelihood Maximization process (IRLS)
Discrimination model
WARNING:
Not good with high dimensional data
Cannot handle too much non-numerical, categorical data
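A minimal sketch of binary logistic regression, assuming scikit-learn's LogisticRegression and a small synthetic dataset (the data and feature weights below are purely illustrative):

```python
# Sketch: binary logistic regression with scikit-learn (synthetic data, illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                            # X (n x d): 100 instances, 3 features
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(int)     # y (n x 1): binary labels 1/0

model = LogisticRegression()                             # fits Ѳ by likelihood maximization
model.fit(X, y)
print(model.intercept_, model.coef_)                     # estimated parameters Ѳ
print(model.predict_proba(X[:5])[:, 1])                  # π(x) = P(y = 1 | x) for 5 instances
```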
MULTICLASS classification =
more than 2 labels to predict
(LINEAR Model)
Linear Discriminant Analysis (LDA)
Generative model
Discriminant Analysis principle:
Based on the Training set:
the prior probability of each class (πk) is known
for the distribution of the xi in each class k, compute:
=> the empirical mean and
=> the covariance matrix
We obtain P(X|Y).
We deduce P(Y|X)
=> Linear Discriminant Function that separates the classes
Assumptions:
identical variance-covariance matrices from one group to another (geometric interpretation: the point clouds have the same shape (and volume) in the representation space)
A Gaussian distribution of the xi is assumed for each class
PROs:
Good for multiclass classification
CONs:
poor results when the class boundaries are non-linear
Vocabulary:
prior probability = initial probability of an event, before any condition comes into play in its context
posterior probability = probability of an event after observing data
MAP approach = Maximum A Posteriori
Quadratic Discriminant Analysis
Assumptions:
a different covariance matrix for each class
a Gaussian distribution of the xi is assumed for each class
PROs: more accurate classification (quadratic boundaries)
CONs: more parameters to be calculated
Evaluation metrics on the test sets:
F1, F_beta, Recall, Precision
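A brief sketch contrasting LDA and QDA, assuming scikit-learn and synthetic Gaussian classes (the data and class means are illustrative):

```python
# Sketch: LDA vs. QDA with scikit-learn (synthetic Gaussian data, illustrative only).
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(1)
# Three Gaussian classes with different means (multiclass setting).
X = np.vstack([rng.normal(loc=m, size=(50, 2)) for m in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 50)

lda = LinearDiscriminantAnalysis().fit(X, y)       # shared covariance => linear boundaries
qda = QuadraticDiscriminantAnalysis().fit(X, y)    # per-class covariance => quadratic boundaries
print(lda.predict_proba(X[:3]))                    # posterior P(Y|X) from the generative model
print(qda.score(X, y))                             # training accuracy (for illustration)
```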
Linear SVM
2 models
Discrimination model
:
Goal = Estimate directly P(y|x)
What is learnt? = Decision boundary
models: Regressions / SVM
Generative model
:
Goal = Estimate P(x|y) and infer P(y|x)
What is learnt? = data probability distribution
models: GDA, Naive Bayes
Regression
Predicts a "
NUMERIC
target value"
Univariate
linear regression
(only 1 explanatory variable)
(LINEAR Model)
Training set: D = (X,y)
X (n x d) and y (n x 1)
h =
hypothesis
function
h ∈ H, hypothesis space
Ѳ0 = intercept and Ѳ1 = slope
Cost (or Error or Loss) Function / objective function:
quadratic loss function (most commonly used)
to be minimized
Minimizing the Cost Function J(Ѳ0, Ѳ1) thanks to
Gradient Descent
=> define the
HYPERPARAMETER α
=
learning rate
Formal objective:
Determine the best line h(X) (i.e. the best (Ѳ0, Ѳ1)) modeling the relationship between X and y, meaning y ≈ h(x) = Ѳ0 + Ѳ1·x
α
=
learning rate
:
Determines the size of each descent step
The lower the learning rate, the slower the convergence
Too high a learning rate might cause the algorithm to diverge
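A minimal sketch of gradient descent for univariate linear regression with a quadratic loss, using plain NumPy; the data, the learning rate α = 0.01 and the iteration count are illustrative assumptions:

```python
# Sketch: gradient descent on J(Ѳ0, Ѳ1) for univariate linear regression.
# Synthetic data and hyperparameters (alpha, number of iterations) are illustrative only.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 4.0 + 2.5 * x + rng.normal(scale=1.0, size=100)   # "true" line + noise

theta0, theta1 = 0.0, 0.0
alpha = 0.01                                          # learning rate (hyperparameter)
for _ in range(2000):
    residual = (theta0 + theta1 * x) - y              # h(x) - y
    grad0 = residual.mean()                           # dJ/dѲ0 for the quadratic loss
    grad1 = (residual * x).mean()                     # dJ/dѲ1
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)   # should approach the intercept 4.0 and slope 2.5
```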
Discrimination model
Multivariate
linear regression
(d explanatory variables)
(LINEAR Model)
Training set: D = (X,y)
X (n x d) and y (n x 1)
Formal objective:
Determine the best hyperplane h(X) (i.e. the best Ѳ = (Ѳ0, Ѳ1, ..., Ѳd)) modeling the relationship between X and y, meaning y ≈ h(x) = Ѳ0 + Ѳ1·x1 + ... + Ѳd·xd
h =
hypothesis
function
cost function J(Θ)
to be minimized:
Discrimination model
Evaluation metrics on the test sets:
MSE, RMSE, MAE, R²
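A short sketch of multivariate linear regression together with the test-set metrics listed above, assuming scikit-learn (the synthetic data and the train/test split are illustrative):

```python
# Sketch: multivariate linear regression and its evaluation metrics (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))                           # d = 4 explanatory variables
y = X @ np.array([1.0, -3.0, 2.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
h = LinearRegression().fit(X_train, y_train)            # best hyperplane h(X)

y_pred = h.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_test, y_pred))
print("R²  :", r2_score(y_test, y_pred))
```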
Generalized Linear Model (GLM)
Principle:
Training data:
Labelled
instances
Target
(or dependent) values (desired output signal)
Predictive
Model (h)
New data
Needs Training set D (X,y):
Matrix X (n x d) = input values (explanatory variables)
vector y (n x 1, or n x k for multiple target values) = target values
Objective function:
Find the best Ѳ (parameters of the model) such that y = h(X, Ѳ)
(best = minimization of the risk/cost function included in the objective function)
Optimization method (e.g. stochastic gradient descent)
Class of Hypothesis:
Parametric method (h(x)):
X = characteristics matrix (feature extractor) / explanatory variables
Ѳ = parameters / weights of the model
Non-parametric method
Regression and / or Classification
(NON LINEAR models)
CART
Classification And Regression Trees
(Decision tree)
Vocabulary:
A partition has a simple description like X2 ≤ s1, with s1 ∈ R
s1, s2, s3, s4, s5 are the five splits
Node: characterized by a split and a Region
Region: Ri
Root = first node
From a node, there are one left child and one right child
Leaves = final regions (Rk), each with a predicted value γk
Feature space = R1 ∪ R2 ∪ ... ∪ Rn
Principle:
decision tree predictor f̂(x), obtained by minimizing a given risk function or impurity measure (main optimization problem): R(f) := P(y ≠ f(x))
x ∈ X, and γk is the corresponding predicted value or class for the dependent variable (also called output variable)
Several decision tree models are possible: as many as there are input-space partitions and sets of γk
A Decision tree (DT) is a piece-wise constant estimate and a finite linear combination of hyper-rectangular indicator functions
Non-parametric model
This algorithm covers both classification and regression trees by adapting the optimization problem to the task.
PROs of CART:
powerful machine learning tool that does not rely on structural assumptions about the underlying space
interpretable and easy-to-explain predictors
can handle categorical values (by binarizing them)
can handle missing input values
CONs of CART:
greedy construction process
DTs have a high variance => unstable
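A brief sketch of a CART classifier, assuming scikit-learn's DecisionTreeClassifier on a toy dataset (the depth limit and dataset are illustrative choices):

```python
# Sketch: CART decision tree (classification) with scikit-learn, on a toy dataset.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # greedy, impurity-based splits
tree.fit(X, y)

# The learnt partition: each internal node is a split "Xj <= s", each leaf a region Rk
# with a predicted class γk.
print(export_text(tree))
print(tree.predict(X[:5]))
```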
K-NN (Nearest Neighbor)
Classification:
the majority class label (majority voting) is assigned;
Regression:
the average response is assigned
Non-parametric model
Each time a prediction is requested, the training data are processed => they need to be stored.
k = number of nearest neighbors
to look for
=> parameter to be defined
k is often data-dependent and heuristic-based
k can be defined thanks to cross-validation
Distance function
=> to be defined according to the type of the features.
Real-valued features:
Euclidean distance,
L1-norm
Binary-valued features:
Hamming distance,
Cosine distance
Feature normalization
:
zero mean
unit variance
k-NN is asymptotically consistent
(a theoretical property):
infinite training data and
large enough k,
=> k-NN approaches the best possible classifier (Bayes optimal).
Not good with high dimensions
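A compact sketch of k-NN classification with standardized features, assuming scikit-learn; k = 5 and the toy dataset are illustrative choices:

```python
# Sketch: k-NN classification with standardized features (k is a user-chosen parameter).
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Normalize to zero mean / unit variance, then vote among the k nearest neighbours
# (Euclidean distance by default). k could also be chosen by cross-validation.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(knn, X, y, cv=5).mean())
```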
Unsupervised
(DATA-driven)
Clustering
Objective of the task
=
partitioning data into homogeneous subgroups (called clusters) that share common characteristics
Why Clustering data?
Exploratory analysis of unlabelled data
Identify:
instances that have similar behaviors (ex: Text Document corpus annotation => groups documents by subject)
communities on a social network
recurring patterns in financial transactions
pixels of the same object in an image (image segmentation)
Data visualization by looking at a representative example per cluster
Can allow a cluster instance properties to be transferred to all
instances of the same cluster (useful when labeling is expensive)
Anomaly / novelty detection
Visualization
(Association rules)
Principle:
Training data:
Unlabelled
instances
Hidden patterns
are discovered (interpretation left to the user)
Needs
:
Matrix X
a similarity measure
Objective
: Grouping similar instances
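A minimal sketch of clustering with k-means, assuming scikit-learn; k-means is only one possible clustering algorithm, and the blob data and number of clusters are illustrative:

```python
# Sketch: grouping similar instances with k-means (one possible clustering algorithm).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Unlabelled matrix X: three blobs of points (no y vector).
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0.0, 4.0, 8.0)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])        # cluster assigned to each instance
print(kmeans.cluster_centers_)    # one representative point per cluster
```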
Dimensionality reduction
Generalities
:
High dimension: n<<d
"Dimension" refers to d
High-dimensional dataset = many more features than instances/observations
Issues with High dimensions:
Classical approaches do not work
Overfitting
Feature selection
(Feature engineering):
To reduce the number of variables, i.e. "d"
(quantum computing may help)
Does not modify the original Database
Dimensionality reduction:
From X(n x d) to X(n x M)
with M<<d
Principle
:
find a low-dimensional representation with as
much variation as possible
PCA:
Principal Component Analysis
(linear combination)
Principal Components
:
linear combinations
of the original features X1, . . . , Xd and denoted Zm
ϕm^T = (ϕ1m, ..., ϕdm) = free parameters (the loadings), with m ∈ [1, ..., M] and M << d
∀m, the Zm are mutually uncorrelated => orthogonal to each other
∀m, the Zm have maximal variance
The Zm are of decreasing importance => V(Z1) ≥ V(Z2) ≥ ... ≥ V(ZM)
The Zm are defined iteratively
PCA concept:
Analysis of the structure of the variance-covariance
matrix (i.e. variability, dispersion of data)
=> Goal: describe a maximum of this variability using M < d components
PCA results:
reduction of the data to M new descriptors
visualization of the data in 2 or 3 dimensions (if M = 2 or 3)
interpretation of the data: inter-variable links
PCA added value for data pre-processing:
to get information about data quality
to identify the most informative features and the non-informative ones
to identify outliers or a group of atypical observations
PCA replaces the d original features Xj (having different variances and correlations with each other)
by M new components Zm (M << d), mutually orthogonal, i.e. Cov(Zm, Zm′) = 0 ∀m ≠ m′, and having the highest variance
PVE = Proportion of Variance Explained
:
∀m, PVE(Zm) > 0
The PVEs of all components sum to 1
if the original variables are strongly correlated with each other,
a reduced number of princ. comp. can explain 80% to 90% of the variance
Interpretation:
check the situation in (O,Z1,Z2), (O,Z2,Z3), and (O,Z1,Z3) before drawing conclusions
The Zm highlight linear relationships between the original features
Only works with QUANTITATIVE data
Modifies the database
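A short PCA sketch showing the components and their proportion of variance explained (PVE), assuming scikit-learn and standardized toy data (illustrative only):

```python
# Sketch: PCA on standardized quantitative data, with the PVE of each component.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)              # quantitative features only
X_std = StandardScaler().fit_transform(X)      # PCA is sensitive to feature scales

pca = PCA(n_components=3)                      # M = 3 << d
Z = pca.fit_transform(X_std)                   # new components Z1, Z2, Z3 (mutually orthogonal)
print(pca.explained_variance_ratio_)           # PVE(Z1) ≥ PVE(Z2) ≥ PVE(Z3)
print(pca.explained_variance_ratio_.sum())     # share of the total variability kept
```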
Without linear combination hypothesis:
T-SNE
PLS = supervised alternative to PCA
FCA / MCA = manage categorical data
Others
Semi-supervised
Classification
Clustering
Training data:
Partly-labelled
instances
Known labels: used to label new instance
ex: photo-hosting services such as Google Photos, Instagram
Reinforcement
(MISTAKE-driven)
Game AI
Real-time decisions
Robot navigation
Principle:
Learning system (= Agent) observes the environment
The agent takes decisions based on a certain policy
Good
decision is
rewarded
Bad
decision is
penalized
The agent updates the policy (learns) and iterates until the optimal policy is found.
Online training
The algorithm learns incrementally by being fed instances sequentially
When new instances come in, the algorithm is trained only on them
=>
need to be constantly monitored
(to manage incorrectly labelled instance)
Ideal when data comes in as a continuous flow
Recommended when the training set requires a lot of computing resources.
Data
Numeric
Data => often STRUCTURED
Discrete
Continuous
Categorical
data => often UNSTRUCTURED
Ordinal
(can be ordered)
Cardinal / Nominal
(cannot be ordered)
SEMI-STRUCTURED data
Data REPRESENTATION
X =
dataset
:
set of
n instances
(xi)
Matrix
Instance:
set of
d
variables
, also called
features
or
attributes
row of the matrix (xi)
Variable
or
feature
:
column of the matrix (Xj)
Xi,j = jth variable of the ith instance
Data PRE-PROCESSING
Handling
Missing Values
MCAR = Missing Completely At Random
P(Xj missing|X) = P(Xj missing)
=> "missing because of bad luck"
MAR = Missing At Random
P(Xj missing|X) = P(Xj missing|Xobserved)
=> the reason why the data is missing is contained in the observed data.
MNAR = Missing Not At Random
P(Xj missing|X) = P(Xj missing|Xmissing)
=> the reason why the data is missing is linked with the missing data of other features or even a missing feature (eg. Time)
Partial Non-Response
: missing values for some features of an instance
Complete Non-Response
: missing values for all the features of an instance
Analysis
of the Missing Values:
General rules
:
Without evidence to the contrary, it is fair to assume MAR
MNAR missing data are the worst-case scenario.
What is the percentage of missing values per variable? By group of population?
Is there any predictor of missing values?
Strategy
to manage Missing Values
Listwise deletion (= ignoring)
ONLY for MCAR
ONLY if a handful of instances are concerned
Imputing
Univariate feature imputation
:
Missing values of a feature are determined by using the observed values of the
feature itself
Mean
Median
Most Frequent Value
Constant
Hot deck =
random value taken among the concerned feature's observed values
Cold deck =
random value taken among the concerned feature's observed values in ANOTHER DATASET
Nearest Neighbour
Multivariate feature imputation
:
Missing values of a feature are determined by using the observed values of
different features
.
Regression imputation
:
regression model which predicts missing values using the other features (incl. the target values)
Assumes MAR missingness
=> DANGEROUS method: Strengthens existing correlation
Stochastic Regression Imputation
:
Regression model + Noise added to the predictions.
Assumes MAR missingness
=> GOOD method: preserves the existing correlations among features
Single Imputation
Multiple Imputation
The dataset must not contain Missing Values
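A small sketch of univariate (mean) versus multivariate (regression-based) imputation, assuming scikit-learn's SimpleImputer and IterativeImputer on a toy matrix with NaN values:

```python
# Sketch: univariate vs. multivariate imputation of missing values (NaN) with scikit-learn.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import SimpleImputer, IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

# Univariate: each missing value is replaced using the observed values of the same feature.
print(SimpleImputer(strategy="mean").fit_transform(X))

# Multivariate: each feature with missing values is modelled from the other features
# (round-robin regression), close in spirit to regression imputation.
print(IterativeImputer(random_state=0).fit_transform(X))
```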
Handling
Outliers
(Extreme value of the distribution of a variable)
ERROR
=> handle like a Missing Value
EXTRAORDINARY SITUATION
=> must be included in the analysis
=> values may have to be adjusted
Outlier detection:
Inter Quartile Range (IQR) method
PCA
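A minimal sketch of the IQR method for flagging outliers, using NumPy; the 1.5 × IQR fences are the usual convention and the sample values are made up:

```python
# Sketch: outlier detection with the Inter Quartile Range (IQR) method.
import numpy as np

x = np.array([2.1, 2.3, 2.2, 2.5, 2.4, 2.6, 2.3, 9.8])   # 9.8 is an extreme value

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr             # usual 1.5 * IQR fences

outliers = x[(x < lower) | (x > upper)]
print(outliers)                                           # -> [9.8]
```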
Handling
Categorical data
Categorical data are often not managed by ML algorithms
=> to be turned into numerical data
ORDINAL features:
Map to NUMERICAL values preserving the information of their order (e.g. sizes)
Threshold encoding with binary values (0/1) (e.g. for > or <).
NOMINAL features
One-Hot Encoding
(e.g. colors):
Creation of dummy features with values 1/0
n−1 dummy features for a NOMINAL feature with n values (to avoid multicollinearity)
Issue: large number of features => degrades the performance of the ML algorithm
Feature hashing
:
Use a hash function to turn each value of a
nominal feature into an index of a lower-dimensional vector.
Representation learning
:
Replace each value of a nominal feature
with a low-dimensional vector called an embedding.
=> widely used in DEEP LEARNING
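A short sketch of one-hot encoding for a nominal feature, assuming pandas and keeping n−1 dummy columns (the colour values are illustrative):

```python
# Sketch: one-hot encoding of a nominal feature, keeping n-1 dummy columns.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# drop_first=True keeps n-1 dummies (here 2 for 3 colours) to avoid multicollinearity.
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)
print(dummies)
```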
Feature
scaling
Different scales => poor behaviour of cost-function optimization (incl. gradient descent), so poor performance of the ML algorithm
Not necessary for:
random forests
decision trees
2 methods
Min-max scaling
:
Rescale the numeric features to the range [0, 1]
xmin,j is the minimum value of Xj
xmax,j is the maximum value of Xj
For each value xi,j of Xj: x′i,j = (xi,j − xmin,j) / (xmax,j − xmin,j)
+: bounds the value to a specific range
-: sensitive to outliers
Standardization
:
Rescale the numeric features so that values have a zero mean and unit variance
µj is the mean value of Xj.
σj is the standard deviation of Xj
For each value xi,j of Xj: x′i,j = (xi,j − µj) / σj
+: less sensitive to outliers than min-max scaling
-: does NOT bound the values to a specific range
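A final sketch comparing the two scaling methods on a toy feature, assuming scikit-learn's MinMaxScaler and StandardScaler:

```python
# Sketch: min-max scaling vs. standardization of a numeric feature (toy values).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])       # one feature Xj with an extreme value

print(MinMaxScaler().fit_transform(X).ravel())    # bounded in [0, 1], pulled by the outlier
print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance, unbounded
```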