Machine Learning - Coggle Diagram

- - - - Linear regression
        
        Penalized Regression
        
        In general
        
        Many features (usually correlated) can lead to overfitting and unnecessary complexity
        
        Regularization forces the algorithm to build a less complex model
        
        Leads to
        
        Slightly higher bias
        
        Significantly reduced variance
        
        Penalized regression can be used to avoid overfitting
        
        To use it - standardize the features
        
        Selecting a good alpha value is critical - cross-validation is used for this
        
        Types
        
        Lasso - L1 - Manhattan norm - penalty on the sum of |Bj|
        
        Automatically performs a type of feature selection
        
        Can shrink coefficients all the way to zero
        
        Helpful if you think many features might be irrelevant or in high dimensional data
        
        Elastic Net
        
        Uses both
        
        Ridge - L2 - Euclidean norm - penalty on the sum of Bj^2
        
        Shrink coefficients toward zero
        
        Many features are relevant, but overall coefficient size must be controlled for stability and to reduce overfitting - handles multicollinearity
        
        Use when you think all the features are relevant but are highly correlated
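To make the difference between the penalties concrete, here is a minimal sketch under a strong simplifying assumption: a single standardized feature (or an orthonormal design), where each penalized coefficient has a closed form in terms of the unpenalized OLS coefficient. The function names, the `l1_ratio` parameter and the exact scaling of the penalty terms are assumptions made for this illustration, not part of the original notes.

```python
def ridge_shrink(b_ols, alpha):
    """Minimizer of 0.5*(b - b_ols)**2 + 0.5*alpha*b**2: proportional shrinkage."""
    return b_ols / (1.0 + alpha)


def lasso_shrink(b_ols, alpha):
    """Minimizer of 0.5*(b - b_ols)**2 + alpha*abs(b): soft thresholding."""
    magnitude = max(abs(b_ols) - alpha, 0.0)
    return magnitude if b_ols >= 0 else -magnitude


def elastic_net_shrink(b_ols, alpha, l1_ratio):
    """Both penalties combined: soft-threshold first, then shrink proportionally."""
    l1_part = alpha * l1_ratio
    l2_part = alpha * (1.0 - l1_ratio)
    return lasso_shrink(b_ols, l1_part) / (1.0 + l2_part)


if __name__ == "__main__":
    for b in (2.0, 0.3, -2.0):
        print(b, ridge_shrink(b, 0.5), lasso_shrink(b, 0.5), elastic_net_shrink(b, 0.5, 0.5))
```

In this setting ridge never returns exactly zero for a non-zero input, while lasso sets small coefficients to exactly zero, which is the feature-selection behaviour described above.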
        
        Evaluation Metrics
        
        R-squared - R2
        
        Cross-Validation
        
        Mean Squared Error - MSE
        
        Linear regression
        
        Econometrics Approach
        
        Definition
        
        Branch of economics that develops and uses statistical methods for estimating economic relationships
        
        Goal
        
        Testing hypotheses
        
        Predicting / Forecasting random variables
        
        Estimating relationships between random variables
        
        Steps
        
        Collecting data
        
        Quantify the model
        
        Specifying the regression model
        
        Example - Multiple Regression Model - MRM
        
        Estimating the model
        
        Ordinary Least Squares - OLS
        
        Interpreting the model
        
        Bj - If xj increases by 1 unit, ceteris paribus, y changes by Bj units on average
        
        To interpret - assumptions
        
        Evaluation Metrics
        
        MAPE
        
        MSE
        
        MAE - Mean abs. error
        
        RMSE - Root mean sq. error
        
        Adjusted R2
        
        R2
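The metrics listed above can be written out in a few lines. This is a minimal sketch in plain Python; the function name `regression_metrics` and the `n_features` argument (used only for adjusted R2) are assumptions made for the example, and MAPE is undefined whenever an actual value is zero.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, MAE, MAPE (in percent), R2 and adjusted R2."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    r2 = 1.0 - ss_res / ss_tot
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        # Assumes no actual value equals zero.
        "mape": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
        "r2": r2,
        # Penalizes R2 for the number of features used.
        "adjusted_r2": 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1),
    }


if __name__ == "__main__":
    print(regression_metrics([1, 2, 3, 4], [1, 2, 3, 5], n_features=1))
```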
        
        Machine learning approach
        
        In general
        
        Important example of a parametric model
        
        Collection of labeled examples
        
        The model is a linear combination of features and parameterized by W and b
        
        We estimate the parameters by fitting the model to the training data
        
        Simple and interpretable approximation of the unknown true relationship
        
        Useful both conceptually and practically
        
        Rarely overfits when there are few features relative to the number of observations
        
        Optimization problem
        
        Defined as
        
        Loss function
        
        Squared error loss
        
        Objective function
        
        Quadratic loss function
        
        More convenient
        
        Closed form solution
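As an illustration of the closed-form solution, the sketch below fits the one-feature case y = w*x + b by minimizing the squared error loss directly. The function name `fit_simple_ols` is hypothetical; the multi-feature case would use the matrix form of the same idea.

```python
def fit_simple_ols(x, y):
    """Closed-form least squares for one feature: returns (w, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = sample covariance of (x, y) divided by sample variance of x.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    w = s_xy / s_xx
    b = mean_y - w * mean_x
    return w, b


if __name__ == "__main__":
    print(fit_simple_ols([0, 1, 2, 3], [1, 3, 5, 7]))
```

No iteration is needed here, which is the practical convenience of the quadratic loss noted above.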
        
        Polynomial regression
        
        ML approach
        
        Definition
        
        Special case of multiple linear regression models
        
        Create new variables and then treat as multiple linear regression
        
        Tuning the hyperparameter d - x^d
        
        Optimization problem
        
        We use the same loss function as in linear regression
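The "create new variables" step can be sketched as follows. The helper name `polynomial_features` is an assumption for this example; the resulting columns would then be passed to an ordinary multiple linear regression, with the degree d chosen by cross-validation.

```python
def polynomial_features(x, degree):
    """Expand each scalar x into the row [x, x**2, ..., x**degree]."""
    return [[xi ** d for d in range(1, degree + 1)] for xi in x]


if __name__ == "__main__":
    # Each row is one observation; each column is one new variable.
    print(polynomial_features([2, 3], degree=3))
```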
        
        KNN regression
        
        Definition
        
        Closely related to the KNN classifier
        
        Predicts continuous values by averaging the outputs of the nearest neighbours
        
        Characteristics
        
        K=1 - very flexible
        
        Low bias
        
        High variance
        
        As K grows - less flexible - regression fits get smoother
        
        High bias
        
        Low variance
        
        Choose the optimal K by cross-validation and RMSE
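A minimal sketch of KNN regression, assuming a single numeric feature and absolute difference as the distance. The function name `knn_regress` is hypothetical.

```python
def knn_regress(train_x, train_y, query, k):
    """Predict by averaging the targets of the k training points nearest to query."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / k


if __name__ == "__main__":
    xs = [1, 2, 3, 10]
    ys = [1.0, 2.0, 3.0, 10.0]
    print(knn_regress(xs, ys, query=2.2, k=1))  # follows the single closest point
    print(knn_regress(xs, ys, query=2.2, k=4))  # averages over all the data
```

With K equal to the number of training points, the prediction collapses to the overall mean, which is the high-bias, low-variance extreme described above.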
        
        Regression Trees
        
        See details in classification part - decision trees
    - - Types
        
        Decision Trees
        
        Ensemble Learning
        
        Methods
        
        Heterogeneous learners
        
        Average predictions
        
        Voting classifiers
        
        Homogeneous learners
        
        Bagging
        
        Boosting
        
        Goal - reduce bias and variance at the same time
        
        Definition
        
        Combining the predictions from a collection of models
        
        No single super-accurate model -> train many low-accuracy models, then combine their predictions
        
        Generally more accurate and more stable predictions
        
        Definitions
        
        Decision trees are ML algorithms that progressively divide data sets into smaller groups based on a descriptive feature, until they reach sets that are small enough to be described by some label
        
        Top-down approach - group and label similar observations
        
        Types
        
        Regression Trees
        
        Classification trees - here we'll talk about this
        
        Terminology
        
        ✓ Splitting
        
        ✓ Branch
        
        ✓ Decision node (internal node)
        
        ✓ Leaf node (terminal node)
        
        ✓ Sub-tree
        
        ✓ Depth (level)
        
        ✓ Pruning
        
        Root node
        
        DT Criteria
        
        Which split adds the most information gain?
        
        Regression tree
        
        MSE
        
        Classification tree
        
        Error rate
        
        Entropy
        
        Gini Index
        
        In general
        
        The prediction of the algorithm at each terminal node will be the category with the majority of data points - the most commonly occurring class
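The classification-tree criteria and the majority-class rule can be sketched in plain Python. All function names here are assumptions for illustration; information gain is computed with entropy, for a binary split of a parent node into a left and a right child.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Share of each class among the labels in a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def error_rate(labels):
    """Misclassification rate when predicting the majority class."""
    return 1.0 - max(class_proportions(labels))


def gini(labels):
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted


def majority_class(labels):
    """Prediction made at a terminal node."""
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    node = ["a", "a", "b", "b"]
    print(gini(node), entropy(node), error_rate(node))
    print(information_gain(node, ["a", "a"], ["b", "b"]))
```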
        
        Linear Probability Model - LPM
        
        Example
        
        Y = 1 - default and 0 otherwise
        
        It does not work well when the data set is imbalanced
        
        Sigmoid function
        
        A monotone mapping function with range (0, 1) - the fit becomes a curve instead of a line
        
        Linear regression for binary classification problems
        
        Logistic Regression
        
        Method
        
        Instead of minimizing the average loss we maximize the likelihood of the training data according to our model. Maximum likelihood estimation
        
        Likelihood function describes the joint probability of the observed data as a function of the parameters of the model
        
        Objective function
        
        Instead of maximizing the likelihood function in practice we maximize the log-likelihood function
        
        There is no closed form solution to this
        
        Have to use gradient descent
        
        Probability threshold - we classify the observations based on this ( depends on the task)
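The pieces above (sigmoid, log-likelihood, iterative optimization, probability threshold) fit together roughly as follows. This is a sketch for a single feature, using plain gradient ascent on the log-likelihood, which is equivalent to gradient descent on its negative. The function names, learning rate and step count are assumptions, and the toy data are deliberately not perfectly separable so that the maximum likelihood estimate is finite.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(w, b, xs, ys):
    """Log of the joint probability of the observed labels under the model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit_logistic(xs, ys, learning_rate=0.1, n_steps=2000):
    """Maximize the log-likelihood by gradient ascent (no closed form exists)."""
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        residuals = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        w += learning_rate * sum(r * x for r, x in zip(residuals, xs))
        b += learning_rate * sum(residuals)
    return w, b


def classify(x, w, b, threshold=0.5):
    """Turn the predicted probability into a class using a task-specific threshold."""
    return 1 if sigmoid(w * x + b) >= threshold else 0


if __name__ == "__main__":
    xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
    ys = [0, 0, 1, 0, 1, 1]
    w, b = fit_logistic(xs, ys)
    print(w, b, log_likelihood(w, b, xs, ys))
    print(classify(2.0, w, b), classify(-2.0, w, b))
```

Raising the threshold makes the classifier more conservative about predicting the positive class, which is why the choice depends on the task.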
        
        K-Nearest Neighbours - KNN
        
        Definition
        
        One of the simplest and best known non-parametric supervised learning techniques, most often used for classification
        
        Classify a new observation by finding similarities (nearness) between it and its k nearest neighbours
        
        Contrary to other models - KNN keeps the training data after the model is built
        
        Hyperparameters
        
        Choice of the distance metric
        
        Value for K
        
        K = 1 - very flexible
        
        Low bias
        
        High variance
        
        As K grows - less flexible model
        
        Decision boundary gets close to linear
        
        High bias
        
        Low variance
        
        Analyst makes these decisions before running the algorithm
        
        Steps
        
        Choose the distance metric
        
        Minkowski
        
        Euclidean
        
        Manhattan
        
        Identify the K points in the training data that are closest to the new observations. Neighborhood is a "circle"
        
        Choose number of neighbours K (positive integer)
        
        Estimate the conditional probability for the classes
        
        Classify the observation with the class with the largest probability
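The steps above can be sketched as follows. The Minkowski distance covers both Manhattan (p = 1) and Euclidean (p = 2); the function names are assumptions for this example, and ties between equally distant points are broken by the order of the training data.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)


def knn_class_probabilities(train_points, train_labels, query, k, p=2):
    """Estimate class probabilities as the class shares among the k nearest points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: minkowski(pair[0], query, p),
    )
    votes = Counter(label for _, label in ranked[:k])
    return {label: count / k for label, count in votes.items()}


def knn_classify(train_points, train_labels, query, k, p=2):
    """Assign the class with the largest estimated probability."""
    probabilities = knn_class_probabilities(train_points, train_labels, query, k, p)
    return max(probabilities, key=probabilities.get)


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
    labels = ["a", "a", "a", "b", "b"]
    print(knn_class_probabilities(points, labels, (5, 5.4), k=3))
    print(knn_classify(points, labels, (5, 5.4), k=3))
```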
        
        Other terminology
        
        KNN decision boundary
        
        Creating grids where the new observation will be classified to that class
        
        Performance metric
        
        Error rate ( 1-Accuracy)
        
        If the data is imbalanced, use F1, precision or recall instead
        
        Curse of dimensionality
        
        Is a problem with the relationship between dimensionality and volume
        
        Sparsity of data occurs when moving to higher dimensions. The volume of the space represented grows so quickly that the data cannot keep up and thus becomes sparse
        
        Challenge because KNN is based on distance
        
        Pros and cons
        
        Pros
        
        Easy to implement for multi class problems
        
        Used both for classification and regression
        
        No assumptions - non parametric
        
        Few parameters / hyperparameters
        
        Intuitive and simple
        
        Cons
        
        Slow - memory based approach
        
        Curse of dimensionality
        
        Not good with multiple categorical features
        
        Choice of K
        
        No interpretation - none!
        
        Application in finance
        
        • Stock price prediction (buy/sell/hold)
        
        • Corporate bond credit rating assignments
        
        • Money laundering analysis
        
        • Bank customer profiling
        
        • Loan management
        
        • Customized equity and bond index creation
        
        Bankruptcy prediction
      - Evaluation Metrics
        
        Recall
        
        Ability to find all relevant instances in a dataset
        
        TP / (TP + FN)
        
        F1-Score
        
        Harmonic mean - punishes extreme values
        
        2 * P * R / (P + R)
        
        More statistically significant
        
        Useful for imbalanced datasets
        
        Precision
        
        Proportion of the data points our model flagged as relevant that were actually relevant
        
        TP / (TP + FP)
        
        When my model says relevant, how likely is it that the point is actually relevant
        
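        MCC - Matthews Correlation Coefficient
        
        Often regarded as one of the most informative single metrics for a binary classifier - uses all four cells of the confusion matrix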
        
        Range between -1 and 1
        
        1 - perfect prediction
        
        0 - no better than random prediction
        
        -1 - total misclassification
        
        Accuracy
        
        (TP + TN) / ALL
        
        How many times your model is successful
        
        Quite primitive
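The formulas above, computed from the four confusion-matrix counts. This is a minimal sketch; the function name is an assumption, and there is no guard against zero denominators (for example, when the model never predicts the positive class).

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, accuracy and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Imbalanced example: 10 positives, 90 negatives.
    print(classification_metrics(tp=8, fp=2, fn=2, tn=88))
```

In this imbalanced example accuracy is 0.96 while precision, recall and F1 are all 0.8, which illustrates why accuracy alone can look better than the classifier really is.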
        
        ROC
        
        Definition
        
        A graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
        
        Components
        
        True positive rate - TP / (TP + FN)
        
        False positive rate - FP / (FP + TN)
        
        Plot
        
        Each point on the ROC curve represents a TPR/FPR pair corresponding to a particular threshold
        
        Use cases
        
        Threshold selection
        
        Evaluating discriminatory power
        
        Model comparison
        
        Key characteristics
        
        Diagonal line - Base line
        
        Classifier with no discriminative ability
        
        Top left Corner
        
        Perfect classifier with TPR = 1; FPR = 0
        
        Shape of the curve
        
        The closer the ROC curve follows the top-left border, the better the classifier's performance
        
        Confusion Matrix
        
        Comparing the actual y with y hat (reality vs machine learning algorithm)
        
        False positive - Type I error
        
        False negative - Type II error
        
        True positive
        
        True negative
        
        AUC
        
        Definition
        
        Area under the ROC curve - overall ability of the classifier to discriminate between positive and negative classes
        
        Range
        
        0.5 - Equivalent to random guessing
        
        1.0 - perfect classification
        
        <0.5 Worse than random guessing
        
        Use cases
        
        Performance metric
        
        Model selection - aim for the highest AUC
        
        Threshold-free evaluation
        
        Key characteristics
        
        Interpretability - probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance
        
        Robustness - less affected by class distribution than other metrics
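A short sketch tying ROC and AUC together: one (FPR, TPR) point per threshold, the area under those points by the trapezoid rule, and the same number obtained from the ranking interpretation. Function names and the toy scores are assumptions for illustration; labels are 1 for positive and 0 for negative.

```python
def roc_points(scores, labels):
    """One (FPR, TPR) pair per distinct threshold, starting from (0, 0)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def auc_rank(scores, labels):
    """Share of (positive, negative) pairs where the positive scores higher."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
    labels = [1, 1, 0, 1, 0, 0]
    print(roc_points(scores, labels))
    print(auc_trapezoid(roc_points(scores, labels)), auc_rank(scores, labels))
```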
        
        Cross-Validation
        
        Definition
        
        Resampling procedure to evaluate ML models by training multiple models on different subsets of the available data and assessing their performance on complementary subsets
        
        Types
        
        K-Fold Cross validation
        
        Process
        
        Divide the dataset into K equally sized folds
        
        Train the model on k-1 folds and validate it on the remaining fold
        
        Repeat the process K times, each time with a different validation fold
        
        Common choices
        
        K = 5
        
        K = 10
        
        Advantages
        
        Balances bias and variance; utilizes all data for both training and validation
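The K-fold process can be sketched without any library. The helper names `k_fold_splits` and `cross_validate`, and the `fit` / `score` callables, are assumptions for this example. The folds are taken in index order, so the data are assumed to be shuffled beforehand where that matters.

```python
def k_fold_splits(n_samples, k):
    """Return k (training_indices, validation_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        splits.append((training, validation))
        start += size
    return splits


def cross_validate(xs, ys, k, fit, score):
    """Average the validation score over the k folds."""
    scores = []
    for training, validation in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in training], [ys[i] for i in training])
        scores.append(
            score(model, [xs[i] for i in validation], [ys[i] for i in validation])
        )
    return sum(scores) / k


if __name__ == "__main__":
    for training, validation in k_fold_splits(10, 5):
        print(training, validation)
```

Here `fit` takes the training features and targets and returns a model, and `score` takes that model plus the validation features and targets and returns a number such as MSE.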
        
        Leave-one-out Cross Validation - LOOCV
        
        Leave-P-Out Cross Validation
        
        Repeated Cross Validation
        
        Time-series Cross Validation
        
        Use Cases
        
        Model evaluation
        
        Hyperparameter tuning
        
        Model selection
        
        Prevent overfitting
        
        Advantages
        
        More reliable Performance estimates - averaging result through folds
        
        Efficient data usage
        
        Reduced overfitting risk
        
        Limitations
        
        Computationally intensive
        
        Potential for data leakage
        
        Not always necessary
      - In general
        
        Qualitative variables / categorical
        
        Ordinal
        
        Nominal
        
        Classification is the process of predicting categorical variables
        
        More common than regression
  - - - Types
        
        Hierarchical Clustering
        
        Agglomerative ( bottom up)
        
        Initially each point is a cluster
        
        Repeatedly combine the two nearest clusters into one
        
        Divisive ( top-down)
        
        Start with one cluster and recursively split it
        
        Partition-based
        
        K-Means clustering
        
        In general
        
        Assumes Euclidean space or distance
        
        Start by picking k, the number of clusters
        
        Getting the k right
        
        Initialize clusters by picking one point per cluster
        
        Example: pick one point at random, then k-1 other points, each as far away as possible
        
        Populating clusters
        
        1 - For each point, place it in the cluster whose current centroid is nearest
        
        2 - After all points are assigned, update the locations of the centroids of the k clusters
        
        3 - Reassign all points to their closest centroid
        
        Repeat 2 and 3 until convergence (points no longer move between clusters and the centroids stabilize)
        
        In plain English
        
        1. Assignment step: assign each observation to the cluster whose mean yields the least within-cluster sum of squares. Since the sum of squares is the squared Euclidean distance, this is intuitively the nearest mean.
        
        2. Update step: calculate the new means to be the centroids of the observations in the new clusters.
        
        The algorithm has converged when the assignments no longer change.
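The assignment and update steps can be sketched as follows. The function names are assumptions; the initial centroids are passed in explicitly rather than chosen by one of the initialization schemes above, and a cluster that ends up empty simply keeps its previous centroid.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and update steps until assignments stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(k_means(data, initial_centroids=[(0, 0), (10, 10)]))
```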
      - In general
        
        The problem of clustering
        
        Given a set of points with a notion of distance between points, group the points into some number of clusters so that
        
        Members of a cluster are close/similar to each other
        
        Members of different clusters are dissimilar
        
        Usually
        
        Points are in a high dimensional space
        
        Similarity is defined using a distance measure
        
        Cosine
        
        Cosine Distance is a metric used to measure the dissimilarity between two non-zero vectors in an inner product space. It quantifies the angle between the vectors rather than their magnitude, making it particularly useful in high-dimensional spaces where magnitude can be less informative
        
        Jaccard
        
        Jaccard Distance is a metric used to measure the dissimilarity between two sets. It is derived from the Jaccard Similarity Coefficient, which quantifies the similarity between finite sample sets.
        
        Good for binary clustering
        
        Euclidean
        
        Euclidean Distance is the straight-line distance between two points in Euclidean space. It is the most common distance metric used in various machine learning algorithms to measure similarity or dissimilarity between data points.
        
        Edit distance
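The four distance measures above, sketched in plain Python. The function names are assumptions. "Edit distance" is implemented here as the Levenshtein variant (insertions, deletions and substitutions each cost 1); other definitions allow only insertions and deletions, so treat that choice as an assumption too.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def jaccard_distance(set_a, set_b):
    """1 - |intersection| / |union| for two sets (not both empty)."""
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def edit_distance(s, t):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            substitution_cost = 0 if char_s == char_t else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(euclidean_distance((0, 0), (3, 4)))
    print(cosine_distance((1, 0), (0, 1)))
    print(jaccard_distance({1, 2, 3}, {2, 3, 4}))
    print(edit_distance("kitten", "sitting"))
```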
        
        Why is it difficult?
        
        Many applications involve 10 or 10,000 dimensions
        
        High-dimensional spaces look different - almost all pairs of points are at about the same distance
        
        Example - CDs
        
        Point assignment
        
        Maintain a set of clusters
        
        Points belong to nearest cluster
    - - Principal Component Analysis - PCA
  - - - Types
        
        Basic Neural Network Architectures
        
        Recurrent Neural Networks - RNNs
        
        Natural Language Processing - NLP
      - NN definition
        
        Neural Networks are computational models inspired by the human brain, consisting of interconnected layers of nodes (neurons) that process data by learning complex patterns and relationships.
      - Characteristics
        
        Layers: Composed of an input layer, one or more hidden layers, and an output layer.
        
        Activation Functions: Introduce non-linearity (e.g., ReLU, sigmoid) to enable the network to learn complex patterns.
        
        Weights and Biases: Parameters adjusted during training to minimize prediction errors.
        
        Flexibility: Capable of handling various data types, including images, text, and numerical data.
        
        Scalability: Can be scaled with more layers and neurons to improve performance on complex tasks.
      - Training process
        
        Forward Propagation: Input data passes through the network layers to produce an output.
        
        Loss Calculation: Compute the difference between the predicted output and the actual target using a loss function (e.g., Mean Squared Error, Cross-Entropy).
        
        Backpropagation: Calculate gradients of the loss with respect to each weight by propagating the error backward through the network.
        
        Weight Update: Adjust the weights and biases using optimization algorithms (e.g., Stochastic Gradient Descent, Adam) to minimize the loss.
        
        Iteration: Repeat the process over multiple epochs until the model converges to an optimal solution.
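The training steps can be sketched on the smallest possible "network": a single sigmoid neuron with one input, trained with mean squared error and plain full-batch gradient descent. Everything here (function name, zero initialization instead of random, learning rate, toy data) is an assumption chosen to keep the example short and reproducible. A real network repeats the same chain-rule step layer by layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_single_neuron(xs, ys, learning_rate=0.5, epochs=1000):
    """Train output = sigmoid(w*x + b) with MSE loss; returns (w, b, loss_history)."""
    w, b = 0.0, 0.0  # fixed start for reproducibility (usually random)
    n = len(xs)
    loss_history = []
    for _ in range(epochs):
        # Forward propagation: inputs -> outputs.
        outputs = [sigmoid(w * x + b) for x in xs]
        # Loss calculation: mean squared error.
        loss_history.append(sum((o - y) ** 2 for o, y in zip(outputs, ys)) / n)
        # Backpropagation: chain rule from the loss back to w and b.
        grad_w, grad_b = 0.0, 0.0
        for x, y, o in zip(xs, ys, outputs):
            d_loss_d_out = 2.0 * (o - y) / n
            d_out_d_z = o * (1.0 - o)  # derivative of the sigmoid
            grad_w += d_loss_d_out * d_out_d_z * x
            grad_b += d_loss_d_out * d_out_d_z
        # Weight update: step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, loss_history


if __name__ == "__main__":
    w, b, history = train_single_neuron([0.0, 1.0, 2.0, 3.0], [0, 0, 1, 1])
    print(w, b, history[0], history[-1])
```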
      - Use cases in Finance
        
        Fraud Detection: Identifying unusual transaction patterns to detect fraudulent activities.
        
        Credit Scoring: Assessing the creditworthiness of individuals by analyzing financial history and behavior.
        
        Algorithmic Trading: Developing models that predict stock price movements and execute trades automatically.
        
        Risk Management: Evaluating and mitigating financial risks by forecasting market trends and potential losses.
        
        Customer Segmentation: Grouping customers based on financial behaviors for targeted marketing and personalized services.
      - Activation function
        
        Problem
        
        binary output makes network less expressive
        
        Zero gradient everywhere means backpropagation won't work
        
        Solution
        
        Types
        
        tanh
        
        Sigmoid
        
        ReLU
        
        These continuous activation functions are used in multi-layer neural networks
        
        Their different shapes result in different characteristics, suitable for different tasks
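To see the problem and the solution side by side, the sketch below defines a binary step activation next to the three continuous ones and estimates their slopes numerically. The function names and the finite-difference helper are assumptions for illustration.

```python
import math


def step(z):
    """Binary activation: flat on both sides of zero."""
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    return math.tanh(z)


def relu(z):
    return max(0.0, z)


def numerical_gradient(f, z, eps=1e-6):
    """Central finite-difference estimate of the slope of f at z."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)


if __name__ == "__main__":
    for name, f in [("step", step), ("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu)]:
        print(name, f(1.0), numerical_gradient(f, 1.0))
```

Away from zero the step function has slope 0, so there is no gradient signal to propagate backward, whereas the continuous functions have a usable slope there.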
      - Deep networks
        
        “Deep learning” concerns neural network models with many hidden layers, used in both unsupervised and supervised learning contexts
        
        It is difficult to train deep models effectively, so special techniques have been developed for this class of models.
        
        Specialised software packages (e.g. TensorFlow, Keras) and computer hardware are available
        
        Unstructured data - Deep networks are especially well suited to working with unstructured input data, which doesn’t have easily definable informative features.
        
        Text
        
        Speech
        
        Images
        
        Music
- - - - Consist of
        
        y - response, dependent variables, output, Target
        
        estimated from the data automatically
        
        x - predictors, independent variables, input, Features
        
        set manually prior to training
        
        Theta - estimates, specifications, Parameters
      - Estimation purpose
        
        Inference
        
        Prediction
      - Types
        
        Parametric
        
        Advantage
        
        Faster
        
        Less data
        
        Simpler
        
        Disadvantage
        
        Limited complexity
        
        Non parametric
    - - Partitioning of the dataset
        
        Validation set
        
        to validate the model
        
        Test set
        
        To test the model's ability to predict well on new data - generalize
        
        Training set
        
        To train the model
      - Generalization
        
        Large dataset needed
        
        If you don't have it
        
        Resample!
        
        Combine the training and validation sets and use cross validation
        
        K-fold Cross Validation
        
        Split the data into K equal parts, fit the model on K-1 folds and validate on the remaining one, repeat this K times and combine the results
    - - Variance
        
        The amount by which the model’s predictions fluctuate given different training data. High variance models (e.g., overfitted) may capture noise in the training set, leading to poor generalization.
      - Tradeoff
        
        Increasing model complexity typically reduces bias but raises variance; decreasing complexity does the opposite. The goal is to find a balance that minimizes total error on unseen data.
      - Bias
        
        The error introduced by approximating a real-world problem (which may be extremely complex) by a simplified model. High bias models (e.g., underfitted) may consistently miss relevant relationships, leading to underfitting.
      - MSE decomposition
        
        Model bias
        
        Irreducible error - can't be fixed with modeling
        
        Model variance
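Written out, the three components add up as follows. The notation is an assumption, since the notes do not fix one: the data are taken to follow y = f(x) + ε with noise variance σ², and f̂ is the fitted model, with expectations taken over training sets and noise.

```latex
\mathbb{E}\!\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Note that the bias enters squared, while the irreducible term does not depend on the model at all.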
      - Goal
        
        Minimize the sum of variance and squared bias
    - - In general
        
        Fitted algorithm does not generalize well to new data
        
        Fits training data too well
        
        Fits the noise in the training data (finds a pattern that does not exist)
        
        The algorithm memorized the data and did not learn from it
        
        Model is too complex
      - Mitigation
        
        Complexity reduction - regularization, feature selection
        
        Cross validation - estimate the performance in test set
        
        Collect more data - can reduce variance
    - - Definition
        
        A plot that shows the relationship between the amount of training data and the performance of the ML model
      - Purpose
        
        Diagnose whether a model has high bias, high variance or just right
    - - Terminology
        
        Learning - finding the model weights (parameter values)
        
        Cost function - tell us how good our model is at making predictions for a given set of parameters
        
        Own curve
        
        Own gradient
        
        Slope of these helps how to update the parameters to make the model more accurate
        
        If the cost function is continuous and differentiable
        
        Gradient Descent
        
        Iterative optimization algorithm for finding the minimum of a function
        
        Start at a random point -> take steps in the direction of the negative gradient
        
        Learning rate
        
        If alpha is too small, the gradient descent will be slow
        
        If alpha is too large, the gradient descent can even diverge
        
        Disadvantages
        
        Single batch - use the entire training set to update parameters
        
        Sensitive to the choice of the learning rate
        
        Slow for large datasets
        
        Stochastic GD
        
        Version of the algorithm that speeds up the computation by approximating the gradient using smaller batches(subsets) of the training data
        
        SGD itself has various upgrades
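The effect of the learning rate is easy to see on a one-dimensional example. The sketch below runs gradient descent on the assumed cost f(x) = (x - 3)^2, whose minimum is at x = 3. The function names and the three learning rates are chosen purely for illustration.

```python
def gradient_descent(gradient, start, learning_rate, n_steps):
    """Repeatedly step against the gradient; returns the full path of iterates."""
    x = start
    path = [x]
    for _ in range(n_steps):
        x = x - learning_rate * gradient(x)
        path.append(x)
    return path


def quadratic_gradient(x):
    """Gradient of the example cost f(x) = (x - 3)**2."""
    return 2.0 * (x - 3.0)


if __name__ == "__main__":
    for alpha in (0.01, 0.4, 1.1):
        final = gradient_descent(quadratic_gradient, start=0.0, learning_rate=alpha, n_steps=50)[-1]
        print(alpha, final)
```

With alpha = 0.01 the iterate is still far from 3 after 50 steps, with 0.4 it has converged, and with 1.1 every step overshoots by more than the previous error, so the iterates diverge.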
      - Process - Step by step
        
        Training
        
        Random initialization
        
        Prediction
        
        Prediction train, validation
        
        Model
        
        Model architecture
        
        Performance
        
        Evaluation
        
        MSE
        
        Accuracy
        
        Data
        
        Validation
        
        Test
        
        Training
        
        Feature scaling
        
        Critical step during the preprocessing of data before creating the machine learning model
        
        Essential for ML models that calculate distances between data points
        
        Could
        
        Avoid numerical overflow and speed up the algo
        
        Reduce dominant effects of specific variables
        
        Types
        
        Standardization - Z score
        
        Normalization
        
        Min-max scaler over [0, 1]
        
        Min-max scaler over [-1, 1]
        
        Mean normalization
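The scaling types listed above, sketched for a single feature column. Function names are assumptions; standardization here uses the population standard deviation, and all three functions assume the values are not all identical.

```python
import math


def standardize(values):
    """Z-score: subtract the mean, divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def min_max_scale(values, low=0.0, high=1.0):
    """Rescale linearly so the minimum maps to low and the maximum to high."""
    v_min, v_max = min(values), max(values)
    return [low + (v - v_min) * (high - low) / (v_max - v_min) for v in values]


def mean_normalize(values):
    """Subtract the mean, divide by the range."""
    mean = sum(values) / len(values)
    v_range = max(values) - min(values)
    return [(v - mean) / v_range for v in values]


if __name__ == "__main__":
    column = [1, 2, 3, 4, 5]
    print(standardize(column))
    print(min_max_scale(column))
    print(min_max_scale(column, low=-1.0, high=1.0))
    print(mean_normalize(column))
```

To avoid data leakage, the statistics (mean, standard deviation, minimum, maximum) would normally be computed on the training set only and then reused for the validation and test sets.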
        
        Optimization
        
        GD
        
        Updating hyperparameters of the model using validation set
        
        SGD
        
        Generalization
        
        Model comparison - test set
  - - - _
        
        Strength - Understand causal relationship & behaviour
        
        Interpretability - High
        
        Model choice - Parameter significance & in-sample goodness of fit
        
        Dimensions / scalability - Mostly low dimensional data
        
        Data type - Structured
        
        Data size - Any reasonable set
        
        Driver - Math, theory, hypothesis
        
        Focus - Hypothesis testing & interpretability
    - - Data size - Big data
      - Dimensions / scalability - High dimensional data
      - Driver - Fitting data
      - Model choice - Cross-validation of predictive accuracy on partitions of data
      - Interpretability - Low
      - Strength - Prediction (forecasting and nowcasting)
      - Focus - Predictive accuracy
      - Data type - Structured, unstructured, semi-structured