Machine Learning Mind-Map
(By: Sandarva Khanal)
Machine Learning Pipeline (Life Cycle)
Cross-Industry Standard Process for Data Mining (CRISP-DM) framework
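- CRISP-DM typically comprises six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.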
Supervised Learning
Classification Problem
Logistic Regression

- Predicts the probability of an outcome; this probability can in turn be used for classification, i.e., predicting categories.
- Offers more practicality and performance than many other types of supervised machine learning.
- The \(B_0\) and \(B_1\) coefficients are packaged inside an exponent resembling a linear function.
- This linear function in the exponent is known as the log-odds function.
\(Probability = \frac{1}{1 + e^{-(B_0 + B_1 x)}}\)
- If the probability is greater than 0.5, you predict the outcome as "true" (1); otherwise, you predict it as "false" (0).
- When the log-odds on the line is 0.0, the probability on the logistic curve is 0.5.
- Fitting the logistic curve to given training data is done by solving for the \(B_0\) and \(B_1\) coefficients using maximum likelihood estimation, instead of the least squares used in linear regression (see the sketch below).
- \(R^2\) metric indicates how well a given independent variable explains a dependent variable.
- The accuracy metric is horrendously misleading for classification problems. If a vendor, consultant, or data scientist ever tries to sell you a classification system on claims of accuracy, ask for a confusion matrix.
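A minimal sketch of these ideas in Python, assuming a single made-up feature and binary labels; scikit-learn's LogisticRegression stands in for the maximum-likelihood fitting step, and the sigmoid is evaluated explicitly to show the probability-then-threshold rule:

```python
# Logistic regression sketch on assumed toy data (one feature, binary outcome)
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])   # e.g., hours studied
y = np.array([0, 0, 0, 1, 1, 1])                            # fail (0) / pass (1)

# scikit-learn fits B0 (intercept_) and B1 (coef_) by maximum likelihood
model = LogisticRegression()
model.fit(x, y)
b0, b1 = model.intercept_[0], model.coef_[0][0]

# Probability = 1 / (1 + e^-(B0 + B1*x)); threshold at 0.5 to get the class
x_new = 3.5
prob = 1.0 / (1.0 + np.exp(-(b0 + b1 * x_new)))
print(f"P(y=1 | x={x_new}) = {prob:.3f} -> class {1 if prob > 0.5 else 0}")
```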
- Predicting a discrete class label
- E.g., predicting if it will rain or not, or classify email as 'spam' or 'not spam'.
Regression Problem
Linear Regression

- Models a linear relationship between variables.
\(y = mx + b\)
- Allows us to make predictions on new data points.
- Another benefit is that we can analyze variables for possible relationships and hypothesize that correlated variables may be causally related.
- We should not use linear regression to make predictions outside the range of the data we have.
- The error/residual is the numeric difference between the predicted y-values (from the line) and the actual y-values (from the data).
- The goal of this method is to determine the linear model that minimizes the sum of the squared errors between the observations in a dataset and those predicted by the model.
- To get to that “best fit”, we minimize the squares, or more specifically the sum of the squared residuals. The lower we can make that number, the better the fit.
- Visually, think of it as overlaying a square on each residual, where each side has the length of the residual; we sum the areas of all these squares.
\(SSE = \sum_i (y_i - y'_i)^2\)
- We have to square the residuals before summing them because:
- Just adding them up without squaring will not work, since the negatives will cancel out the positives;
- Adding the absolute values is mathematically inconvenient: absolute values do not work well with the derivatives we will use later for gradient descent.
- So we need to find the derivatives of our sum-of-squares function with respect to \(m\) and \(b\) (see the sketch below).
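A minimal gradient-descent sketch of this fit, assuming made-up data points; the partial derivatives used are \(\partial SSE/\partial m = -2\sum x_i (y_i - (m x_i + b))\) and \(\partial SSE/\partial b = -2\sum (y_i - (m x_i + b))\), and the learning rate and iteration count are arbitrary:

```python
# Fitting y = m*x + b by gradient descent on the sum of squared residuals
# (assumed toy data; learning rate and iteration count chosen arbitrarily)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

m, b = 0.0, 0.0               # start from an arbitrary line
learning_rate = 0.005

for _ in range(20_000):
    residuals = y - (m * x + b)
    dm = -2 * np.sum(x * residuals)   # d(SSE)/dm
    db = -2 * np.sum(residuals)       # d(SSE)/db
    m -= learning_rate * dm
    b -= learning_rate * db

sse = np.sum((y - (m * x + b)) ** 2)
print(f"m = {m:.3f}, b = {b:.3f}, SSE = {sse:.4f}")
```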
- Predictions are often made for continuous quantities, such as amounts and sizes.
- E.g., predicting a house value, or predicting the amount of snow/ rain.
Decision Trees
INTRODUCTION:
- It is a simple non-parametric supervised learning method used for classification and regression tasks.
- The model is represented in the form of a tree structure with decision and leaf nodes
- Creates a model that predicts the value of a target variable by learning decision rules
- The tree can be translated into a rule set, if-then statements.
- It breaks down datasets into smaller subsets based on feature splits, chosen using a criterion such as Gini impurity or Information Gain, until leaf nodes representing final predictions or decisions are reached.
- Graphically, it is represented by an inverted tree, root at the top and leaves at the bottom.
IMPLEMENTATION:
- Uses a divide-and-conquer (recursive partitioning) approach
- The goal of the partitioning is to increase the homogeneity of each partition with respect to the target variable
- The partitioning continues until:
- No more variables,
- No more observations
- Further partitioning doesn’t improve the outcome (homogeneity)
- At each splitting, we make a decision as to the best variable and split point
- The decision follows a greedy approach (see the sketch after this list):
- We look only at the current step (even when the selection is not optimal globally)
- Several algorithms
- ID3, C4.5 (Information gain)
- CART (Gini Index)
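As a rough illustration of the greedy, CART-style split selection, a minimal sketch that scores candidate split points of a single numeric feature by weighted Gini impurity (assumed toy data with binary labels; a real tree repeats this recursively over all features):

```python
# Greedy split search on one numeric feature using Gini impurity (CART-style)
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Return the split point minimizing the weighted Gini impurity of the two partitions."""
    best_point, best_score = None, float("inf")
    for point in np.unique(x):
        left, right = y[x <= point], y[x > point]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_point, best_score = point, score
    return best_point, best_score

x = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))   # expect a split at x <= 4.0 with weighted impurity 0.0
```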
INFORMATION GAIN
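The standard definitions behind ID3/C4.5-style splitting are entropy and the information gain of splitting a set \(S\) on attribute \(A\):
\(Entropy(S) = -\sum_i p_i \log_2 p_i\)
\(IG(S, A) = Entropy(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)\)
where \(p_i\) is the proportion of class \(i\) in \(S\) and \(S_v\) is the subset of \(S\) for which \(A\) takes value \(v\).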
OVERFITTING
- Overfitting happens when the tree grows too many branches, reflecting every detail and outlier in the training set
- An overfitted tree doesn't generalize well, and hence gives poor accuracy when classifying unseen examples
- Two common approaches to avoid overfitting:
- Pre-pruning – stop building the tree early (before it is fully grown). Typical stopping conditions include stopping if all instances belong to the same class, or if all variable values are the same.
- Post-pruning – remove branches in a bottom-up fashion after the tree is fully built. If the generalization error improves after trimming, replace the sub-tree with a leaf node whose class label is the majority class of the instances in that sub-tree (see the sketch below).
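A minimal scikit-learn sketch of both ideas, assuming the bundled iris dataset and arbitrary parameter values: depth and leaf-size limits act as pre-pruning, while cost-complexity pruning (ccp_alpha) trims a fully grown tree afterwards:

```python
# Pre-pruning (max_depth, min_samples_leaf) vs. cost-complexity post-pruning (ccp_alpha)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing the tree early
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow fully, then trim sub-trees whose removal barely hurts the fit
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0)
post_pruned.fit(X_train, y_train)

print("pre-pruned accuracy :", pre_pruned.score(X_test, y_test))
print("post-pruned accuracy:", post_pruned.score(X_test, y_test))
```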
PROS AND CONS
- Pros
- Computationally cheap to build
- Easy for humans to understand
- Can handle missing values
- Can handle irrelevant features
- Cons
- Prone to overfitting
- Biased trees if some classes dominate
- Unstable due to variations in the data
- Low accuracy
Naive Bayes
Introduction
- Bayesian classification is used to predict the probability of a class membership.
- Despite the simplicity of the Naïve Bayes assumption of independence among explanatory variables, these classifiers have been found to perform well with a relatively small amount of training data.
- Naïve Bayes classifiers can use continuous and categorical independent variables.
- Examples of the use of Bayesian classification, among many others, include those in spam detection and sentiment analysis.
- Naïve Bayes is one of the simplest supervised ML algorithms for classifying data
- The term naïve is indicative of the fact that the algorithm assumes that given the target variable, the input features are conditionally independent, which may or may not hold for the given problem
- Bayes' rule is a mathematical relation between the prior probability of a hypothesis \(H\) and the posterior probability of the hypothesis conditional on evidence \(E\).
- It can be derived from conditional probability as:
\(P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}\)
- It is a rule for computing the inverse probability, i.e., obtaining \(P(H \mid E)\) from \(P(E \mid H)\)
Bayes Rule in Supervised Learning
- In supervised learning we have labeled training and test data
- The input features \(X\) are considered the evidence and the labels \(Y\) the outcomes
- Using the training data, we calculate the conditional probability of the evidence given the outcome, \(P(X \mid Y)\)
- The goal is to estimate the probability of the outcome given the evidence, \(P(Y \mid X)\)
- From training data, we compute \(P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)}\)
- From test data, we compute \(P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}\)
- For two-outcome problems, we calculate the probability of each outcome and predict the outcome with the highest probability
- To extend Bayes Rule with multiple features, we need to make an independence assumption, which is a simplification.
- The simplification assumes that given the outcome, the input features are independent, which is rarely the case in real world problems, hence the term “Naïve”
- For \(n\) input variables \((X_1, X_2, \ldots, X_n)\), Naïve Bayes can be formulated as:
\(P(Y \mid X_1, \ldots, X_n) = \frac{P(X_1 \mid Y)\, P(X_2 \mid Y) \cdots P(X_n \mid Y)\, P(Y)}{P(X_1)\, P(X_2) \cdots P(X_n)}\)
- In this formulation:
- \(P(Y \mid X_1, \ldots, X_n)\) is the posterior probability
- \(P(X_1 \mid Y)\, P(X_2 \mid Y) \cdots P(X_n \mid Y)\) is the likelihood of the evidence
- \(P(Y)\) is the prior probability
- \(P(X_1)\, P(X_2) \cdots P(X_n)\) is the probability of the evidence
Advantages of a Naive Bayes Classifier:
- Very simple and easy to implement
- Needs less training data
- Handles both continuous and discrete data
- Highly scalable with number of predictors and data points
- It is fast, so it can be used in real-time prediction
- Not sensitive to irrelevant features
Example Model of a Naive Bayes Classifier
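A minimal Gaussian Naïve Bayes sketch with scikit-learn, assuming made-up numeric features for a spam-style example (the data and feature meanings are illustrative, not from the mind-map):

```python
# Gaussian Naive Bayes on assumed toy data (two features, binary spam label)
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[120, 0], [80, 1], [200, 5], [150, 7], [90, 0], [180, 6]])  # e.g., word count, link count
y = np.array([0, 0, 1, 1, 0, 1])                                          # 1 = spam

# Each P(X_i | Y) is modeled as a Gaussian; P(Y) comes from class frequencies
model = GaussianNB()
model.fit(X, y)

# Posterior P(Y | X) for a new message; predict the class with the highest posterior
new_message = np.array([[160, 4]])
print(model.predict_proba(new_message))   # [P(not spam), P(spam)]
print(model.predict(new_message))         # most likely class
```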
- Machine is trained using historical data
- Useful when datasets are labeled (i.e., output is known or data is tagged with correct answer).
- Uses a self-correcting feedback loop.
Semi-Supervised Learning
- Some examples include a supervision target (label), but others do not
- Machine is trained using historical data
- Useful when datasets are not fully labeled (all outputs are not known).
- combines elements of supervised and unsupervised learning
Reinforcement Learning
- The model rewards right decisions and penalizes wrong ones so that future mistakes are avoided.
- Model learns in an interactive environment by trial and error using feedback from its own actions and experience.
- Reinforcement learning is applied to sequential decision processes, where the feedback results from a series of actions (see the sketch below).
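A minimal trial-and-error sketch in the spirit of tabular Q-learning, assuming a tiny made-up chain environment; the reward, learning rate, discount factor, and exploration rate are all arbitrary:

```python
# Tabular Q-learning on a tiny assumed chain environment
# States 0..3 in a line; action 0 = move left, action 1 = move right.
# Reaching state 3 yields reward +1 and ends the episode.
import numpy as np

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != 3:
        # Trial and error: explore with probability epsilon, otherwise exploit current knowledge
        action = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[state]))
        next_state = max(state - 1, 0) if action == 0 else min(state + 1, 3)
        reward = 1.0 if next_state == 3 else 0.0
        # Feedback from the environment updates the value of the chosen action
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print(Q)   # the "move right" column should dominate in every state
```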
Unsupervised Learning
- Machine is trained using historical data
- Useful when datasets are unlabeled.
- Without explicit instructions/ guidance, model attempts to find structure in the data.
Long short-term memory (LSTM)
for predicting time series, or forecasting