Machine Learning Mind-Map
(By: Sandarva Khanal)
Machine Learning Pipeline (Life Cycle)
Cross-Industry Standard Process for Data Mining (CRISP-DM) framework
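- CRISP-DM typically comprises six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.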
Supervised Learning
Classification Problem
Logistic Regression

- Predicts the probability of an outcome; this probability can in turn be used for classification, i.e., predicting categories.
- Offers more practicality and performance than many other types of supervised machine learning.
- The \(B_0\) and \(B_1\) coefficients are packaged inside an exponent resembling a linear function.
- This linear function in the exponent is known as the log-odds function.
\(Probability = \frac{1}{1 + e^{-(B_0 + B_1 x)}}\)
- If the probability is greater than 0.5, you predict the outcome as "true" (1); otherwise, you predict it as "false" (0).
- When the log-odds on the line is 0.0, the probability on the logistic curve is 0.5.
- Fitting the logistic curve to given training data is done by solving for the \(B_0\) and \(B_1\) coefficients using maximum likelihood estimation, instead of the least squares used in linear regression (see the sketch below).
- \(R^2\) metric indicates how well a given independent variable explains a dependent variable.
- The accuracy metric is horrendously misleading for classification problems. If a vendor, consultant, or data scientist ever tries to sell you a classification system on claims of accuracy, ask for a confusion matrix.
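A minimal sketch of these ideas in Python, assuming a single made-up feature and binary labels; scikit-learn's LogisticRegression stands in for the maximum-likelihood fitting step, and the sigmoid is evaluated explicitly to show the probability-then-threshold rule:

```python
# Logistic regression sketch on assumed toy data (one feature, binary outcome)
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])   # e.g., hours studied
y = np.array([0, 0, 0, 1, 1, 1])                            # fail (0) / pass (1)

# scikit-learn fits B0 (intercept_) and B1 (coef_) by maximum likelihood
model = LogisticRegression()
model.fit(x, y)
b0, b1 = model.intercept_[0], model.coef_[0][0]

# Probability = 1 / (1 + e^-(B0 + B1*x)); threshold at 0.5 to get the class
x_new = 3.5
prob = 1.0 / (1.0 + np.exp(-(b0 + b1 * x_new)))
print(f"P(y=1 | x={x_new}) = {prob:.3f} -> class {1 if prob > 0.5 else 0}")
```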
- Predicting a discrete class label
- E.g., predicting if it will rain or not, or classify email as 'spam' or 'not spam'.
Regression Problem
Linear Regression

- Models a linear relationship between variables.
\(y = mx + b\)
- Allows us to make predictions on new data points.
- Another benefit is that we can analyze variables for possible relationships and hypothesize that correlated variables may be causally related.
- We should not use linear regression to make predictions outside the range of the data we have.
- The error/residual is the numeric difference between the predicted y-values (from the line) and the actual y-values (from the data).
- The goal of this method is to determine the linear model that minimizes the sum of the squared errors between the observations in a dataset and those predicted by the model.
- To get to that “best fit”, we minimize the squares, or more specifically the sum of the squared residuals. The lower we can make that number, the better the fit.
- Visually, think of it as overlaying a square on each residual, where each side has the length of the residual; we sum the areas of all these squares.
\(SSE = \sum_i (y_i - y'_i)^2\)
- We have to square the residuals before summing them because:
- Just adding them up without squaring will not work, since the negatives will cancel out the positives;
- Adding the absolute values is mathematically inconvenient: absolute values do not work well with the derivatives we will use later for gradient descent.
- So we need to find the derivatives of our sum-of-squares function with respect to \(m\) and \(b\) (see the sketch below).
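A minimal gradient-descent sketch of this fit, assuming made-up data points; the partial derivatives used are \(\partial SSE/\partial m = -2\sum x_i (y_i - (m x_i + b))\) and \(\partial SSE/\partial b = -2\sum (y_i - (m x_i + b))\), and the learning rate and iteration count are arbitrary:

```python
# Fitting y = m*x + b by gradient descent on the sum of squared residuals
# (assumed toy data; learning rate and iteration count chosen arbitrarily)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

m, b = 0.0, 0.0               # start from an arbitrary line
learning_rate = 0.005

for _ in range(20_000):
    residuals = y - (m * x + b)
    dm = -2 * np.sum(x * residuals)   # d(SSE)/dm
    db = -2 * np.sum(residuals)       # d(SSE)/db
    m -= learning_rate * dm
    b -= learning_rate * db

sse = np.sum((y - (m * x + b)) ** 2)
print(f"m = {m:.3f}, b = {b:.3f}, SSE = {sse:.4f}")
```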
- Predictions are often made for continuous quantities, such as amounts and sizes.
- E.g., predicting a house value, or predicting the amount of snow/ rain.
Decision Trees
INTRODUCTION:
- It is a simple non-parametric supervised learning method used for classification and regression tasks.
- The model is represented in the form of a tree structure with decision and leaf nodes
- Creates a model that predicts the value of a target variable by learning decision rules
- The tree can be translated into a rule set, if-then statements.
- It breaks down datasets into smaller subsets based on feature splits, chosen using a criterion such as Gini impurity or Information Gain, until leaf nodes representing final predictions or decisions are reached.
- Graphically, it is represented by an inverted tree, root at the top and leaves at the bottom.
IMPLEMENTATION:
- Uses a divide-and-conquer (recursive partitioning) approach
- The goal of the partitioning is to increase the homogeneity of each partition with respect to the target variable
- The partitioning continues until:
- No more variables,
- No more observations
- Further partitioning doesn’t improve the outcome (homogeneity)
- At each splitting, we make a decision as to the best variable and split point
- The decision follows a greedy approach (see the sketch after this list):
- We look only at the current step (even when the selection is not optimal globally)
- Several algorithms
- ID3, C4.5 (Information gain)
- CART (Gini Index)
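As a rough illustration of the greedy, CART-style split selection, a minimal sketch that scores candidate split points of a single numeric feature by weighted Gini impurity (assumed toy data with binary labels; a real tree repeats this recursively over all features):

```python
# Greedy split search on one numeric feature using Gini impurity (CART-style)
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Return the split point minimizing the weighted Gini impurity of the two partitions."""
    best_point, best_score = None, float("inf")
    for point in np.unique(x):
        left, right = y[x <= point], y[x > point]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_point, best_score = point, score
    return best_point, best_score

x = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))   # expect a split at x <= 4.0 with weighted impurity 0.0
```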
INFORMATION GAIN
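The standard definitions behind ID3/C4.5-style splitting are entropy and the information gain of splitting a set \(S\) on attribute \(A\):
\(Entropy(S) = -\sum_i p_i \log_2 p_i\)
\(IG(S, A) = Entropy(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)\)
where \(p_i\) is the proportion of class \(i\) in \(S\) and \(S_v\) is the subset of \(S\) for which \(A\) takes value \(v\).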
OVERFITTING
- Overfitting happens when the tree grows too many branches, reflecting every detail and outlier in the training set
- An overfitted tree doesn't generalize well, and hence gives poor accuracy when classifying unseen examples
- Two common approaches to avoid overfitting:
- Pre-pruning – stop building the tree early (before it is fully grown). Typical stopping conditions include stopping if all instances belong to the same class, or if all variable values are the same.
- Post-pruning – remove branches in a bottom-up fashion after the tree is fully built. If the generalization error improves after trimming, replace the sub-tree with a leaf node whose class label is the majority class of the instances in that sub-tree (see the sketch below).
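A minimal scikit-learn sketch of both ideas, assuming the bundled iris dataset and arbitrary parameter values: depth and leaf-size limits act as pre-pruning, while cost-complexity pruning (ccp_alpha) trims a fully grown tree afterwards:

```python
# Pre-pruning (max_depth, min_samples_leaf) vs. cost-complexity post-pruning (ccp_alpha)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing the tree early
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow fully, then trim sub-trees whose removal barely hurts the fit
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0)
post_pruned.fit(X_train, y_train)

print("pre-pruned accuracy :", pre_pruned.score(X_test, y_test))
print("post-pruned accuracy:", post_pruned.score(X_test, y_test))
```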
PROS AND CONS
- Pros
- Computationally cheap to build
- Easy for humans to understand
- Can handle missing values
- Can handle irrelevant features
- Cons
- Prone to overfitting
- Biased trees if some classes dominate
- Unstable due to variations in the data
- Low accuracy
Naive Bayes
Introduction
- Bayesian classification is used to predict the probability of a class membership.
- Despite the simplicity of the Naïve Bayes assumption of independence among explanatory variables, these classifiers have been found to perform well with a relatively small amount of training data.
- Naïve Bayes classifiers can use continuous and categorical independent variables.
- Examples of the use of Bayesian classification, among many others, include those in spam detection and sentiment analysis.
- Naïve Bayes is one of the simplest supervised ML algorithms for classifying data
- The term naïve is indicative of the fact that the algorithm assumes that given the target variable, the input features are conditionally independent, which may or may not hold for the given problem
- Bayes' rule is a mathematical relation between the prior probability of a hypothesis \(H\) and the posterior probability of the hypothesis conditional on evidence \(E\).
- It can be derived from conditional probability as:
\(P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}\)
- It is a rule for computing the inverse probability, i.e., obtaining \(P(H \mid E)\) from \(P(E \mid H)\)
Bayes Rule in Supervised Learning
- In supervised learning we have labeled training and test data
- The input features \(X\) are considered the evidence and the labels \(Y\) the outcomes
- Using the training data, we calculate the conditional probability of the evidence given the outcome, \(P(X \mid Y)\)
- The goal is to estimate the probability of the outcome given the evidence, \(P(Y \mid X)\)
- From training data, we compute \(P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)}\)
- From test data, we compute \(P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}\)
- For two-outcome problems, we calculate the probability of each outcome and predict the outcome with the highest probability
- To extend Bayes Rule with multiple features, we need to make an independence assumption, which is a simplification.
- The simplification assumes that given the outcome, the input features are independent, which is rarely the case in real world problems, hence the term “Naïve”
- For \(n\) input variables \((X_1, X_2, \ldots, X_n)\), Naïve Bayes can be formulated as:
\(P(Y \mid X_1, \ldots, X_n) = \frac{P(X_1 \mid Y)\, P(X_2 \mid Y) \cdots P(X_n \mid Y)\, P(Y)}{P(X_1)\, P(X_2) \cdots P(X_n)}\)
- In this formulation:
- \(P(Y \mid X_1, \ldots, X_n)\) is the posterior probability
- \(P(X_1 \mid Y)\, P(X_2 \mid Y) \cdots P(X_n \mid Y)\) is the likelihood of the evidence
- \(P(Y)\) is the prior probability
- \(P(X_1)\, P(X_2) \cdots P(X_n)\) is the probability of the evidence
Advantages of a Naive Bayes Classifier:
- Very simple and easy to implement
- Needs less training data
- Handles both continuous and discrete data
- Highly scalable with number of predictors and data points
- It is fast, so it can be used in real-time prediction
- Not sensitive to irrelevant features
Example Model of a Naive Bayes Classifier
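A minimal Gaussian Naïve Bayes sketch with scikit-learn, assuming made-up numeric features for a spam-style example (the data and feature meanings are illustrative, not from the mind-map):

```python
# Gaussian Naive Bayes on assumed toy data (two features, binary spam label)
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[120, 0], [80, 1], [200, 5], [150, 7], [90, 0], [180, 6]])  # e.g., word count, link count
y = np.array([0, 0, 1, 1, 0, 1])                                          # 1 = spam

# Each P(X_i | Y) is modeled as a Gaussian; P(Y) comes from class frequencies
model = GaussianNB()
model.fit(X, y)

# Posterior P(Y | X) for a new message; predict the class with the highest posterior
new_message = np.array([[160, 4]])
print(model.predict_proba(new_message))   # [P(not spam), P(spam)]
print(model.predict(new_message))         # most likely class
```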
- Machine is trained using historical data
- Useful when datasets are labeled (i.e., output is known or data is tagged with correct answer).
- Uses a self-correcting feedback loop.
Semi-Supervised Learning
- Some examples include a supervision target (label), but others do not
- Machine is trained using historical data
- Useful when datasets are not fully labeled (all outputs are not known).
- combines elements of supervised and unsupervised learning
Reinforcement Learning
- The model rewards right decisions and penalizes wrong ones so that future mistakes are avoided.
- Model learns in an interactive environment by trial and error using feedback from its own actions and experience.
- Reinforcement learning is applied to sequential decision processes, where the feedback results from a series of actions (see the sketch below).
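A minimal trial-and-error sketch in the spirit of tabular Q-learning, assuming a tiny made-up chain environment; the reward, learning rate, discount factor, and exploration rate are all arbitrary:

```python
# Tabular Q-learning on a tiny assumed chain environment
# States 0..3 in a line; action 0 = move left, action 1 = move right.
# Reaching state 3 yields reward +1 and ends the episode.
import numpy as np

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != 3:
        # Trial and error: explore with probability epsilon, otherwise exploit current knowledge
        action = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[state]))
        next_state = max(state - 1, 0) if action == 0 else min(state + 1, 3)
        reward = 1.0 if next_state == 3 else 0.0
        # Feedback from the environment updates the value of the chosen action
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print(Q)   # the "move right" column should dominate in every state
```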
Unsupervised Learning
- Machine is trained using historical data
- Useful when datasets are unlabeled.
- Without explicit instructions/ guidance, model attempts to find structure in the data.
Long short-term memory (LSTM)
for predicting time series, or forecasting