Practical Machine Learning in R
Part 1: Getting Started
Ch 1 What is Machine Learning
Machine Learning Techniques
Unsupervised Learning: help us discover new labels, or groupings, of the observations in our dataset
market basket: association rules
market segmentation: clustering
Algorithm: a set of steps that you follow when carrying out a process
Artificial Intelligence, Machine Learning, and Deep Learning
Supervised Learning: help us assign known labels to new observations
Reinforcement Learning: learn based on trial and error, similar to the way that a young child learns the rules of a home by being rewarded and punished.
Model Selection
Classification techniques: train models that allow us to predict membership in a category / Supervised
Regression: allow us to predict a numeric result / Supervised
Similarity learning: help us discover the ways that observations in our dataset resemble and differ from each other / Unsupervised or Supervised
Model Evaluation
Classification Errors
False positive errors / Type I errors <-> True positive
False positive rate (FPR) = FP / (FP + TN)
False negative error / Type II errors <-> True negative
False negative rate (FNR) = FN / (FN + TP)
Regression Errors
Residual value
residual sum of squares (RSS): the sum of the squared residuals, RSS = Σ(yᵢ − ŷᵢ)²; related to the variance of the errors rather than the standard deviation itself
Types of Errors (in the world of machine learning)
Bias
: When the model type that we choose is unable to fit our dataset well, the resulting error is bias
Variance
: the dataset that we use to train our machine learning model is not representative of the entire universe of possible data
Irreducible error, or noise
Underfitting
: high bias, low variance
Overfitting
: low bias, high variance
Partitioning Datasets
Test dataset
: assess the performance of model
Validation dataset
: help develop the model in an iterative process, adjusting the parameters of the model during each iteration
Holdout method: set aside portions of the original dataset for validation and testing purposes at the beginning of the model development process
Cross-Validation Methods: particularly useful for smaller datasets where it is undesirable to reserve a portion of the dataset for validation purposes
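A minimal sketch of the holdout method in base R, using the built-in iris data and an assumed 75/25 split:

```r
# Holdout method: set aside part of the data for testing before any modeling.
set.seed(1234)                                  # make the split reproducible
n <- nrow(iris)                                 # iris ships with base R
train_idx <- sample(n, size = floor(0.75 * n))  # 75% of row numbers, chosen at random
train_data <- iris[train_idx, ]                 # training partition
test_data  <- iris[-train_idx, ]                # holdout (test) partition
```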
Ch 2 Introduction to R and RStudio
R
Interpreted language
: the code that you write is stored in a document called a script, and this script is the code that is directly executed by the system processing the code <-> compiled language: code that a compiler translates into an executable form before it runs
Data types in R
logical: flags
(Default) numeric: decimal number
Cf. double: short for double-precision floating point
integer
character
factor: categorical values /
level
ordered factor
Vector
coercion: If you attempt to create a vector with varying data types, R will force them all to be the same data type
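A quick illustration of coercion with mixed-type vectors:

```r
# Mixing data types in a vector forces everything to a single common type.
c(1, 2, TRUE)        # logical coerced to numeric: 1 2 1
c(1, "a", TRUE)      # everything coerced to character: "1" "a" "TRUE"
class(c(1L, 2.5))    # integer coerced to numeric: "numeric"
```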
Ch 3 Managing Data
tidyverse
: facilitate the entire analytics process
readr
: importing data
tibble
: storing data
dplyr
: manipulating data
ggplot2
: visualizing data
tidyr
: transforming data
purrr
: functional programming
stringr
: manipulating strings
lubridate
: manipulating dates and times
Data Collection
: process of identifying and acquiring the data needed for the machine learning process
Key Considerations
Collecting Ground Truth Data: can come with an existing label based on a prior event, such as whether a bank customer defaulted on a loan or not, or can require that a label be assigned to it by a domain expert
Data Relevance
Quantity of Data
: Understanding the strengths and weaknesses of each approach provides us with the guidance needed to determine how much data is enough for the learning task.
Ethics
Importing the Data
Reading Comma-Delimited Files
Reading Other Delimited Files
To read a tab-delimited (TSV) file, we use the read_tsv() function
To read a pipe-delimited file, we set delim = "|" in the read_delim() function
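A sketch of the three readr import patterns; the file names are placeholders:

```r
library(readr)

customers <- read_csv("customers.csv")              # comma-delimited
survey    <- read_tsv("survey.tsv")                 # tab-delimited
orders    <- read_delim("orders.txt", delim = "|")  # pipe-delimited
```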
Data Exploration: we often need to describe the characteristics of the data with the use of statistical summaries and visualizations
Describing the Data
instance: a row of data. We will sometimes refer to instances as records, examples, or observations.
feature: a column of data. We will sometimes refer to features as columns or variables.
discrete feature
continuous feature
independent variables: the features that describe our data
dependent variable: the feature that represents the label
For classification problems, the dependent variable is also referred to as the class, and for regression problems, it is referred to as the response.
Dimensionality
Data sparsity and density
Resolution
Descriptive statistics or summary statistics
The frequency of a feature value tells us how often the value occurs, typically used to describe categorical data
The mode of the feature tells us which value occurs the most for that feature, typically used to describe categorical data
mean and median, for continuous data
The median of a set of values is sometimes preferred over the mean because it is not impacted as much by a small proportion of extremely large or small values
summary(): shows only the top six feature values in terms of count <-> table()
arrange: sorting rows
mutate: modifying variables
select: feature/column
filter: observation/row
summarize
pipe: control the logical flow of our code
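A minimal sketch chaining the dplyr verbs with the pipe on the built-in mtcars data; the specific columns chosen are just for illustration:

```r
library(dplyr)

mtcars %>%
  filter(cyl == 4) %>%              # keep rows (observations) with 4 cylinders
  select(mpg, wt) %>%               # keep only these columns (features)
  mutate(wt_kg = wt * 453.6) %>%    # wt is in 1000 lb; add a kilograms version
  arrange(desc(mpg)) %>%            # sort rows by mpg, highest first
  summarize(avg_mpg = mean(mpg))    # collapse to a single summary row
```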
Visualizing the Data
Comparison
box plots
grammar of graphics
Relationship
scatter plots
Distribution
histogram
Composition
stacked bar charts
pie charts
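Sketches of one chart per purpose using ggplot2's grammar of graphics and its built-in mpg dataset:

```r
library(ggplot2)

ggplot(mpg, aes(x = class, y = hwy)) + geom_boxplot()    # comparison: box plots
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()      # relationship: scatter plot
ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 20)    # distribution: histogram
ggplot(mpg, aes(x = class, fill = drv)) + geom_bar()     # composition: stacked bar chart
```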
Data Preparation
Cleaning the Data
Missing value
Imputation: the use of a systematic approach to fill in missing data using the most probable substitute values
Random imputation
Match-based imputation
hot-deck imputation
cold-deck imputation
Distribution-based imputation
Predictive imputation
Mean or median imputation
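A minimal sketch of median imputation; the income column and its values are hypothetical:

```r
library(dplyr)

df <- tibble::tibble(income = c(52000, NA, 61000, 47000, NA, 58000))

df <- df %>%
  mutate(income = ifelse(is.na(income),
                         median(income, na.rm = TRUE),  # substitute the median for NA
                         income))
```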
Noise: the random component of measurement error
Smoothing with bin means
: sorting and grouping the data into a defined number of bins and replacing each value within a bin with the mean value for the bin
Smoothing with bin boundaries
: replace the values by either one of the bin boundaries based on proximity
Smoothing by Clustering
Smoothing by regression
: to use a fitted regression line as a substitute for the original data
Outliers
Class Imbalance
minority class
majority class
accuracy paradox
Transforming the Data
standardization or normalization
Decimal scaling
: moving the position of the decimal point on a set of values
z-score, or zero mean normalization: the approach results in normalized values that have a mean of 0 and a standard deviation of 1
min-max normalization
: transform the original data from the measured units to a new interval defined by user-specified lower and upper bound
Log Transformation
: For skewed distributions and data with values that range over several orders of magnitude, the log transformation is usually more suitable. With log transformation, we replace the values of the original data with their logarithms.
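Sketches of z-score, min-max, and log transformations on a hypothetical skewed vector x:

```r
x <- c(12000, 24000, 30500, 41000, 98000)     # hypothetical values

z_score <- (x - mean(x)) / sd(x)              # zero mean normalization: mean 0, sd 1
min_max <- (x - min(x)) / (max(x) - min(x))   # min-max normalization onto [0, 1]
logged  <- log(x)                             # log transformation for skewed data
```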
Discretization: treating continuous features as if they are categorical
dichotomization: discretize continuous features into binary values by coding them in terms of how they compare to a reference cutoff value
Dummy Coding: use of dichotomous (binary) numeric values to represent categorical features
full dummy coding, or one-hot encoding
baseline: The choice of which value to use as the baseline is often arbitrary or dependent on the question that a user is trying to answer.
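Dummy coding sketched with base R's model.matrix(); the color factor is hypothetical:

```r
color <- factor(c("red", "green", "blue", "green"))

model.matrix(~ color)       # dummy coding: the baseline level ("blue") is dropped
model.matrix(~ color - 1)   # full dummy coding / one-hot encoding: one column per level
```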
Reducing the Data
sampling
: the process of selecting a subset of the rows in the dataset as a proxy for the whole <-> original dataset: the population
simple random sampling
random sampling without replacement
sample set vector: a list of integer values that represent the row numbers in the original dataset
random sampling with replacement
stratified random sampling: ensures that the distribution of feature values within the sample matches the distribution of values for the same feature in the overall population
strata
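Sketches of the three sampling approaches; the dataset size, sampling proportion, and use of dplyr's slice_sample() for stratification are assumptions:

```r
library(dplyr)
set.seed(1234)

sample(500, size = 50)                  # simple random sampling without replacement
sample(500, size = 50, replace = TRUE)  # random sampling with replacement

iris %>%                                # stratified random sampling: sample within each stratum
  group_by(Species) %>%
  slice_sample(prop = 0.2) %>%
  ungroup()
```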
Dimensionality Reduction: the reduction in the number of features (dimensions) of a dataset prior to training a model
feature selection, or variable subset selection
feature extraction, or feature projection
principal component analysis (PCA)
non-negative matrix factorization (NMF)
Part 2: Regression
Ch4 Linear Regression
Relationships between variables
Correlation
: provides a single numeric value of the relationship between the variables, which is known as the correlation coefficient
Pearson’s correlation coefficient: ranges from -1 to +1, with larger absolute values indicating a strong relationship between variables and smaller absolute values indicating a weak relationship. Denoted as rho (ρ).
View absolute coefficient values of 0 to 0.3 as nonexistent to weak, above 0.3 to 0.5 as moderate, and above 0.5 as strong
The standard deviation of a variable is a measurement of the amount of variability present. Denoted as sigma (σ).
The covariance between two variables measures their joint variability.
The correlation between two variables is a normalized version of covariance.
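A quick numeric check of that relationship on the built-in mtcars data:

```r
x <- mtcars$wt
y <- mtcars$mpg

cov(x, y) / (sd(x) * sd(y))   # covariance scaled by the two standard deviations
cor(x, y)                     # Pearson's correlation coefficient: the same value
```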
Regression
Response Variable (Y)
Predictors (X)
Coefficient (Beta)
Simple Linear Regression: uses a single independent variable to predict the dependent variable
Ordinary Least Squares Method
residual sum of squares or sum of squared errors
Simple Linear Regression Model
Evaluating the Model
The difference between our predictions and the actual values is known as the error or residual.
Diagnostics
Residual Standard Error: a measure of lack of fit for a model
The degrees of freedom value provides the number of data points in our model that are variable
Multiple and Adjusted R-squared
: independent of the scale of Y and takes the form of a proportion with values ranging from 0 to 1
Also known as the coefficient of determination; it explains how well our model explains the values of the dependent variable.
F-statistic
: a statistical test of whether there exists a relationship between the predictor and the response variables.
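A sketch of fitting a simple linear regression and reading those diagnostics from summary(); the mtcars variables are only illustrative:

```r
fit <- lm(mpg ~ wt, data = mtcars)   # ordinary least squares fit

summary(fit)      # coefficients, residual standard error, R-squared, F-statistic
residuals(fit)    # residuals: actual values minus fitted values
```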
Multiple Linear Regression
The Multiple Linear Regression Model
Evaluating the Model
Residual Diagnostics
Zero Mean of Residuals
Normality of Residuals
Homoscedasticity of Residuals
the Breusch-Pagan statistical test
Residual plot
Residual Autocorrelation: the correlation of a variable with itself at different points in time
the Durbin-Watson (DW) test
Influential Point Analysis
Cook’s distance: measures the effect of removing an observation from a model.
Multicollinearity: a phenomenon that occurs when two or more predictor variables are highly correlated with each other
variance inflation factor (VIF): the measure of how much the variance of the estimated regression coefficient for that variable is inflated by the existence of correlation among the predictor variables in the model.
Tolerance: can be thought of as the percent of variance in predictor k that cannot be accounted for by the other predictors.
One approach is to drop one of the problematic variables from the model, while the other approach is to combine the collinear predictors into a single variable
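A sketch computing tolerance and VIF for one predictor by hand via an auxiliary regression (packages such as car also provide a vif() function); the mtcars predictors are only illustrative:

```r
aux  <- lm(wt ~ hp + disp, data = mtcars)   # regress predictor k on the other predictors
r2_k <- summary(aux)$r.squared

tolerance <- 1 - r2_k        # share of wt's variance not accounted for by the others
vif_k     <- 1 / tolerance   # variance inflation factor for wt
```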
Improving the Model
Considering Nonlinear Relationships
polynomial regression
Considering Categorical Variables
Considering Interactions Between Variables
Interaction effect: There are situations where two variables have a combined effect on the response.
Selecting the Important Variables
forward selection
: begin with the intercept and then create several simple linear regression models based on the intercept and each individual predictor
backward selection
: involves creating a model with all our predictors and then removing the predictor that is least statistically significant (based on the p-value)
mixed selection
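A sketch of forward, backward, and mixed selection with base R's step(); the mtcars models are placeholders for real candidate models:

```r
null_model <- lm(mpg ~ 1, data = mtcars)    # intercept only
full_model <- lm(mpg ~ ., data = mtcars)    # all predictors

step(null_model, scope = formula(full_model), direction = "forward")   # forward selection
step(full_model, direction = "backward")                               # backward selection
step(null_model, scope = formula(full_model), direction = "both")      # mixed selection
```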
Strengths and Weaknesses
Ch5 Logistic Regression
Classification
logistic regression: models the probability of a particular response value
sigmoid curve
Maximum likelihood estimation (MLE): identify the values for β0 and β1 that best approximate the relationship between X and Y
odds or odds ratio: the likelihood (or probability) that the event will occur expressed as a proportion of the likelihood that the event will not occur, i.e., p / (1 − p)
In sports, instead of stating the probability of winning, people will often talk about the odds of winning
Binomial Logistic Regression Model
Dealing with Missing Data
imputation
dummy variable
Dealing with Outliers
The principle behind the rule is that any value more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile is labeled as an outlier and should be removed from the data
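The 1.5 × IQR rule sketched on a hypothetical vector:

```r
x <- c(12, 15, 14, 13, 90, 16, 11, 14)   # hypothetical values with one extreme point

q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- q3 - q1

x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]   # values flagged as outliers (here: 90)
```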
A symmetric distribution is one where the data is evenly balanced on both sides of the mean (or center point)
For left skewed (or negative) distributions, the mean is less than the median.
For right skewed (or positive) distributions, the tail is longer on the right side than on the left and the mean is larger than the median
Splitting the Data
Dealing with Class Imbalance
synthetic minority oversampling technique (SMOTE): this technique works by creating new synthetic samples from the minority class to resolve the imbalance
Training a Model
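A minimal sketch of training a binomial logistic regression with glm(); mtcars' am column stands in for a real binary label:

```r
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

summary(fit)                              # coefficients on the log-odds scale
probs <- predict(fit, type = "response")  # predicted probabilities of am = 1
preds <- ifelse(probs > 0.5, 1, 0)        # classify with an assumed 0.5 cutoff
```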
Part 3: Classification
Chapter 6 k-Nearest Neighbors
Chapter 7 Naive Bayes
Chapter 8 Decision Tree
Recursive Partitioning
Entropy
a quantification of the level of impurity or randomness that exists within a partition
Information Gain
the decision tree algorithm would evaluate all the features and their corresponding values to determine which split would result in the largest reduction in entropy.
Weakness: it tends to be biased toward features with a high number of distinct values.
Gain Ratio: a modification of information gain that reduces its bias on highly branching features by taking into account the number and size of branches when choosing a feature
Gini Impurity
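The standard formulas behind these splitting criteria, for a partition S with class proportions p_i and a candidate split of S on feature F into subsets S_j:

```latex
H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
\qquad
\mathrm{Gini}(S) = 1 - \sum_{i=1}^{c} p_i^2
\qquad
\mathrm{Gain}(S, F) = H(S) - \sum_{j} \frac{|S_j|}{|S|} \, H(S_j)
```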
Part 4: Unsupervised Learning
Chapter 11 Discovering Patterns with Association Rules
Market Basket Analysis / Affinity Analysis
Market basket data
: provides a wealth of information about customer behavior and actionable insight to businesses
Any combination of items that could be purchased together within a transaction is known as an itemset.
an itemset does not always refer to all the items that were purchased by a customer, but any combination of items that could have been purchased together
Association Rule
: the description of the relationship between items and itemsets
IF-THEN format
antecedent: left side; consequent: right side
association rules allow for one or more items in the antecedent, but only one item in the consequent
Association rules can also have a length of 1. Such a rule has a consequent but no antecedent
Each rule that is generated has to be evaluated by a user for qualitative usefulness.
Actionable
: These are rules that provide clear and useful insights that can be acted upon.
Trivial
: These are rules that provide insight that is already well-known by those familiar with the domain.
Inexplicable
: These are rules that defy rational explanation, need more research to understand, and do not suggest a clear course of action
Identifying Strong Rules
The frequency of an itemset is measured using a metric known as support or coverage. The support of an itemset is defined as the fraction of transactions within the dataset that contain the itemset.
The measure we use to quantify the conditional probability that a transaction selected at random contains the itemset in the consequent, given that the transaction contains the itemset in the antecedent, is known as the confidence or accuracy.
The increased or decreased likelihood of both the antecedent and the consequent occurring together, compared to the typical rate of occurrence of the consequent alone, is known as the lift.
The Apriori Algorithm
: to minimize the computational cost of this process, the Apriori algorithm is used to limit the number of itemsets generated
anti-monotone property of support: if an itemset is infrequent, then its supersets are infrequent as well
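A hedged sketch of mining association rules with the arules package; the file name and the support/confidence thresholds are assumptions:

```r
library(arules)

# Read basket-format transaction data (one transaction per line, items comma-separated).
trans <- read.transactions("transactions.csv", format = "basket", sep = ",")

# Generate rules that meet minimum support and confidence thresholds.
rules <- apriori(trans,
                 parameter = list(support = 0.01, confidence = 0.5, minlen = 2))

inspect(sort(rules, by = "lift")[1:10])   # examine the top 10 rules by lift
```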