Practical Machine Learning in R
Part 1: Getting Started
Ch 1 What is Machine Learning
Machine Learning Techniques
Unsupervised Learning: help us discover new labels, or groupings, of the observations in our dataset
market basket: association rules
market segmentation: clustering
Algorithm: a set of steps that you follow when carrying out a process
Artificial Intelligence, Machine Learning, and Deep Learning
Supervised Learning: help us assign known labels to new observations
Reinforcement Learning: learn based on trial and error, similar to the way that a young child learns the rules of a home by being rewarded and punished.
Model Selection
Classification techniques: train models that allow us to predict membership in a category / Supervised
Regression: allow us to predict a numeric result / Supervised
Similarity learning: help us discover the ways that observations in our dataset resemble and differ from each other / Unsupervised or Supervised
Model Evaluation
Classification Errors
False positive errors / Type I errors <-> True positive
False positive rate (FPR) = FP / (FP + TN)
False negative error / Type II errors <-> True negative
False negative rate (FNR) = FN / (FN + TP)
Regression Errors
Residual value
residual sum of squares (RSS): the sum of the squared residuals, RSS = Σ(yᵢ − ŷᵢ)²; related to the variance of the errors rather than the standard deviation itself
Types of Errors (in the world of machine learning)
Bias
: When the model type that we choose is unable to fit our dataset well, the resulting error is bias
Variance
: the dataset that we use to train our machine learning model is not representative of the entire universe of possible data
Irreducible error, or noise
Underfitting
: high bias, low variance
Overfitting
: low bias, high variance
Partitioning Datasets
Test dataset
: assess the performance of model
Validation dataset
: help develop the model in an iterative process, adjusting the parameters of the model during each iteration
Holdout method: set aside portions of the original dataset for validation and testing purposes at the beginning of the model development process
Cross-Validation Methods: particularly useful for smaller datasets where it is undesirable to reserve a portion of the dataset for validation purposes
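A minimal sketch of the holdout method in base R, using the built-in iris data and an assumed 75/25 split:

```r
# Holdout method: set aside part of the data for testing before any modeling.
set.seed(1234)                                  # make the split reproducible
n <- nrow(iris)                                 # iris ships with base R
train_idx <- sample(n, size = floor(0.75 * n))  # 75% of row numbers, chosen at random
train_data <- iris[train_idx, ]                 # training partition
test_data  <- iris[-train_idx, ]                # holdout (test) partition
```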
Ch 2 Introduction to R and RStudio
R
Interpreted language
: the code that you write is stored in a document called a script, and this script is the code that is directly executed by the system processing the code <-> compiled language: code that a compiler translates into an executable form before it runs
Data types in R
logical: flags
(Default) numeric: decimal number
Cf. double: short for double-precision floating point
integer
character
factor: categorical values /
level
ordered factor
Vector
coercion: If you attempt to create a vector with varying data types, R will force them all to be the same data type
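A quick illustration of coercion with mixed-type vectors:

```r
# Mixing data types in a vector forces everything to a single common type.
c(1, 2, TRUE)        # logical coerced to numeric: 1 2 1
c(1, "a", TRUE)      # everything coerced to character: "1" "a" "TRUE"
class(c(1L, 2.5))    # integer coerced to numeric: "numeric"
```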
Ch 3 Managing Data
tidyverse
: facilitate the entire analytics process
readr
: importing data
tibble
: storing data
dplyr
: manipulating data
ggplot2
: visualizing data
tidyr
: transforming data
purrr
: functional programming
stringr
: manipulating strings
lubridate
: manipulating dates and times
Data Collection
: process of identifying and acquiring the data needed for the machine learning process
Key Considerations
Collecting Ground Truth Data: can come with an existing label based on a prior event, such as whether a bank customer defaulted on a loan or not, or can require that a label be assigned to it by a domain expert
Data Relevance
Quantity of Data
: Understanding the strengths and weaknesses of each approach provides us with the guidance needed to determine how much data is enough for the learning task.
Ethics
Importing the Data
Reading Comma-Delimited Files
Reading Other Delimited Files
To read a tab-delimited (TSV) file, we use the read_tsv() function
To read a pipe-delimited file, we set delim = "|" in the read_delim() function
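A sketch of the three readr import patterns; the file names are placeholders:

```r
library(readr)

customers <- read_csv("customers.csv")              # comma-delimited
survey    <- read_tsv("survey.tsv")                 # tab-delimited
orders    <- read_delim("orders.txt", delim = "|")  # pipe-delimited
```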
Data Exploration: we often need to describe the characteristics of the data with the use of statistical summaries and visualizations
Describing the Data
instance: a row of data. We will sometimes refer to instances as records, examples, or observations.
feature: a column of data. We will sometimes refer to features as columns or variables.
discrete feature
continuous feature
independent variables: the features that describe our data
dependent variable: the feature that represents the label
For classification problems, the dependent variable is also referred to as the class, and for regression problems, it is referred to as the response.
Dimensionality
Data sparsity and density
Resolution
Descriptive statistics or summary statistics
The frequency of a feature value tells us how often the value occurs, typically used to describe categorical data
The mode of the feature tells us which value occurs the most for that feature, typically used to describe categorical data
mean and median, for continuous data
The median of a set of values is sometimes preferred over the mean because it is not impacted as much by a small proportion of extremely large or small values
summary(): shows only the top six feature values in terms of count <-> table()
arrange: sorting rows
mutate: modifying variables
select: feature/column
filter: observation/row
summarize
pipe: control the logical flow of our code
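A minimal sketch chaining the dplyr verbs with the pipe on the built-in mtcars data; the specific columns chosen are just for illustration:

```r
library(dplyr)

mtcars %>%
  filter(cyl == 4) %>%              # keep rows (observations) with 4 cylinders
  select(mpg, wt) %>%               # keep only these columns (features)
  mutate(wt_kg = wt * 453.6) %>%    # wt is in 1000 lb; add a kilograms version
  arrange(desc(mpg)) %>%            # sort rows by mpg, highest first
  summarize(avg_mpg = mean(mpg))    # collapse to a single summary row
```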
Visualizing the Data
Comparison
box plots
grammar of graphics
Relationship
scatter plots
Distribution
histogram
Composition
stacked bar charts
pie charts
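Sketches of one chart per purpose using ggplot2's grammar of graphics and its built-in mpg dataset:

```r
library(ggplot2)

ggplot(mpg, aes(x = class, y = hwy)) + geom_boxplot()    # comparison: box plots
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()      # relationship: scatter plot
ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 20)    # distribution: histogram
ggplot(mpg, aes(x = class, fill = drv)) + geom_bar()     # composition: stacked bar chart
```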
Data Preparation
Cleaning the Data
Missing value
Imputation: the use of a systematic approach to fill in missing data using the most probable substitute values
Random imputation
Match-based imputation
hot-deck imputation
cold-deck imputation
Distribution-based imputation
Predictive imputation
Mean or median imputation
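A minimal sketch of median imputation; the income column and its values are hypothetical:

```r
library(dplyr)

df <- tibble::tibble(income = c(52000, NA, 61000, 47000, NA, 58000))

df <- df %>%
  mutate(income = ifelse(is.na(income),
                         median(income, na.rm = TRUE),  # substitute the median for NA
                         income))
```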
Noise: the random component of measurement error
Smoothing with bin means
: sorting and grouping the data into a defined number of bins and replacing each value within a bin with the mean value for the bin
Smoothing with bin boundaries
: replace the values by either one of the bin boundaries based on proximity
Smoothing by Clustering
Smoothing by regression
: to use a fitted regression line as a substitute for the original data
Outliers
Class Imbalance
minority class
majority class
accuracy paradox
Transforming the Data
standardization or normalization
Decimal scaling
: moving the position of the decimal point on a set of values
z-score, or zero mean normalization: the approach results in normalized values that have a mean of 0 and a standard deviation of 1
min-max normalization
: transform the original data from the measured units to a new interval defined by user-specified lower and upper bound
Log Transformation
: For skewed distributions and data with values that range over several orders of magnitude, the log transformation is usually more suitable. With log transformation, we replace the values of the original data with their logarithms.
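Sketches of z-score, min-max, and log transformations on a hypothetical skewed vector x:

```r
x <- c(12000, 24000, 30500, 41000, 98000)     # hypothetical values

z_score <- (x - mean(x)) / sd(x)              # zero mean normalization: mean 0, sd 1
min_max <- (x - min(x)) / (max(x) - min(x))   # min-max normalization onto [0, 1]
logged  <- log(x)                             # log transformation for skewed data
```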
Discretization: treating continuous features as if they are categorical
dichotomization: discretize continuous features into binary values by coding them in terms of how they compare to a reference cutoff value
Dummy Coding: use of dichotomous (binary) numeric values to represent categorical features
full dummy coding, or one-hot encoding
baseline: The choice of which value to use as the baseline is often arbitrary or dependent on the question that a user is trying to answer.
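Dummy coding sketched with base R's model.matrix(); the color factor is hypothetical:

```r
color <- factor(c("red", "green", "blue", "green"))

model.matrix(~ color)       # dummy coding: the baseline level ("blue") is dropped
model.matrix(~ color - 1)   # full dummy coding / one-hot encoding: one column per level
```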
Reducing the Data
sampling
: the process of selecting a subset of the rows in the dataset as a proxy for the whole <-> original dataset: the population
simple random sampling
random sampling without replacement
sample set vector: a list of integer values that represent the row numbers in the original dataset
random sampling with replacement
stratified random sampling: ensures that the distribution of feature values within the sample matches the distribution of values for the same feature in the overall population
strata
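Sketches of the three sampling approaches; the dataset size, sampling proportion, and use of dplyr's slice_sample() for stratification are assumptions:

```r
library(dplyr)
set.seed(1234)

sample(500, size = 50)                  # simple random sampling without replacement
sample(500, size = 50, replace = TRUE)  # random sampling with replacement

iris %>%                                # stratified random sampling: sample within each stratum
  group_by(Species) %>%
  slice_sample(prop = 0.2) %>%
  ungroup()
```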
Dimensionality Reduction: the reduction in the number of features (dimensions) of a dataset prior to training a model
feature selection, or variable subset selection
feature extraction, or feature projection
principal component analysis (PCA)
non-negative matrix factorization (NMF)
Part 2: Regression
Ch4 Linear Regression
Relationships between variables
Correlation
: provides a single numeric value of the relationship between the variables, which is known as the correlation coefficient
Pearson’s correlation coefficient: ranges from -1 to +1, with larger absolute values indicating a strong relationship between variables and smaller absolute values indicating a weak relationship. Denoted as rho (ρ).
View absolute coefficient values of 0 to 0.3 as nonexistent to weak, above 0.3 to 0.5 as moderate, and above 0.5 as strong
The standard deviation of a variable is a measurement of the amount of variability present. Denoted as sigma (σ).
The covariance between two variables measures their joint variability.
The correlation between two variables is a normalized version of covariance.
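A quick numeric check of that relationship on the built-in mtcars data:

```r
x <- mtcars$wt
y <- mtcars$mpg

cov(x, y) / (sd(x) * sd(y))   # covariance scaled by the two standard deviations
cor(x, y)                     # Pearson's correlation coefficient: the same value
```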
Regression
Response Variable (Y)
Predictors (X)
Coefficient (Beta)
Simple Linear Regression: uses a single independent variable to predict the dependent variable
Ordinary Least Squares Method
residual sum of squares or sum of squared errors
Simple Linear Regression Model
Evaluating the Model
The difference between our predictions and the actual values is known as the error or residual.
Diagnostics
Residual Standard Error: a measure of lack of fit for a model
The degrees of freedom value provides the number of data points in our model that are variable
Multiple and Adjusted R-squared
: independent of the scale of Y and takes the form of a proportion with values ranging from 0 to 1
Also known as the coefficient of determination; it explains how well our model explains the values of the dependent variable.
F-statistic
: a statistical test of whether there exists a relationship between the predictor and the response variables.
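A sketch of fitting a simple linear regression and reading those diagnostics from summary(); the mtcars variables are only illustrative:

```r
fit <- lm(mpg ~ wt, data = mtcars)   # ordinary least squares fit

summary(fit)      # coefficients, residual standard error, R-squared, F-statistic
residuals(fit)    # residuals: actual values minus fitted values
```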
Multiple Linear Regression
The Multiple Linear Regression Model
Evaluating the Model
Residual Diagnostics
Zero Mean of Residuals
Normality of Residuals
Homoscedasticity of Residuals
the Breusch-Pagan statistical test
Residual plot
Residual Autocorrelation: the correlation of a variable with itself at different points in time
the Durbin-Watson (DW) test
Influential Point Analysis
Cook’s distance: measures the effect of removing an observation from a model.
Multicollinearity: a phenomenon that occurs when two or more predictor variables are highly correlated with each other
variance inflation factor (VIF): the measure of how much the variance of the estimated regression coefficient for that variable is inflated by the existence of correlation among the predictor variables in the model.
Tolerance: can be thought of as the percent of variance in predictor k that cannot be accounted for by the other predictors.
One approach is to drop one of the problematic variables from the model, while the other approach is to combine the collinear predictors into a single variable
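A sketch computing tolerance and VIF for one predictor by hand via an auxiliary regression (packages such as car also provide a vif() function); the mtcars predictors are only illustrative:

```r
aux  <- lm(wt ~ hp + disp, data = mtcars)   # regress predictor k on the other predictors
r2_k <- summary(aux)$r.squared

tolerance <- 1 - r2_k        # share of wt's variance not accounted for by the others
vif_k     <- 1 / tolerance   # variance inflation factor for wt
```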
Improving the Model
Considering Nonlinear Relationships
polynomial regression
Considering Categorical Variables
Considering Interactions Between Variables
Interaction effect: There are situations where two variables have a combined effect on the response.
Selecting the Important Variables
forward selection
: begin with the intercept and then create several simple linear regression models based on the intercept and each individual predictor
backward selection
: involves creating a model with all our predictors and then removing the predictor that is least statistically significant (based on the p-value)
mixed selection
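A sketch of forward, backward, and mixed selection with base R's step(); the mtcars models are placeholders for real candidate models:

```r
null_model <- lm(mpg ~ 1, data = mtcars)    # intercept only
full_model <- lm(mpg ~ ., data = mtcars)    # all predictors

step(null_model, scope = formula(full_model), direction = "forward")   # forward selection
step(full_model, direction = "backward")                               # backward selection
step(null_model, scope = formula(full_model), direction = "both")      # mixed selection
```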
Strengths and Weaknesses
Ch5 Logistic Regression
Classification
logistic regression: models the probability of a particular response value
sigmoid curve
Maximum likelihood estimation (MLE): identify the values for β0 and β1 that best approximate the relationship between X and Y
odds or odds ratio: the likelihood (or probability) that the event will occur expressed as a proportion of the likelihood that the event will not occur, i.e., p / (1 − p)
In sports, instead of stating the probability of winning, people will often talk about the odds of winning
Binomial Logistic Regression Model
Dealing with Missing Data
imputation
dummy variable
Dealing with Outliers
The principle behind the rule is that any value more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile is labeled as an outlier and should be removed from the data
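The 1.5 × IQR rule sketched on a hypothetical vector:

```r
x <- c(12, 15, 14, 13, 90, 16, 11, 14)   # hypothetical values with one extreme point

q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- q3 - q1

x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]   # values flagged as outliers (here: 90)
```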
A symmetric distribution is one where the data is evenly balanced on both sides of the mean (or center point)
For left skewed (or negative) distributions, the mean is less than the median.
For right skewed (or positive) distributions, the tail is longer on the right side than on the left and the mean is larger than the median
Splitting the Data
Dealing with Class Imbalance
synthetic minority oversampling technique (SMOTE): this technique works by creating new synthetic samples from the minority class to resolve the imbalance
Training a Model
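A minimal sketch of training a binomial logistic regression with glm(); mtcars' am column stands in for a real binary label:

```r
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

summary(fit)                              # coefficients on the log-odds scale
probs <- predict(fit, type = "response")  # predicted probabilities of am = 1
preds <- ifelse(probs > 0.5, 1, 0)        # classify with an assumed 0.5 cutoff
```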
Part 3: Classification
Chapter 6 k-Nearest Neighbors
Chapter 7 Naive Bayes
Chapter 8 Decision Tree
Recursive Partitioning
Entropy
a quantification of the level of impurity or randomness that exists within a partition
Information Gain
the decision tree algorithm would evaluate all the features and their corresponding values to determine which split would result in the largest reduction in entropy.
Weakness: it tends to be biased toward features with a high number of distinct values.
Gain Ratio: a modification of information gain that reduces its bias on highly branching features by taking into account the number and size of branches when choosing a feature
Gini Impurity
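The standard formulas behind these splitting criteria, for a partition S with class proportions p_i and a candidate split of S on feature F into subsets S_j:

```latex
H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
\qquad
\mathrm{Gini}(S) = 1 - \sum_{i=1}^{c} p_i^2
\qquad
\mathrm{Gain}(S, F) = H(S) - \sum_{j} \frac{|S_j|}{|S|} \, H(S_j)
```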
Part 4: Unsupervised Learning
Chapter 11 Discovering Patterns with Association Rules
Market Basket Analysis / Affinity Analysis
Market basket data
: provides a wealth of information about customer behavior and actionable insight to businesses
Any combination of items that could be purchased together within a transaction is known as an itemset.
an itemset does not always refer to all the items that were purchased by a customer, but any combination of items that could have been purchased together
Association Rule
: the description of the relationship between items and itemsets
IF-THEN format
antecedent: left side; consequent: right side
association rules allow for one or more items in the antecedent, but only one item in the consequent
Association rules can also have a length of 1. Such a rule has a consequent but no antecedent
Each rule that is generated has to be evaluated by a user for qualitative usefulness.
Actionable
: These are rules that provide clear and useful insights that can be acted upon.
Trivial
: These are rules that provide insight that is already well-known by those familiar with the domain.
Inexplicable
: These are rules that defy rational explanation, need more research to understand, and do not suggest a clear course of action
Identifying Strong Rules
The frequency of an itemset is measured using a metric known as support or coverage. The support of an itemset is defined as the fraction of transactions within the dataset that contain the itemset.
The measure we use to quantify the conditional probability that a transaction selected at random contains the itemset in the consequent, given that the transaction contains the itemset in the antecedent, is known as the confidence or accuracy.
The increased or decreased likelihood of both the antecedent and the consequent occurring together, compared to the typical rate of occurrence of the consequent alone, is known as the lift.
The Apriori Algorithm
: to minimize the computational cost of this process, the Apriori algorithm is used to limit the number of itemsets generated
anti-monotone property of support: if an itemset is infrequent, then its supersets are infrequent as well
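A hedged sketch of mining association rules with the arules package; the file name and the support/confidence thresholds are assumptions:

```r
library(arules)

# Read basket-format transaction data (one transaction per line, items comma-separated).
trans <- read.transactions("transactions.csv", format = "basket", sep = ",")

# Generate rules that meet minimum support and confidence thresholds.
rules <- apriori(trans,
                 parameter = list(support = 0.01, confidence = 0.5, minlen = 2))

inspect(sort(rules, by = "lift")[1:10])   # examine the top 10 rules by lift
```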