Statistics Boot Camp

R Programming

Basic Data Structures

Homogeneous
(1 data type)

Heterogeneous
(>1 data type)

Vector: 1D
(1.2 p21-37)

Matrix: 2D
(1.2 p51-55)

Array: Multi D

List: 1D
(1.2 p42-47)

Data Frame: 2D
(1.2 p58-70)

Basic Data Types

Numeric - (10.5, 55, 787)

Integer - (1L, 55L, 100L, where the letter "L" declares this as an integer)

Complex - (9 + 3i, where "i" is the imaginary part)

Character (a.k.a. string) - ("k", "R is exciting", "FALSE", "11.5")

Logical (a.k.a. boolean) - (TRUE or FALSE)

Factor - for categorical data
(1.3 p4-7)

Functions

*apply()
(1.4 p13-18)

Iterating over data

dplyr package
(1.4 p21-33)

Managing data

Data Visualization

Bar Chart
(2.1 p9-12)

Pie Chart
(2.1 p13-15)

Line Chart
(2.1 p16-19)

Histogram
(2.1 p20-22)

Scatter Plot
(2.1 p23-26)

Box Plot
(2.1 p27-29)

Sampling

Selection of a subset of individuals from within a statistical population to estimate characteristics of the whole population

Techniques

Simple Random Sampling
(2.2 p12)

Systematic Sampling
(2.2 p13)

Stratified Random Sampling
(2.2 p14)

Cluster Sampling
(2.2 p15)

Confidence Interval
(2.3 p18-28)

Standard Error

Continuous Probability Distributions

Normal Distribution
(2.3 p3-8)

Sampling Distribution
(2.2 p24)

Normality Test
(2.3 p29-31)

https://www.statology.org/sampling-methods/

T-Distribution

Binomial Distribution

Poisson Distribution

Exponential Distribution

Central Limit Theorem
(2.3 p11-17)

A probability distribution of a statistic obtained from a large number of samples drawn from a specific population

Multiple samples create a sampling distribution

States that the sampling distribution of the sample means approaches a normal distribution as the sample size gets larger, even if the population distribution is not normal.

As a rule of thumb, the normal approximation is treated as adequate for sample sizes over 30.
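The Central Limit Theorem statement above can be checked with a small simulation. The sketch below is written in Python (standard library only) purely as an illustration; the exponential population, the seed, and the function name sample_means are assumptions chosen for the example, not part of the course material.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable


def sample_means(sample_size, n_samples=2000):
    """Draw n_samples samples from a skewed Exponential(rate=1) population
    (mean 1, SD 1) and return the mean of each sample."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


for n in (2, 30, 200):
    means = sample_means(n)
    # The mean of the sample means should stay near 1, and their spread
    # should shrink roughly like 1 / sqrt(n) as n grows.
    print(n, round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```

Plotting a histogram of `means` for each n would show the shape moving from skewed towards bell-shaped as the sample size grows.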

Confidence Interval = (point estimate) +/- (critical value)*(standard error)
where:
critical value from Z-table (Normal distribution table) if population SD is known and sample size > 30
critical value from t-table if population SD is unknown or sample size < 30
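As a worked instance of the confidence interval formula, here is a minimal Python sketch for the Z-table case. The function name z_confidence_interval and the numbers (sample mean 50, population SD 10, n = 100) are hypothetical. The t-table case is not shown because the Python standard library has no t-distribution.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(point_estimate, standard_error, confidence=0.95):
    """Return (lower, upper) = point estimate +/- critical value * standard error."""
    # Two-sided critical value from the standard normal distribution,
    # e.g. about 1.96 for a 95% confidence level.
    critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = critical * standard_error
    return point_estimate - margin, point_estimate + margin


# Hypothetical numbers: sample mean 50, known population SD 10, n = 100.
low, high = z_confidence_interval(50.0, 10.0 / sqrt(100))
print(round(low, 2), round(high, 2))  # 48.04 51.96
```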

Range of values that is likely to contain a population parameter with a certain confidence level

Confidence Level

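Proportion of intervals, built the same way from repeated samples, that would contain the population parameter (e.g. 95%). The parameter is fixed; the Lower and Upper Confidence Limits vary from sample to sample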

Mean
(2.2 p22-25)

Standard Error = SD / sqrt(n)
where:
SD = standard deviation
n = sample size

Proportion

Standard Error = sqrt( ((P)(1-P)) / n)
where:
P = population proportion
n = sample size
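The two standard error formulas translate directly into code. This is a small Python sketch; the function names and the example inputs are made up for illustration.

```python
from math import sqrt


def standard_error_mean(sd, n):
    """Standard error of the mean: SD / sqrt(n)."""
    return sd / sqrt(n)


def standard_error_proportion(p, n):
    """Standard error of a proportion: sqrt(P * (1 - P) / n)."""
    return sqrt(p * (1 - p) / n)


# Hypothetical inputs.
print(standard_error_mean(12.0, 36))        # 12 / sqrt(36) = 2
print(standard_error_proportion(0.5, 100))  # sqrt(0.25 / 100), about 0.05
```

Both formulas shrink with sqrt(n), so quadrupling the sample size halves the standard error.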

Determine whether sample data has been drawn from a normally distributed population

Use histogram and Q-Q plot to conduct informal Normality Test

Hypothesis Testing

Five steps

State the hypotheses
(3.1 p15-20)

Determine a significance level
(3.1 p16-23)

Identify the test statistic

Reject or fail to reject the null hypothesis

Interpret the results

Decision Errors
(3.1 p22-23)

Type I error
(False Positive)

Reject the null hypothesis when it is actually true

Probability of a Type I error = alpha, α (significance level)

Type II error
(False Negative)

Fail to reject the null hypothesis when it is actually false

Probability of a Type II error = beta, β

Example

Null: No COVID
Alternate: COVID positive

Type I error: Tested COVID positive (reject null) when there is no COVID

Type II error: Tested negative (fail to reject null) when actually COVID positive
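The link between the significance level and the Type I error rate can be illustrated by simulation: when the null hypothesis is true, a test at alpha = 0.05 should reject in roughly 5% of repeated samples. The Python sketch below assumes a two-tailed Z-test on normally distributed data with known SD; the names, seed, and sample sizes are illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

random.seed(0)  # fixed seed so the run is repeatable
ALPHA = 0.05
critical = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-tailed critical value


def rejects_null(n=40, mu0=0.0, sigma=1.0):
    """Sample from a population where the null (mean = mu0) is TRUE
    and report whether a two-tailed Z-test rejects it anyway."""
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    z = (fmean(sample) - mu0) / (sigma / sqrt(n))
    return abs(z) > critical


trials = 5000
# Every rejection here is a Type I error, because the null is true by construction.
type_i_rate = sum(rejects_null() for _ in range(trials)) / trials
print(type_i_rate)
```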

Null Hypothesis

A precise and testable statement

Assumes no relationship between the variables

Alternative Hypothesis

A statement that is being tested against the null hypothesis

Significance Level

Probability of rejecting the null hypothesis when it is true

Variable of interest (y) is Continuous or Categorical?

Continuous

Categorical or Proportional

1 Sample
(3.1 p24-53)

Population SD known and sample size > 30: Single Sample Z-Test

Population SD unknown or sample size < 30: Single Sample T-Test
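A single sample Z-test can be sketched in a few lines of Python. The function name one_sample_z_test and the numbers are hypothetical. The T-test version follows the same pattern with the sample SD and a t-distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, population_sd, n):
    """Two-tailed single sample Z-test; returns (z statistic, p-value)."""
    z = (sample_mean - mu0) / (population_sd / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: sample mean 52 from n = 100, testing H0: mean = 50
# with a known population SD of 10.
z, p_value = one_sample_z_test(52.0, 50.0, 10.0, 100)
print(round(z, 2), round(p_value, 4))
```

With these inputs z is 2 and the p-value falls just under 0.05, so the null hypothesis would be rejected at the 5% significance level.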

2 Samples
(3.1 p54)

One-Tailed and Two-Tailed Tests

One-Tailed: hypothesis involves making a “greater than” or “less than” statement

Two-Tailed: hypothesis involves making an “equal to” or “not equal to” statement
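The choice between a one-tailed and a two-tailed test changes the p-value for the same test statistic. A minimal Python sketch, assuming a Z statistic has already been computed (the value 1.8 and the function name tail_p_values are made up for the example):

```python
from statistics import NormalDist


def tail_p_values(z):
    """p-values for the three possible alternatives, given a Z statistic."""
    cdf = NormalDist().cdf(z)
    return {
        "greater than": 1 - cdf,             # one-tailed, upper tail
        "less than": cdf,                    # one-tailed, lower tail
        "not equal to": 2 * min(cdf, 1 - cdf),  # two-tailed
    }


p = tail_p_values(1.8)  # hypothetical test statistic
for alternative, value in p.items():
    print(alternative, round(value, 4))
```

For z = 1.8 the "greater than" p-value is below 0.05 while the "not equal to" p-value is above it, so the same data can lead to different decisions depending on how the hypothesis was stated.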

Independent Samples
(3.1 p55-59)

Population SD known and sample size > 30: 2 Samples Z-Test

Population SD unknown or sample size < 30: 2 Samples T-Test

Paired Samples
(3.1 p60-62)

Population SD known and sample size > 30: Paired Z-Test

Population SD unknown or sample size < 30: Paired T-Test

Many Independent Samples

1 Independent Variable: One-way ANOVA
(3.2 p9-28)

2 Independent Variables: Two-way ANOVA
(3.2 p29-43)

Hypotheses

Null: All the population means are equal

Alternate: At least one population mean is different from the rest
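The one-way ANOVA hypotheses are tested with an F statistic, the ratio of between-group to within-group variation. The Python sketch below computes only that statistic on made-up data; turning it into a p-value needs the F-distribution, which is not in the Python standard library. The function name one_way_anova_f is hypothetical.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """F statistic = mean square between groups / mean square within groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = fmean(all_values)
    k = len(groups)      # number of groups
    n = len(all_values)  # total number of observations

    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)  # between-group degrees of freedom
    ms_within = ss_within / (n - k)    # within-group degrees of freedom
    return ms_between / ms_within


# Made-up data: three groups with means 2, 3 and 6.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_stat, 2))
```

A large F statistic means the group means differ by more than the within-group noise would explain, which counts as evidence against the null hypothesis.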

https://www.statology.org/one-way-anova/

Multiple Comparison Tests
(3.2 p24-27)

To determine which population means are different

Tukey HSD
Bonferroni Method
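The Bonferroni Method divides the significance level by the number of comparisons being made. A minimal Python sketch, assuming all pairwise comparisons between groups are tested (the function name bonferroni_alpha is hypothetical):

```python
from math import comb


def bonferroni_alpha(alpha, n_groups):
    """Return (per-comparison alpha, number of pairwise comparisons)."""
    n_comparisons = comb(n_groups, 2)  # every pair of groups is compared once
    return alpha / n_comparisons, n_comparisons


# Hypothetical case: 4 groups compared pairwise at an overall alpha of 0.05.
adjusted_alpha, n_comparisons = bonferroni_alpha(0.05, 4)
print(n_comparisons, round(adjusted_alpha, 4))
```

Each pairwise p-value is then compared against the adjusted alpha instead of the original 0.05.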

https://www.statology.org/two-way-anova/

3 p-values, 2 for the independent variables, 1 for the interaction between the 2 variables

1 Sample and 2 Options

One Proportion Z-Test
(3.1 p64-66)
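A One Proportion Z-Test compares a sample proportion against a hypothesised population proportion, using the proportion standard error under the null. The Python sketch below uses invented counts; the function name one_proportion_z_test is an assumption for the example.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0):
    """Two-tailed test of H0: population proportion = p0.
    Returns (z statistic, p-value)."""
    p_hat = successes / n
    standard_error = sqrt(p0 * (1 - p0) / n)  # uses p0, the value under H0
    z = (p_hat - p0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Invented data: 60 "yes" answers out of 100, testing H0: proportion = 0.5.
z, p_value = one_proportion_z_test(60, 100, 0.5)
print(round(z, 2), round(p_value, 4))
```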

Regression

Variable of interest (y) is Binary

Variable of interest (y) is Continuous

Linear Regression

Steps

Data Exploration

Check correlation
(4.1 p9-20)

Build model

Evaluation of model

Do residuals analysis

Ensure variables are significant

Model testing using test data

Absolute value of the Pearson correlation coefficient, |r|, > 0.7 for correlation to be considered strong (r close to -1 is as strong as r close to +1)
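The Pearson correlation coefficient can be computed by hand from the deviations around each mean. This Python sketch uses a small made-up dataset; pearson_r is a hypothetical helper name.

```python
from math import sqrt
from statistics import fmean


def pearson_r(x, y):
    """Pearson correlation: covariation of x and y over the product of their spreads."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4), "strong" if abs(r) > 0.7 else "not strong")
```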

Check Goodness of fit of model
(4.1 p28-32)

p-value for coefficients < 0.05

Linearity using Residuals VS Fitted Plot
(4.1 p43)

Normality

Equal variance using Scale-Location Plot
(4.1 p45)

Outliers check using Residuals vs Leverage Plot
(4.1 p46)

Q-Q Plot
(4.1 p44)

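Shapiro-Wilk Test: p-value > 0.05 means normality is not rejected; p-value < 0.05 indicates the residuals are not normally distributed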
(4.1 p41)

Check multicollinearity

All variables are Continuous

Correlation matrix
(4.2 p6)

VIF
(4.2 p7)

If VIF for a variable > 5, the variable is highly correlated with the other variables

If multiple variables have VIF > 5, drop the variable with the highest VIF, rerun the multiple regression, recompute the VIFs, and repeat until no variable has VIF > 5.
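The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The Python sketch below shows only this last step and assumes the R² values are already available; the function names and the R² inputs are hypothetical.

```python
def vif(r_squared):
    """Variance inflation factor from the R-squared obtained by regressing
    one predictor on all the other predictors."""
    return 1.0 / (1.0 - r_squared)


def is_highly_correlated(r_squared, threshold=5.0):
    """Apply the VIF > 5 rule of thumb from the notes."""
    return vif(r_squared) > threshold


# Hypothetical R-squared values for three predictors.
for r2 in (0.5, 0.75, 0.9):
    print(r2, round(vif(r2), 2), is_highly_correlated(r2))
```

A VIF of 5 corresponds to an R² of 0.8, meaning 80% of that predictor's variation is explained by the other predictors.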

Ensure model is better than null model

p-value for model < 0.05

Algorithms to improve regression accuracy

Variables selection methods

Backward elimination
(4.2 p24)

Stepwise
(4.2 p24)

Logistic Regression
(5.2 p30-41)

Eliminate variables that are highly correlated with each other, keeping the one with the highest correlation with y

Contains Categorical variables

Continuous-Categorical

Perform one-hot encoding
(5.1 p6-14)
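One-hot encoding turns a categorical column into one 0/1 column per level. The Python sketch below is a bare-bones illustration with a made-up column; the helper one_hot and its drop_first option are assumptions for the example. In a regression, one level is usually dropped to act as the baseline, which is what drop_first imitates.

```python
def one_hot(values, drop_first=False):
    """Return (levels, rows) where each row holds one 0/1 indicator per level."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]  # the dropped level becomes the baseline (all zeros)
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Made-up categorical column.
colours = ["red", "green", "red", "blue"]
levels, rows = one_hot(colours)
print(levels)
for row in rows:
    print(row)
```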

1 Categorical

Categorical-Categorical

Chi-Square
(5.1 p17)

one-way ANOVA
(5.1 p16)

2 Categorical

two-way ANOVA

Do scatter plot of Y vs X

Quadratic Regression
(5.2 p10-24)

Check interaction effects
(5.1 p21-24)

If the predictor variables are highly correlated

How y reacts to different levels or level combinations of the predictors (Xs)

Interaction is significant when the interaction p-value < 0.05

If the interaction is significant, the predictors involved must stay in the model, even if an individual p-value is not significant

Groups must be mutually exclusive and exhaustive

Check linear relationship between X and Y

Check distribution of Y

Identify and deal with outliers

Transform if not normal

Check for outliers

If the relationship is non-linear, try transforming the variables to make it linear
(5.2 p9)

As a rule of thumb, adjusted R-squared > 0.7 indicates a good model fit

Fit linear regression
(4.1 p24-27)
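Fitting a simple linear regression comes down to the least-squares slope and intercept, plus R-squared as a goodness-of-fit measure. The Python sketch below covers the one-predictor case on made-up data; fit_line is a hypothetical helper name.

```python
from statistics import fmean


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x.
    Returns (slope, intercept, r_squared)."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # R-squared: share of the variation in y explained by the fitted line.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - y_bar) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Made-up data.
slope, intercept, r_squared = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(slope, 3), round(intercept, 3), round(r_squared, 3))
```

The residuals used in the residual plots are the differences `b - (intercept + slope * a)` computed inside the function.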

Split to train and test dataset
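Splitting into train and test datasets usually means shuffling the rows and holding a fraction back. This is a minimal Python sketch; the 80/20 split, the seed, and the function name train_test_split are illustrative assumptions.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=1):
    """Shuffle the rows reproducibly and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local generator, fixed seed
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of 10 row indices.
train, test = train_test_split(range(10))
print(len(train), len(test))
```

The model is built on `train` only, and `test` is kept aside for the final model testing step.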

using interaction plots

Forward selection
(4.2 p23)

Cross-Validation
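In k-fold cross-validation, the data is cut into k folds and each fold takes a turn as the test set while the rest is used for training. The Python sketch below only generates the row indices for each round; the function name k_fold_indices and the sizes used are hypothetical.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_rows))
    folds = [indices[i::k] for i in range(k)]  # every k-th row goes to fold i
    for test_fold in folds:
        held_out = set(test_fold)
        train = [i for i in indices if i not in held_out]
        yield train, test_fold


# Hypothetical dataset of 10 rows, 5 folds.
splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print(train, test)
```

Averaging the model's error over the k test folds gives a more stable estimate than a single train/test split. Shuffling the rows first is advisable when the data is ordered.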