Statistics Boot Camp
R Programming
Basic Data Structures
Homogeneous
(1 data type)
Vector: 1D
(1.2 p21-37)
Matrix: 2D
(1.2 p51-55)
Array: Multi D
Heterogeneous
(>1 data type)
List: 1D
(1.2 p42-47)
Data Frame: 2D
(1.2 p58-70)
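As a minimal sketch of the five structures, assuming made-up values and variable names chosen only for illustration:

```r
# Homogeneous structures: every element shares one data type.
v <- c(10.5, 55, 787)                 # vector (1D)
m <- matrix(1:6, nrow = 2, ncol = 3)  # matrix (2D)
a <- array(1:24, dim = c(2, 3, 4))    # array (multi-dimensional)

# Heterogeneous structures: elements or columns may differ in type.
l <- list(name = "R", version = 4L, stable = TRUE)   # list (1D)
df <- data.frame(id = 1:3,
                 score = c(90.5, 72, 88),
                 passed = c(TRUE, FALSE, TRUE))      # data frame (2D)

# Mixing types in a vector coerces everything to one common type.
mixed <- c(1, "a", TRUE)
class(mixed)   # "character"
```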
Basic Data Types
Numeric - (10.5, 55, 787)
Integer - (1L, 55L, 100L, where the letter "L" declares this as an integer)
Complex - (9 + 3i, where "i" is the imaginary part)
Character (a.k.a. string) - ("k", "R is exciting", "FALSE", "11.5")
Logical (a.k.a. boolean) - (TRUE or FALSE)
Factor
(1.3 p4-7)
For categorical data
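The types can be checked with class(). The sketch below reuses the example values from the list; the factor `size` and its levels are made up for illustration:

```r
class(10.5)        # "numeric"
class(55L)         # "integer"
class(9 + 3i)      # "complex"
class("11.5")      # "character"
class(TRUE)        # "logical"

# A quoted value stays a character string until it is converted explicitly.
as.numeric("11.5") + 1   # 12.5

# Factors store categorical data as integer codes plus a fixed set of levels.
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"))
levels(size)       # "small" "medium" "large"
table(size)        # counts per category
```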
Functions
*apply()
(1.4 p13-18)
Iterating over data
dplyr package
(1.4 p21-33)
Managing data
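A small sketch of the *apply() family on a made-up data frame (`scores` is hypothetical). dplyr is left out here so the block runs with base R only:

```r
scores <- data.frame(math = c(80, 65, 90), science = c(70, 85, 95))

# apply(): iterate over rows (MARGIN = 1) or columns (MARGIN = 2).
row_totals <- apply(scores, 1, sum)

# sapply(): iterate over the columns and simplify the result to a vector.
col_means <- sapply(scores, mean)

# lapply(): same iteration, but the result is always a list.
col_ranges <- lapply(scores, range)

row_totals
col_means
```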
Data Visualization
Bar Chart
(2.1 p9-12)
Pie Chart
(2.1 p13-15)
Line Chart
(2.1 p16-19)
Histogram
(2.1 p20-22)
Scatter Plot
(2.1 p23-26)
Box Plot
(2.1 p27-29)
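One possible base-graphics call for each chart type, using the built-in mtcars data. The choice of columns is arbitrary, and the course material may use a different plotting package:

```r
cyl_counts <- table(mtcars$cyl)   # number of cars per cylinder count

barplot(cyl_counts, xlab = "Cylinders", ylab = "Cars")               # bar chart
pie(cyl_counts, main = "Cars by cylinder count")                     # pie chart
plot(sort(mtcars$mpg), type = "l", ylab = "mpg (sorted)")            # line chart
h <- hist(mtcars$mpg, xlab = "mpg", main = "Distribution of mpg")    # histogram
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "mpg")           # scatter plot
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "mpg")  # box plot
```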
Sampling
Selection of a subset of individuals from within a statistical population to estimate characteristics of the whole population
Techniques
Simple Random Sampling
(2.2 p12)
Systematic Sampling
(2.2 p13)
Stratified Random Sampling
(2.2 p14)
Cluster Sampling
(2.2 p15)
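The four techniques can be sketched in base R as follows. The `population` data frame, its strata, and its clusters are all made up for illustration:

```r
set.seed(1)  # arbitrary seed so the draws are reproducible
population <- data.frame(id = 1:100,
                         stratum = rep(c("A", "B"), each = 50),
                         cluster = rep(1:10, times = 10))

# Simple random sampling: every individual has the same chance of selection.
srs <- population[sample(nrow(population), 10), ]

# Systematic sampling: random start, then every k-th individual.
k <- 10
start <- sample(k, 1)
systematic <- population[seq(start, nrow(population), by = k), ]

# Stratified random sampling: draw separately within each stratum.
stratified <- do.call(rbind, lapply(split(population, population$stratum),
                                    function(s) s[sample(nrow(s), 5), ]))

# Cluster sampling: pick whole clusters at random and keep all their members.
chosen <- sample(unique(population$cluster), 2)
clustered <- population[population$cluster %in% chosen, ]
```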
Confidence Interval
(2.3 p18-28)
Standard Error
Continuous Probability Distributions
Normal Distribution
(2.3 p3-8)
Sampling Distribution
(2.2 p24)
Normality Test
(2.3 p29-31)
T-Distribution
Binomial Distribution (discrete, not continuous)
Poisson Distribution (discrete, not continuous)
Exponential Distribution
Central Limit Theorem
(2.3 p11-17)
Sampling distribution: a probability distribution of a statistic obtained from a large number of samples drawn from a specific population
Drawing multiple samples and computing the statistic for each creates the sampling distribution
Central Limit Theorem: the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, even if the population distribution is not normal.
As a rule of thumb, the approximation is considered adequate for sample sizes over 30.
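A short simulation can make this visible. It assumes an exponential population (skewed, with mean 1 and SD 1); the sample size and number of repetitions are arbitrary choices:

```r
set.seed(42)  # arbitrary seed for reproducibility

# Draw one sample of size n from the skewed population and return its mean.
sample_mean <- function(n) mean(rexp(n, rate = 1))

# Repeat 5000 times to approximate the sampling distribution of the mean.
n <- 40
means <- replicate(5000, sample_mean(n))

mean(means)   # close to the population mean, 1
sd(means)     # close to 1 / sqrt(n), the standard error
hist(means, main = "Sampling distribution of the mean (n = 40)")
```

The histogram should look roughly bell-shaped even though the population itself is strongly right-skewed.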
Confidence Interval = (point estimate) +/- (critical value)*(standard error)
where:
critical value from Z-table (Normal distribution table) if population SD is known and sample size > 30
critical value from t-table if population SD is unknown or sample size < 30
Confidence interval: a range of values that is likely to contain a population parameter at a given confidence level
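A worked instance of the formula for a mean, assuming a small made-up sample with unknown population SD (so the critical value comes from the t-table):

```r
x <- c(12.1, 11.8, 12.6, 12.3, 11.9, 12.4, 12.0, 12.2)  # made-up measurements
n <- length(x)
conf_level <- 0.95

point_estimate <- mean(x)
standard_error <- sd(x) / sqrt(n)

# Two-sided critical value from the t-distribution with n - 1 degrees of freedom.
critical_value <- qt(1 - (1 - conf_level) / 2, df = n - 1)

ci <- point_estimate + c(-1, 1) * critical_value * standard_error
ci

# Cross-check against R's built-in one-sample t interval.
t.test(x, conf.level = conf_level)$conf.int
```

With a known population SD and a large sample, qnorm() would replace qt() for the critical value.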
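Confidence Level
Proportion of confidence intervals, built the same way from repeated samples, that would contain the population parameter (e.g. 95%); the Lower and Upper Confidence Limits are the two ends of the interval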
Mean
(2.2 p22-25)
Standard Error = SD / sqrt(n)
where:
SD = standard deviation
n = sample size
Proportion
Standard Error = sqrt( ((P)(1-P)) / n)
where:
P = population proportion
n = sample size
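Both standard error formulas translate directly into R. The helper names `se_mean` and `se_prop` are hypothetical, and the inputs are made up:

```r
# Standard error of the mean: SD / sqrt(n), using the sample SD.
se_mean <- function(x) sd(x) / sqrt(length(x))

# Standard error of a proportion: sqrt(P(1 - P) / n).
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

se_mean(c(2, 4, 4, 4, 5, 5, 7, 9))
se_prop(0.5, 100)   # 0.05
```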
Normality test: determine whether sample data has been drawn from a normally distributed population
Use histogram and Q-Q plot to conduct informal Normality Test
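An informal check might look like this, assuming simulated data in place of a real sample:

```r
set.seed(7)  # arbitrary seed
x <- rnorm(200, mean = 50, sd = 10)   # simulated sample, stands in for real data

hist(x, main = "Histogram")   # roughly bell-shaped if the data are normal
qq <- qqnorm(x)               # points close to a straight line suggest normality
qqline(x)                     # reference line through the quartiles
```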
Hypothesis Testing
Five steps
- State the hypotheses
(3.1 p15-20)
- Determine a significance level
(3.1 p16-23)
- Identify the test statistic
- Reject or fail to reject the null hypothesis
- Interpret the results
Decision Errors
(3.1 p22-23)
Type I error
(False Positive)
Reject the null hypothesis when it is actually true
Probability of a Type I error = alpha, α (significance level)
Type II error
(False Negative)
Fail to reject the null hypothesis when it is actually false
Probability of a Type II error = Beta, β
Example
Null: No COVID
Alternate: COVID positive
Type I error: Tested COVID positive (reject null) when no COVID
Type II error: Tested negative (do not reject null) when actually COVID positive
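The two error rates can be estimated by simulation. This sketch assumes a one-sample t-test, a sample size of 25, and a true mean of 0.5 for the "null is false" case; all three are arbitrary choices:

```r
set.seed(123)  # arbitrary seed
alpha <- 0.05

# Null is true (population mean really is 0): any rejection is a Type I error.
p_null_true <- replicate(2000, t.test(rnorm(25, mean = 0), mu = 0)$p.value)
type1_rate <- mean(p_null_true < alpha)

# Null is false (population mean is 0.5): any non-rejection is a Type II error.
p_null_false <- replicate(2000, t.test(rnorm(25, mean = 0.5), mu = 0)$p.value)
type2_rate <- mean(p_null_false >= alpha)

type1_rate   # should land near alpha
type2_rate   # an estimate of beta for this sample size and effect
```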
Null Hypothesis
A precise and testable statement
Assumes no relationship between the variables
Alternative Hypothesis
A statement that is being tested against the null hypothesis
Significance level (α): probability of rejecting the null hypothesis when it is true
Variable of interest (y) is Continuous or Categorical?
Continuous
Categorical or Proportional
1 Sample
(3.1 p24-53)
Population SD known and sample size > 30
Single Sample Z-Test
Population SD unknown or sample size < 30
Single Sample T-Test
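A sketch of both single sample tests on simulated data. Base R has no built-in z-test, so `z_test` below is a hypothetical helper, and `sigma = 0.3` is an assumed known population SD:

```r
set.seed(5)  # arbitrary seed
x <- rnorm(40, mean = 5.2, sd = 0.3)   # simulated sample, n > 30
mu0 <- 5                               # hypothesised population mean

# Single sample t-test: population SD treated as unknown.
t_res <- t.test(x, mu = mu0)
t_res$p.value

# Single sample z-test: uses the known population SD instead of the sample SD.
z_test <- function(x, mu0, sigma) {
  z <- (mean(x) - mu0) / (sigma / sqrt(length(x)))
  c(z = z, p_value = 2 * pnorm(-abs(z)))   # two-tailed p-value
}
z_res <- z_test(x, mu0, sigma = 0.3)
z_res
```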
2 Samples
(3.1 p54)
One-Tailed and Two-Tailed Tests
One-Tailed
Hypothesis involves making a “greater than” or “less than” statement
Two-Tailed
Hypothesis involves making an “equal to” or “not equal to” statement
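In R's t.test() the choice is made with the `alternative` argument. The sample below is made up:

```r
x <- c(5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.0, 5.7, 5.3, 5.4)  # made-up sample

two_tailed <- t.test(x, mu = 5, alternative = "two.sided")  # H1: mean != 5
one_tailed <- t.test(x, mu = 5, alternative = "greater")    # H1: mean > 5

two_tailed$p.value
one_tailed$p.value
```

Because the sample mean lies above 5, the one-tailed p-value here is half the two-tailed one.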
Independent Samples
(3.1 p55-59)
Population SD known and sample size > 30
2 Samples Z-Test
Population SD unknown or sample size < 30
2 Samples T-Test
Paired Samples
(3.1 p60-62)
Population SD known and sample size > 30
Paired Z-Test
Population SD unknown or sample size < 30
Paired T-Test
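The t-test variants for two samples, sketched on made-up before/after measurements:

```r
before <- c(72, 68, 75, 80, 66, 71, 77, 69)   # made-up paired measurements
after  <- c(70, 65, 74, 76, 66, 68, 75, 66)

# Independent samples: the two groups are treated as unrelated
# (t.test() uses the Welch version by default).
independent <- t.test(before, after)

# Paired samples: each "before" value is matched with its own "after" value.
paired <- t.test(before, after, paired = TRUE)

# A paired t-test is a single sample t-test on the differences.
on_diffs <- t.test(before - after, mu = 0)

independent$p.value
paired$p.value
```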
Many Independent Samples
1 Independent Variable
One-way ANOVA
(3.2 p9-28)
2 Independent Variables
Two-way ANOVA
(3.2 p29-43)
Hypotheses
Null
All the population means are equal
Alternate
At least one population mean is different from the rest
Multiple Comparison Tests
(3.2 p24-27)
To determine which population means are different
- Tukey HSD
- Bonferroni Method
Two-way ANOVA gives 3 p-values: 2 for the independent variables and 1 for the interaction between them
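A sketch of both ANOVA types and the two multiple comparison methods, using the built-in PlantGrowth and ToothGrowth data. The dataset choice is an assumption, not taken from the course material:

```r
# One-way ANOVA: one independent variable (group).
one_way <- aov(weight ~ group, data = PlantGrowth)
summary(one_way)

# Multiple comparison tests to find which group means differ.
TukeyHSD(one_way)
pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                p.adjust.method = "bonferroni")

# Two-way ANOVA: two independent variables plus their interaction.
tg <- transform(ToothGrowth, dose = factor(dose))
two_way <- aov(len ~ supp * dose, data = tg)

# The three p-values, in order: supp, dose, supp:dose.
p_values <- summary(two_way)[[1]][["Pr(>F)"]][1:3]
p_values
```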
1 Sample and 2 Options
One Proportion Z-Test
(3.1 p64-66)
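The one proportion z-test can be computed by hand and cross-checked against prop.test() without continuity correction. The counts below are made up:

```r
successes <- 58   # made-up count choosing option A
n <- 100          # sample size
p0 <- 0.5         # hypothesised proportion under the null

p_hat <- successes / n
z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value <- 2 * pnorm(-abs(z))   # two-tailed

z
p_value

# prop.test() reports X-squared, which equals z^2 when correct = FALSE.
pt <- prop.test(successes, n, p = p0, correct = FALSE)
pt$p.value
```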
Regression
Variable of interest (y) is Binary
Variable of interest (y) is Continuous
Linear Regression
Steps
Data Exploration
Check correlation
(4.1 p9-20)
Build model
Evaluation of model
Do residuals analysis
Ensure variables are significant
Model testing using test data
Pearson correlation coefficient: |r| > 0.7 for the correlation to be considered strong (positive or negative)
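A quick correlation check on the built-in mtcars data (columns chosen arbitrarily for illustration):

```r
r <- cor(mtcars$wt, mtcars$mpg)   # Pearson is the default method
r
abs(r) > 0.7                      # strong by the 0.7 rule of thumb

cor.test(mtcars$wt, mtcars$mpg)   # adds a p-value and confidence interval

# Correlation matrix for several continuous variables at once.
round(cor(mtcars[, c("mpg", "wt", "hp", "disp")]), 2)
```

The sign matters for interpretation: here r is negative, so heavier cars tend to have lower mpg.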
Check Goodness of fit of model
(4.1 p28-32)
p-value for coefficients < 0.05
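These checks can be read off summary() of a fitted model. The model mpg ~ wt + hp on mtcars is only an illustrative choice:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

s$adj.r.squared                          # goodness of fit
coef_p <- s$coefficients[, "Pr(>|t|)"]   # p-value for each coefficient
coef_p

# p-value of the overall F-test: is the model better than the null
# (intercept-only) model?
f <- s$fstatistic
model_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
model_p
```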
Linearity using Residuals VS Fitted Plot
(4.1 p43)
Normality
Equal variance using Scale-Location Plot
(4.1 p45)
Outliers check using Residuals vs Leverage Plot
(4.1 p46)
Q-Q Plot
(4.1 p44)
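Shapiro-Wilk Test: p-value < 0.05 indicates the residuals are not normally distributed (p-value > 0.05 is consistent with normality)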
(4.1 p41)
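The four diagnostic plots and the Shapiro-Wilk test are available in base R. The fitted model is again an illustrative choice:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

par(mfrow = c(2, 2))    # 2 x 2 grid of plots
plot(fit, which = 1)    # Residuals vs Fitted: linearity
plot(fit, which = 2)    # Normal Q-Q: normality of residuals
plot(fit, which = 3)    # Scale-Location: equal variance
plot(fit, which = 5)    # Residuals vs Leverage: influential outliers
par(mfrow = c(1, 1))    # restore the default layout

# Formal normality test on the residuals.
sw <- shapiro.test(residuals(fit))
sw$p.value
```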
Check multicollinearity
All variables are Continuous
Correlation matrix
(4.2 p6)
VIF
(4.2 p7)
If VIF for a variable > 5, the variable is highly correlated with the other variables
If multiple variables have VIF > 5, drop the variable with the highest VIF, re-run the multiple regression, recompute the VIFs, and repeat until no VIF exceeds 5.
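VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the others. The helper `vif_manual` below is a hypothetical base-R sketch of that definition; in practice the vif() function from the car package is a common choice:

```r
# Compute the VIF of each predictor by regressing it on the remaining ones.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

predictors <- c("wt", "hp", "disp", "cyl")   # arbitrary mtcars columns
vifs <- vif_manual(mtcars, predictors)
vifs

# Candidates to drop under the VIF > 5 rule, highest first.
sort(vifs[vifs > 5], decreasing = TRUE)
```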
Ensure model is better than null model
p-value for model < 0.05
Algorithms to improve regression accuracy
Variables selection methods
Backward elimination
(4.2 p24)
Stepwise
(4.2 p24)
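Base R's step() can run backward, forward, and stepwise selection. Note that it selects by AIC rather than by coefficient p-values, which may differ from the procedure in the course material. The candidate predictors are an arbitrary choice:

```r
full <- lm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)  # all candidates
null <- lm(mpg ~ 1, data = mtcars)                             # intercept only

# Backward elimination: start from the full model and remove terms.
backward <- step(full, direction = "backward", trace = 0)

# Forward selection: start from the null model and add terms.
forward <- step(null, scope = formula(full), direction = "forward", trace = 0)

# Stepwise: terms may be added or removed at each step.
stepwise <- step(null, scope = formula(full), direction = "both", trace = 0)

formula(backward)
formula(forward)
formula(stepwise)
```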
Logistic Regression
(5.2 p30-41)
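A minimal logistic regression sketch for a binary y, using transmission type (am) in mtcars as the outcome. The predictor and the 0.5 cut-off are assumptions for illustration:

```r
# family = binomial makes glm() fit a logistic regression.
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(logit_fit)$coefficients

# Predicted probabilities of am = 1, then class labels at a 0.5 cut-off.
probs <- predict(logit_fit, type = "response")
predicted <- ifelse(probs > 0.5, 1, 0)

# Confusion matrix of actual vs predicted classes.
table(actual = mtcars$am, predicted = predicted)
```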
Eliminate variables that are highly correlated with each other, keeping the one with the highest correlation with y
Contains Categorical variables
Continuous-Categorical
Perform one-hot encoding
(5.1 p6-14)
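One way to one-hot encode a factor in base R is model.matrix(). The `colour` column is made up:

```r
df <- data.frame(colour = factor(c("red", "green", "blue", "green")))

# "- 1" removes the intercept so that every level gets its own 0/1 column.
one_hot <- model.matrix(~ colour - 1, data = df)
one_hot
```

When a factor is passed directly to lm(), R creates dummy variables automatically and drops one level as the reference.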
1 Categorical
Categorical-Categorical
Chi-Square
(5.1 p17)
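A chi-square test of association between two categorical variables, sketched on the built-in HairEyeColor table (an illustrative dataset, not from the course material):

```r
# Collapse the 3-way table over Sex to get a Hair x Eye contingency table.
hair_eye <- margin.table(HairEyeColor, c(1, 2))
hair_eye

chi <- chisq.test(hair_eye)
chi$statistic
chi$p.value   # a small p-value suggests the two variables are associated
```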
one-way ANOVA
(5.1 p16)
2 Categorical
two-way ANOVA
Do scatter plot of Y vs X
Quadratic Regression
(5.2 p10-24)
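A quadratic regression adds a squared term with I(). The data below are simulated from a known curve (y = 1 + 2x + 3x² plus noise) so the fit can be sanity-checked:

```r
set.seed(10)  # arbitrary seed
x <- seq(-3, 3, length.out = 100)
y <- 1 + 2 * x + 3 * x^2 + rnorm(100)   # simulated data with known coefficients

# I() protects x^2 so it is treated as arithmetic, not formula syntax.
quad_fit <- lm(y ~ x + I(x^2))
coef(quad_fit)   # estimates should be near 1, 2 and 3

plot(x, y)
lines(x, fitted(quad_fit))   # fitted curve over the scatter plot
```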
Check interaction effects
(5.1 p21-24)
If the predictor variables are highly correlated
Interaction effects: how y reacts to different levels or level combinations of the predictors (Xs)
Interaction is significant when the interaction p-value < 0.05
If the interaction is significant, the predictors involved must stay in the model, even if their individual p-values are not significant
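An interaction term can be checked like this, using the built-in ToothGrowth data as an illustrative example:

```r
# supp * dose expands to supp + dose + supp:dose (main effects plus interaction).
fit_int <- lm(len ~ supp * dose, data = ToothGrowth)
coefs <- summary(fit_int)$coefficients

# p-value of the interaction term.
interaction_p <- coefs["suppVC:dose", "Pr(>|t|)"]
interaction_p

# Interaction plot: non-parallel lines hint at an interaction.
with(ToothGrowth, interaction.plot(dose, supp, len))
```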
Groups must be mutually exclusive and exhaustive
Check linear relationship between X and Y
Check distribution of Y
Identify and deal with outliers
Transform if not normal
Check for outliers
If the relationship is non-linear, try transforming the variables to make it linear
(5.2 p9)
Rule of thumb: adjusted R-squared > 0.7 indicates a good model fit
Fit linear regression
(4.1 p24-27)
Split to train and test dataset
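A simple train/test split, assuming a 70/30 ratio and RMSE as the test metric (both are arbitrary choices here):

```r
set.seed(2024)  # arbitrary seed so the split is reproducible

# Randomly pick 70% of the rows for training; the rest form the test set.
train_idx <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Fit on the training data only, then evaluate on unseen test data.
fit <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
rmse
```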
using interaction plots
Forward selection
(4.2 p23)
Cross-Validation
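A k-fold cross-validation sketch in base R, assuming k = 5 and RMSE as the metric; the model is the same illustrative one used for mtcars:

```r
set.seed(99)  # arbitrary seed
k <- 5

# Assign each row to one of k folds at random.
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# For each fold: train on the other folds, test on the held-out fold.
rmse_per_fold <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit <- lm(mpg ~ wt + hp, data = train)
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
})

rmse_per_fold
mean(rmse_per_fold)   # cross-validated estimate of prediction error
```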