OLS linear regression is appropriate when

  1. The scientific question(s) have data which are
    A. One numeric response variable
    B. At least one explanatory** variable of either of the following types:
      • numeric
      • categorical with more than 2 categories✏
  2. We are interested in the mean response
  3. Observations are not correlated
  4. We can find a model that is linear in the parameters to fit the data

**Additional explanatory variables may be of any type.
✏ANOVA is a special case of linear regression.

We want to test a hypothesis, estimate a parameter, or describe a relationship

data was collected from an

observational study

Determine any confounders and precision variables for the relationship between the explanatory variable of interest and the response variable

experiment

Plot numeric explanatory variables against the response as a "pre-check" for functional form, obvious groups, extreme outliers, etc. Create side-by-side boxplots or stacked histograms for categorical explanatory variables. Create a "Table 1" with summary statistics for the variables you will use in your regression.
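The pre-check above could be sketched in base R roughly as follows. The built-in `mtcars` data set and the columns `mpg`, `wt`, and `cyl` are stand-ins chosen for illustration only; they are not part of the original flowchart.

```r
# Illustrative only: mtcars stands in for your own data.
dat <- mtcars
dat$cyl <- factor(dat$cyl)  # treat number of cylinders as categorical

# Numeric explanatory variable against the response
plot(dat$wt, dat$mpg, xlab = "wt", ylab = "mpg")

# Side-by-side boxplots for a categorical explanatory variable
boxplot(mpg ~ cyl, data = dat, xlab = "cyl", ylab = "mpg")

# A bare-bones "Table 1": numeric summaries plus counts per category
table1_numeric <- sapply(
  dat[, c("mpg", "wt")],
  function(v) c(mean = mean(v), sd = sd(v), min = min(v), max = max(v))
)
table1_counts <- table(dat$cyl)

print(round(table1_numeric, 2))
print(table1_counts)
```

A published "Table 1" is usually more elaborate (for example, stratified by the explanatory variable of interest), so treat this as a starting point.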

adjust model and fit again

check for non-constant variance

check functional form

Pick an initial model to answer the question(s). Do the scientific questions require an interaction? Be sure to include confounders and precision variables.

fit model

\(\hat{\beta} = (X^T X)^{-1} X^T Y\)

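lm(yvar ~ xvar1 + xvar2 + ... + xvark, data = datasetname)
For an interaction, replace + with * (for example, xvar1 * xvar2 fits both main effects plus the xvar1:xvar2 interaction)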
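As a small check on the fitting step, the sketch below fits an additive and an interaction model with `lm()` and then recomputes the additive coefficients from the closed-form OLS estimator. The `mtcars` data and the variables `mpg`, `wt`, and `hp` are assumptions made for illustration. Note that `lm()` itself uses a QR decomposition rather than inverting \(X^T X\), so the two results should agree only up to numerical error.

```r
# Illustrative only: mtcars and its columns stand in for your own data.
fit_add <- lm(mpg ~ wt + hp, data = mtcars)  # main effects only
fit_int <- lm(mpg ~ wt * hp, data = mtcars)  # main effects plus wt:hp

# Closed-form estimate from the design matrix of the additive model
X <- model.matrix(fit_add)
Y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)

# The two columns should match up to numerical error
print(cbind(normal_eq = drop(beta_hat), lm = coef(fit_add)))

# The interaction model carries one extra coefficient
print(names(coef(fit_int)))
```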

prediction is goal

Create univariate graphs of predictors and response. Try to correct functional form and/or non-constant variance. Note any highly correlated predictors.

Create new covariates for interactions and/or functions of variables if necessary
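One way to create such covariates in R is sketched below, again using `mtcars` as a stand-in data set. The column names `log_hp`, `wt_sq`, and `wt_x_hp` are hypothetical names introduced for this example.

```r
# Illustrative only: mtcars stands in for your own data.
dat <- mtcars
dat$log_hp  <- log(dat$hp)       # function of a variable
dat$wt_sq   <- dat$wt^2          # squared term
dat$wt_x_hp <- dat$wt * dat$hp   # explicit interaction column

# An explicit product column gives the same fit as the * shorthand
fit_manual  <- lm(mpg ~ wt + hp + wt_x_hp, data = dat)
fit_formula <- lm(mpg ~ wt * hp, data = dat)

# Transformations can also be written inside the formula;
# I() protects arithmetic such as ^ from formula syntax
fit_inline <- lm(mpg ~ wt + I(wt^2) + log(hp), data = dat)
```

Building the columns explicitly is handy for the univariate graphs, while the in-formula versions keep `predict()` working on untransformed new data.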

Search and assess models
(assuming the goal is to minimize out-of-sample MSE)

Ways to assess model

In sample estimates

Adjusted \(R^2\)

Mallows' \(C_p\)
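The in-sample criteria can be computed in base R along the following lines. The candidate and full models on `mtcars` are assumptions for illustration, and `mallows_cp()` is a hypothetical helper written for this sketch (base R has no function of that name). It uses the common form \(C_p = SSE_p / \hat{\sigma}^2_{full} - n + 2p\), where \(p\) counts the intercept.

```r
# Illustrative only: mtcars stands in for your own data.
full  <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
small <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical helper: Mallows' C_p for a candidate model,
# using the error variance estimated from the full model
mallows_cp <- function(candidate, full) {
  n <- nobs(candidate)
  p <- length(coef(candidate))          # parameters, intercept included
  sigma2_full <- summary(full)$sigma^2
  sum(resid(candidate)^2) / sigma2_full - n + 2 * p
}

criteria <- c(
  adj_r2 = summary(small)$adj.r.squared,
  cp     = mallows_cp(small, full),
  aic    = AIC(small),
  bic    = BIC(small)
)
print(criteria)
```

By construction, the full model's own \(C_p\) equals its number of parameters, which makes a quick sanity check on the helper.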

Ways to search possible models

Stepwise fwd/backward

Best subsets
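A stepwise search can be sketched with `step()` from base R's stats package. The starting models below are assumptions for illustration. Best subsets is not shown because it usually relies on an add-on package rather than base R.

```r
# Illustrative only: mtcars stands in for your own data.
full <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# Backward elimination from the full model (AIC penalty by default)
backward <- step(full, direction = "backward", trace = 0)

# Forward selection from the intercept-only model, up to the full model
forward <- step(null, scope = formula(full), direction = "forward", trace = 0)

# Setting k = log(n) switches the penalty from AIC to BIC
backward_bic <- step(full, direction = "backward",
                     k = log(nobs(full)), trace = 0)

print(formula(backward))
print(formula(forward))
print(formula(backward_bic))
```

Forward and backward searches need not land on the same model, which is one reason to assess the chosen model out of sample as well.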

BIC

hypothesis generation

Use a combination of predictive and inferential methods to look for explanatory variables that might be related to the response. You cannot draw valid conclusions, because the hypotheses were not specified a priori. Further data collection and testing are needed to confirm any findings.

Prediction interval, confidence interval for the mean response
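The two kinds of interval can be compared with `predict()`, as in the sketch below. The model and the new observation `new_car` are hypothetical and chosen only for illustration.

```r
# Illustrative only: mtcars stands in for your own data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3, hp = 120)  # hypothetical new observation

# Interval for the mean response at these covariate values
conf_int <- predict(fit, newdata = new_car,
                    interval = "confidence", level = 0.95)

# Interval for a single new observation at these covariate values
pred_int <- predict(fit, newdata = new_car,
                    interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)
```

Both intervals share the same point estimate, but the prediction interval is wider because it also accounts for the error variance of an individual observation.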

overfitting

interpretability

Do I care about intervals, interpretability, point estimates?

\(R^2\)

Do you have non-constant variance?
Plot residuals against fitted values.
Does the spread of the residuals change with the fitted values?
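A residuals-versus-fitted plot can be drawn as follows. The fitted model is an assumption for illustration.

```r
# Illustrative only: mtcars stands in for your own data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Look for a fan or cone shape: spread that grows or shrinks left to right
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # reference line at zero
```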

no

Check for correct functional form: plot residuals against continuous explanatory variables. Do you see curves?
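One way to run this check is to plot the residuals against each continuous explanatory variable and overlay a smoother, as sketched below. The model, the variable list, and the choice of `lowess()` as the smoother are assumptions for illustration.

```r
# Illustrative only: mtcars stands in for your own data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
r <- resid(fit)

for (v in c("wt", "hp")) {
  plot(mtcars[[v]], r, xlab = v, ylab = "Residuals")
  abline(h = 0, lty = 2)
  # A smoother that bends away from zero suggests a missed curve
  lines(lowess(mtcars[[v]], r), col = "red")
}
```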

yes

Try transforming x. If you are unable to correct the form or do not wish to transform, make x categorical.
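Both options can be sketched as below. The log transform and the quartile cut points are assumptions picked for illustration; the appropriate transformation or categories depend on the data and the scientific question.

```r
# Illustrative only: mtcars stands in for your own data.
dat <- mtcars

fit_linear <- lm(mpg ~ hp, data = dat)       # original scale
fit_log    <- lm(mpg ~ log(hp), data = dat)  # transformed x

# Alternative: make x categorical, here using quartile-based groups
dat$hp_cat <- cut(dat$hp,
                  breaks = quantile(dat$hp, probs = seq(0, 1, 0.25)),
                  include.lowest = TRUE)
fit_cat <- lm(mpg ~ hp_cat, data = dat)

print(table(dat$hp_cat))
print(coef(fit_cat))
```

The categorical version trades a smooth relationship for one mean shift per group, so it uses more parameters than a single transformed term.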

fit

no

Are there no outliers or influential points, or have you already fit extra models for a sensitivity analysis?

yes

was variance constant?

no

is n large?

no

D

Yes

B

yes

Are the residuals approximately normal?
Check with a Q-Q plot and/or a histogram.

no

is n large?

yes

C

no

D

yes

A

no

Remove them and refit the model(s) as a sensitivity analysis
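A sensitivity analysis of this kind might look like the sketch below. Flagging points by Cook's distance with a 4/n threshold is an assumption made here (a common rule of thumb, not something the flowchart specifies), as is the model itself.

```r
# Illustrative only: mtcars stands in for your own data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Flag potentially influential points; 4/n is a rule of thumb, not a hard cutoff
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / nobs(fit))

# Refit without the flagged rows (keep the original fit if none are flagged)
fit_sens <- if (length(flagged) > 0) {
  lm(mpg ~ wt + hp, data = mtcars[-flagged, ])
} else {
  fit
}

# Compare coefficients with and without the flagged observations
print(names(flagged))
print(cbind(all_data = coef(fit), without_flagged = coef(fit_sens)))
```

If the conclusions change materially between the two fits, report both rather than silently dropping observations.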

yes

  1. Try GLM methods if you know them.
  2. If the residual plot is cone shaped, try ln(y) to fix it.
  3. If you continue to have non-constant variance, note that and move on.
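The ln(y) option can be tried as sketched below, comparing residual plots before and after the transformation. The model is an assumption for illustration, and the response must be strictly positive for the log to be defined.

```r
# Illustrative only: mtcars stands in for your own data.
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)
fit_log <- lm(log(mpg) ~ wt + hp, data = mtcars)  # requires mpg > 0

# Compare residual spread on the two scales
op <- par(mfrow = c(1, 2))
plot(fitted(fit_raw), resid(fit_raw), main = "y",
     xlab = "Fitted values", ylab = "Residuals")
plot(fitted(fit_log), resid(fit_log), main = "ln(y)",
     xlab = "Fitted values", ylab = "Residuals")
par(op)

# On the log scale, exponentiated slopes act multiplicatively on y
print(exp(coef(fit_log)))
```

Keep in mind that after the transformation the model describes the mean of ln(y), so back-transformed estimates refer to the geometric mean of y rather than its arithmetic mean.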

Check for approximate normality of errors: qqplot of residuals, histogram of residuals
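Both normality checks are available in base R, as in the sketch below. The fitted model is an assumption for illustration.

```r
# Illustrative only: mtcars stands in for your own data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
r <- resid(fit)

# Q-Q plot: points close to the line suggest approximate normality
qqnorm(r)
qqline(r)

# Histogram: look for strong skew or heavy tails
hist(r, main = "Histogram of residuals", xlab = "Residuals")
```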

Make note of:

  1. whether you had non-constant variance
  2. sample size
  3. approximate normality of residuals.

Go Do Inference!

AIC

Cross-validation / leave-n-out + MSE
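A k-fold cross-validated MSE can be computed by hand in base R, roughly as below. The helper `cv_mse()` is hypothetical and written for this sketch, and the choice of 5 folds, the seed, and the candidate models are assumptions for illustration.

```r
# Illustrative only: mtcars stands in for your own data.
set.seed(1)
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# Hypothetical helper: average of the per-fold test MSEs for one model formula
cv_mse <- function(formula, data, folds) {
  response <- all.vars(formula)[1]
  errs <- sapply(sort(unique(folds)), function(f) {
    train <- data[folds != f, ]
    test  <- data[folds == f, ]
    fit   <- lm(formula, data = train)
    mean((test[[response]] - predict(fit, newdata = test))^2)
  })
  mean(errs)
}

cv_small <- cv_mse(mpg ~ wt + hp, mtcars, folds)
cv_large <- cv_mse(mpg ~ wt + hp + qsec + drat + disp, mtcars, folds)
print(c(small = cv_small, large = cv_large))
```

Using the same `folds` vector for every candidate keeps the comparison fair, since each model is scored on identical held-out rows.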

in sample MSE

informed guessing

Divide into validation and training sets + MSE
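A single training/validation split can be sketched as follows. The 70/30 split, the seed, and the model are assumptions for illustration.

```r
# Illustrative only: mtcars stands in for your own data.
set.seed(1)
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.7 * n))

train <- mtcars[train_idx, ]
valid <- mtcars[-train_idx, ]

# Fit on the training rows only
fit <- lm(mpg ~ wt + hp, data = train)

# In-sample MSE versus MSE on the held-out validation rows
mse_in  <- mean(resid(fit)^2)
mse_out <- mean((valid$mpg - predict(fit, newdata = valid))^2)
print(c(in_sample = mse_in, validation = mse_out))
```

With a data set this small the validation MSE depends heavily on which rows happen to be held out, which is one argument for cross-validation instead.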