Regression
Simple Linear Regression
Equation: outcome = model + error
Yi = b0 + b1Xi + ei
Yi
: outcome
Model
b0/b1:
regression coefficients (parameters)
b1:
regression coefficient of IV
slope
direction/str. of relationship
b0
: y-intercept.
DV when IV = 0
Xi
: ith ppt's score on IV
ei
: residual term (not always included)
vertical dif btwn i's predicted score and actual score
represents that model will not perfectly fit data
To use:
Find line of best fit
Get estimates of slope/intercept
Plug in IV values to estimate value of DV
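A minimal sketch of these steps in Python (made-up data; variable names are hypothetical, not from the source):

```python
# Fit a simple linear regression by least squares and use it to predict the DV.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=100)                # hypothetical IV scores
y = 2.0 + 0.5 * x + rng.normal(0, 5, size=100)  # hypothetical DV = b0 + b1*X + error

# 1. Find the line of best fit -> estimates of intercept (b0) and slope (b1)
fit = stats.linregress(x, y)
b0, b1 = fit.intercept, fit.slope

# 2. Plug an IV value into the fitted equation to estimate the DV
x_new = 60
y_hat = b0 + b1 * x_new
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, predicted Y for X = {x_new}: {y_hat:.2f}")
```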
Assessing model
Finding line of best fit: Method of Least Squares
minimizes the sum of squared residuals
* look up
Goodness of Fit
F-statistic
tests ability of linear regression model to predict outcome
AKA: is model able to significantly predict outcome?
R2
% of variance in DV explained by model
effect size
Indiv. Predictors (b)
test whether IV significantly predicts DV
b
If statistically significant, IV makes signif. contribution to predicting DV
Slope; str. of relationship
good: signif. dif from 0
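A hedged sketch of assessing such a model with statsmodels (simulated data; one way to obtain the F-statistic, R², and tests of the b's):

```python
# Fit an OLS model and read off the goodness-of-fit and coefficient tests.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.0 + 0.4 * x + rng.normal(size=200)

X = sm.add_constant(x)               # adds the intercept (b0) column
model = sm.OLS(y, X).fit()           # least-squares fit

print(model.rsquared)                # R^2: % of variance in DV explained by the model
print(model.fvalue, model.f_pvalue)  # F-statistic: does the model significantly predict the outcome?
print(model.params)                  # b0 and b1
print(model.pvalues)                 # is each b significantly different from 0?
```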
DV must be continuous
Purpose: predict DV from 1+ IV's
simple: 1 IV
Multiple: 2+ IVs
Multiple Regression
Equation
y = b0 + b1X1 + b2X2 + ... + bnXn + ei
predicted from combination of all variables x respective coefficients + error
b0
: DV when all X = 0
bn
: regression coefficient of nth variable
If more than 1 IV, then plane instead of line
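A short sketch of the multiple-regression case with two hypothetical predictors (simulated data):

```python
# With two IVs the model fits a plane: y-hat = b0 + b1*X1 + b2*X2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x1 = rng.normal(size=150)
x2 = rng.normal(size=150)
y = 0.5 + 1.2 * x1 - 0.7 * x2 + rng.normal(size=150)

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()
print(model.params)   # b0, b1, b2 -- bn is the coefficient of the nth predictor
```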
Methods of Entry
selection of predictors
which, and in what order
must be based on past-research/theory
Goal: parsimonious model
accomplishes a desired level of explanation or prediction with as few predictor variables as possible
Choosing a method
Forced Entry
used when no precedents for research question
Stepwise
Concerns:
statistical significance may not match theoretical importance
overfitting
: too many predictors that don’t add much
underfitting
: missing important predictors
backward preferable to forward bc less chance of Type II error
Bottom line: limit use of stepwise methods to exploratory analysis
Hierarchical
(Best)
Pro
based on theory/research
can see unique effect of new variable on DV
Con:
takes skill
how predictors are entered into model
Stepwise
predictors selected by computer based on semi-partial correlation w/ outcome
3 methods
Forward
computer adds 1 signif. predictor at a time
Step-wise
same as forward, but removes any that become non-signif. at each step
Backward
puts all in, then removes 1 non-signif. predictor at a time
Forced Entry
experimenter enters all predictors simultaneously
controls for effects of all other variables
Hierarchical
(blockwise)
experimenter decides order
variables of most interest entered last
Purpose:
ID if "new" variables predict outcome
control for covariates
past research on some variables, but others exploratory
steps vs blocks
steps: predictors entered one by one (stepwise methods: forward, backward, stepwise)
blocks: predictors entered in groups, with forced entry within each block (hierarchical)
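A hedged sketch of hierarchical (blockwise) entry with statsmodels; the column names (age, motivation, new_iv, outcome) are made up for illustration:

```python
# Block 1: known covariates; Block 2: add the variable of most interest last,
# then compare R^2 to see its unique contribution.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "motivation": rng.normal(0, 1, n),
    "new_iv": rng.normal(0, 1, n),
})
df["outcome"] = 0.1 * df["age"] + 0.5 * df["motivation"] + 0.8 * df["new_iv"] + rng.normal(0, 1, n)

block1 = smf.ols("outcome ~ age + motivation", data=df).fit()            # covariates only
block2 = smf.ols("outcome ~ age + motivation + new_iv", data=df).fit()   # new predictor entered last

print("R^2 change:", block2.rsquared - block1.rsquared)   # unique effect of the new variable
print(anova_lm(block1, block2))                            # F test of the improvement
```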
Determining quality of model
(Bias?)
Does the model represent all observed data
VS
is it influenced by a small number of cases?
Detecting Outliers
Cue: Model won't predict score very well
Look for any case with a large residual
Determining Large residual
Standardized residuals
z-scores
assess size against universal cut-offs (e.g., cases with |z| > 3 are cause for concern)
Residuals analysis
test amount of error in a model
ID extreme cases a/o outliers
ID Influential cases
Tests of Influence
(of single case)
Cook's Distance
meas overall influence of case on model
Value >1 ~ influential case
Leverage
(aka hat values)
influence of observed value of DV on predicted values
Avg. leverage value = (k + 1)/n (k = number of predictors)
Mahalanobis Distances
distance btwn cases & means of predictors
DFBeta
dif btwn parameter estimates using all cases VS when one case is excluded
standardized DFBeta
absolute values >1 -> influential case
Does a data pt consistently influence the model?
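A sketch of how these influence statistics can be pulled from a fitted statsmodels OLS model (simulated data):

```python
# Outlier/influence diagnostics via the model's get_influence() results.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=100)

influence = sm.OLS(y, X).fit().get_influence()

std_resid = influence.resid_studentized_internal  # standardized residuals (z-scores)
cooks_d, _ = influence.cooks_distance             # Cook's distance (> 1 ~ influential case)
leverage = influence.hat_matrix_diag              # leverage (hat values)
dfbetas = influence.dfbetas                       # standardized DFBetas (|value| > 1 ~ influential)

print(np.where(cooks_d > 1)[0])                   # cases flagged by Cook's distance
```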
Generalizability
to other samples
Underlying Assumptions
Independent Errors
For any 2 obs., residual terms should be uncorrelated
Durbin-Watson test
<1 or >3 = violation of assumption
<2 = (+) correlation
>2 = (-) correlation
2 = uncorrelated
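A minimal sketch of the Durbin-Watson test on the residuals of a fitted model (simulated data):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(5)
x = rng.normal(size=100)
y = 2 + 0.5 * x + rng.normal(size=100)
model = sm.OLS(y, sm.add_constant(x)).fit()

print(durbin_watson(model.resid))  # ~2 = uncorrelated errors; <1 or >3 suggests a violation
```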
Homoscedasticity
at each level of IV, variance in residual terms should be constant
test visually
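One common visual check is a residuals-vs-fitted plot; a sketch with made-up data (the spread of residuals should stay roughly constant, with no funnel shape):

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
x = rng.normal(size=200)
y = 1 + 0.8 * x + rng.normal(size=200)
model = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(model.fittedvalues, model.resid, s=10)  # residuals vs. predicted values
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```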
Multicollinearity
Identifying Multicollinearity
Tolerance
reciprocal of VIF (1/VIF)
values < 0.1 = problem
Variance Inflation Factor (VIF)
values > 10 = problem
whether predictor has strong linear relationship w/ other predictors
correlations >.80, >.90 btwn 2 IV
only bivariate; misses subtle forms of collinearity
strong correlation btwn 2+ predictors
Assumption: there should be NO perfect multicollinearity
no perfect linear relationship btwn 2+ predictors (strong but imperfect collinearity is still a concern)
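A sketch of computing VIF and tolerance for each predictor with statsmodels (simulated predictors, one deliberately collinear):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.5, size=200)   # strongly related to x1
x3 = rng.normal(size=200)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i in range(1, X.shape[1]):                     # skip the constant column
    vif = variance_inflation_factor(X, i)
    print(f"predictor {i}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")
```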
Independence
All values of outcome variable are independent (come from separate entity)
Predictors are uncorrelated w/ "external variables"
influential variables not included in regression model
Non-zero variance
variance =/= 0
predictors should have some variation in value
Variable types
DV continuous & unbounded
ex. if variable can range from 1-10, but all values are 3-7, then variability is constrained
unbounded: no constraints on variability
all predictors quantitative OR dichotomous
Normally distributed errors
residuals in model are normally distributed
observed data doesn't need to be normally distributed; only the residuals do
test by examining histogram, p-p plots
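A sketch of those checks in Python: a histogram of the residuals plus a Q-Q plot (a close relative of the P-P plot), using simulated data:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
x = rng.normal(size=200)
y = 1 + 0.8 * x + rng.normal(size=200)
resid = sm.OLS(y, sm.add_constant(x)).fit().resid

fig, axes = plt.subplots(1, 2)
axes[0].hist(resid, bins=20)                        # should look roughly bell-shaped
sm.qqplot(resid, line="45", fit=True, ax=axes[1])   # points should fall along the line
plt.show()
```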
Linearity
relationship we are modeling is linear
test by examining scatterplots
Cross-validation
assess accuracy of model across dif samples
if predictive power drops when applied to dif sample, then model does not generalize
Methods of Cross-Validation
Adjusted R2
indicates loss of predictive power from sample to population
shrinkage
estimates how much variance in Y would be accounted for if the model had been derived from the population
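A small sketch of adjusted R² (Wherry's formula) computed by hand and compared with the value statsmodels reports; the gap between R² and adjusted R² is one estimate of shrinkage:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n, k = 100, 3                                    # n cases, k predictors
X = sm.add_constant(rng.normal(size=(n, k)))
y = X @ np.array([1.0, 0.5, -0.2, 0.3]) + rng.normal(size=n)
model = sm.OLS(y, X).fit()

r2 = model.rsquared
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # Wherry's adjusted R^2
print(r2, adj_r2, model.rsquared_adj)            # shrinkage ~ r2 - adj_r2
```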
Data Splitting
Randomly split data set, compare regression equations
Compare R2 & b in the 2 samples to see how well the model generalizes
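A sketch of data splitting with a random 50/50 split (sklearn used only for the split; data are simulated):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(10)
X = rng.normal(size=(300, 2))
y = 0.5 + 1.0 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(size=300)

X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=0)

model_a = sm.OLS(y_a, sm.add_constant(X_a)).fit()
model_b = sm.OLS(y_b, sm.add_constant(X_b)).fit()

print(model_a.rsquared, model_b.rsquared)  # similar R^2 across halves -> model generalizes
print(model_a.params, model_b.params)      # similar b's -> stable regression coefficients
```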