Regression
Introduction
Dependence technique: examines the linear relationship between a dependent variable and one or more independent variables.
Analyzing causes and making predictions in various applications.
A deterministic model represents a linear relationship between the variables, but it may not be plausible in many cases.
A probabilistic model incorporates a random error term to account for the variability in the dependent variable.
Estimating model parameters (OLS)
common method for estimating the parameters of a regression model.
minimizes the sum of squared differences between the observed and predicted values of the dependent variable
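As a sketch, OLS for a simple regression y = a + b·x can be computed in closed form; the data here are hypothetical:

```python
# Minimal OLS sketch for y = a + b*x (hypothetical data, pure Python).
def ols_fit(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Slope that minimizes the sum of squared residuals
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx  # intercept
    return a, b

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = ols_fit(x, y)  # slope close to 2, intercept close to 0
```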
Predicted Values and Residuals
Predicted values are obtained using the estimated regression model.
Residuals represent the differences between the observed and predicted values.
Residuals are useful for estimating the variance of the error term and assessing the fit of the model
Coefficient of determination (R2)
measures the proportion of variance in the dependent variable explained by the independent variable(s).
ranges from 0 to 1, higher value = a better fit of the model.
can be affected by the number of independent variables used in the model.
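A minimal sketch of R² from observed and fitted values (hypothetical numbers):

```python
# R² sketch: 1 - SS_res / SS_tot (hypothetical fitted values).
def r_squared(y, y_hat):
    my = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained
    ss_tot = sum((yi - my) ** 2 for yi in y)                  # total
    return 1 - ss_res / ss_tot

y = [2.0, 4.0, 6.0, 8.0]
y_hat = [2.5, 3.5, 6.5, 7.5]
r2 = r_squared(y, y_hat)  # 0.95 here
```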
- Model estimation
- Research design
- Validation results
- Interpretation results
- Model specification
βj: expected change in the DV when IV xj increases by 1 unit -> ceteris paribus effect
When in doubt, include the variable: an irrelevant variable only inflates standard errors, while omitting a relevant one biases the estimates
Maximize degrees of freedom => higher generalizability
Influential observations: observations whose deletion has a (large) impact on the results
Sample size is important consideration, larger samples are generally preferred for more reliable results.
can be due to errors, exceptional situations, or combinations of characteristics.
Proper handling of influential observations involves data correction, deletion, or model modification
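One way to see influence is a leave-one-out refit: drop a suspect observation and compare slopes. A sketch with hypothetical data where the last point is a deliberate outlier:

```python
# Influence sketch: refit without one observation and compare slopes
# (hypothetical data; the last point is a deliberate outlier).
def slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

x = [1, 2, 3, 4, 10]
y = [1, 2, 3, 4, 30]               # (10, 30) pulls the fitted line upward
slope_full = slope(x, y)           # 3.4 with the outlier
slope_loo = slope(x[:-1], y[:-1])  # 1.0 without it -> clearly influential
```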
- Assumptions
Linear relationship.
Constant variance of the error term
Uncorrelated error terms
Independence of the error term
No perfect multicollinearity
Normality of the error term
Durbin-Watson test: autocorrelation of residuals
Heteroscedastic error term
If violated, standard errors are biased
Violations of normality assumption may be less of a concern for large sample sizes
If violated => use the generalized least squares (GLS) estimator or non-parametric methods such as bootstrapping
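The Durbin-Watson statistic mentioned above can be sketched directly from a residual series; values near 2 suggest no autocorrelation, values toward 0 or 4 suggest positive or negative autocorrelation (hypothetical residuals):

```python
# Durbin-Watson sketch: DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).
def durbin_watson(e):
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(et ** 2 for et in e)
    return num / den

e = [1.0, -1.0, 1.0, -1.0]  # alternating residuals: negative autocorrelation
dw = durbin_watson(e)        # well above 2 here
```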
Testing fit & significance
Multicollinearity
(Adjusted) coefficient of determination (R², adjusted R²)
R2: proportion of variance in DV can be explained by IV
Adjusted R2: useful for model comparison.
F test: fit of the model (H0: R² = 0, i.e. β1 = β2 = ... = 0)
Test single effects/coefficients
t test, H0: βj = 0. If H0 is rejected => xj has a significant effect on the DV
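A sketch of the t statistic for a single slope in simple regression, on hypothetical data; |t| is compared against the t(n-2) critical value (about 3.18 at the 5% level for n = 5):

```python
# t-test sketch for one slope: t = b / SE(b),
# with SE(b) = sqrt(s^2 / Sxx) and s^2 = SSE / (n - 2).
import math

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se_b = math.sqrt(sse / (n - 2) / sxx)
t = b / se_b  # large |t| -> reject H0: slope = 0
```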
Correlation among IVs increases the standard errors of the coefficient estimates
Assess
Insignificant coefficients or coefficients with unexpected signs?
VIF should be <10 (or <5)
Deal
dropping one variable from a highly correlated set
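With only two predictors, the VIF reduces to 1/(1 - r²), where r is their correlation; a sketch with hypothetical, nearly collinear data:

```python
# VIF sketch for two predictors: VIF = 1 / (1 - R^2), where R^2 comes from
# regressing one predictor on the other (equals corr^2 with two IVs).
def vif_two(x1, x2):
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    v1 = sum((a - m1) ** 2 for a in x1)
    v2 = sum((b - m2) ** 2 for b in x2)
    r2 = cov * cov / (v1 * v2)
    return 1 / (1 - r2)

x1 = [1, 2, 3, 4]
x2 = [1.1, 1.9, 3.2, 3.8]  # nearly collinear with x1
vif = vif_two(x1, x2)       # far above the <10 rule of thumb
```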
assess the performance and generalizability
Cross-validation
splitting the data into an estimation sample (50-90% of the total) and a validation sample, estimating the model on the estimation sample, then assessing agreement and performance on the validation sample.
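A holdout split can be sketched in a few lines; 70% for estimation is a hypothetical choice within the 50-90% range above:

```python
# Holdout validation sketch: shuffle, split 70/30, fit on the estimation
# sample, then check performance on the validation sample.
import random

indices = list(range(100))   # stand-in for observation indices
random.seed(0)               # reproducible split
random.shuffle(indices)
cut = int(0.7 * len(indices))
estimation = indices[:cut]
validation = indices[cut:]
```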
Moderating effect
Dummy variable
Nominal and ordinal independent variables can be dichotomized using dummy variables; one category is chosen as the reference group for comparison.
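A dummy-coding sketch for a hypothetical 3-level nominal variable:

```python
# Dummy-coding sketch: a 3-level nominal IV becomes two 0/1 dummies,
# with one level ("low" here) as the reference category.
def dummy_code(values, levels, reference):
    cols = [lv for lv in levels if lv != reference]
    return [[1 if v == lv else 0 for lv in cols] for v in values]

x = ["low", "mid", "high", "low"]
coded = dummy_code(x, ["low", "mid", "high"], reference="low")
# reference rows are all zeros; "mid" and "high" get their own indicator
```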
Building the Interaction Term:
Moderator: a variable that affects the direction and strength of the IV-DV relationship
Metric moderator: Create the interaction term as the product of centered independent and centered moderator variables.
Dichotomous moderator: Create the interaction term as the product of the independent and moderator variables.
Include all simple effects
If the independent variable and moderator correlate, include quadratic terms to differentiate moderating and quadratic effects
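The metric-moderator case above can be sketched as centering both variables and multiplying (hypothetical data):

```python
# Interaction-term sketch for a metric moderator: center IV and moderator,
# then take their product.
def center(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

x = [1.0, 2.0, 3.0, 4.0]   # independent variable
z = [2.0, 2.0, 4.0, 4.0]   # metric moderator
xc, zc = center(x), center(z)
interaction = [a * b for a, b in zip(xc, zc)]
```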
OLS regression model
Used to quantify causality between regressors and regressand.
Types of causalities: linear and non-linear relationships.
OLS: minimizes the squared deviations of all points from the regression line
Assumptions: see slides
Model evaluation
Analysis of Variance & Correlation:
Goodness-of-fit evaluation through variance analysis.
Explained deviation should be significantly larger than non-explained deviation.
Determination coefficient (r2):
Proportion of explained deviation compared to overall deviation.
higher = better explanatory power of regression model
Analysis of variance significance test:
Compares explained deviation with non-explained deviation.
H0: explained deviation = 0.
Parameter Testing
Assessment of individual coefficients (α and β).
Hypotheses for coefficients: α = 0 and β = 0.
Significance testing to determine if coefficients are valuable.
Multiple regression model
Absence of (multi)collinearity
Adjusted coefficient of determination (adjusted r²)
Explaining variables X are mutually independent
If violated => very high variances of the β parameter estimates
Non-adjusted determination coefficient focuses on deviations only but does not take into account how many explanatory variables have been used
Adjusted coefficient takes the model size into account => Superior quality criterion in multiple linear regression models
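The penalty for model size can be sketched directly from the adjusted-R² formula, with hypothetical n and k:

```python
# Adjusted R² sketch: penalize R² for the number of explanatory variables k
# relative to the sample size n.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same raw R², more predictors -> lower adjusted value
a_small = adjusted_r2(0.90, n=30, k=2)
a_large = adjusted_r2(0.90, n=30, k=10)
```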