Regression
Introduction
Dependence technique: examines the linear relationship between a dependent variable and one or more independent variables.
Analyzing causes and making predictions in various applications.
A deterministic model represents a linear relationship between the variables, but it may not be plausible in many cases.
A probabilistic model incorporates a random error term to account for the variability in the dependent variable.
Estimating model parameters (OLS)
common method for estimating the parameters of a regression model.
minimizes the sum of squared differences between the observed and predicted values of the dependent variable
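As a sketch, OLS for a simple regression y = a + b·x can be computed in closed form; the data here are hypothetical:

```python
# Minimal OLS sketch for y = a + b*x (hypothetical data, pure Python).
def ols_fit(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Slope that minimizes the sum of squared residuals
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx  # intercept
    return a, b

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = ols_fit(x, y)  # slope close to 2, intercept close to 0
```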
Predicted Values and Residuals
Predicted values are obtained using the estimated regression model.
Residuals represent the differences between the observed and predicted values.
Residuals are useful for estimating the variance of the error term and assessing the fit of the model
Coefficient of determination (R2)
measures the proportion of variance in the dependent variable explained by the independent variable(s).
ranges from 0 to 1, higher value = a better fit of the model.
can be affected by the number of independent variables used in the model.
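A minimal sketch of R² from observed and fitted values (hypothetical numbers):

```python
# R² sketch: 1 - SS_res / SS_tot (hypothetical fitted values).
def r_squared(y, y_hat):
    my = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained
    ss_tot = sum((yi - my) ** 2 for yi in y)                  # total
    return 1 - ss_res / ss_tot

y = [2.0, 4.0, 6.0, 8.0]
y_hat = [2.5, 3.5, 6.5, 7.5]
r2 = r_squared(y, y_hat)  # 0.95 here
```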
- Model estimation
- Research design
- Validation results
- Interpretation results
- Model specification
βj: expected change in the DV when IV xj increases by 1 unit -> ceteris paribus effect
When in doubt, include the variable: an irrelevant variable only inflates standard errors, while omitting a relevant one biases the estimates
Maximize degrees of freedom => higher generalizability
Influential observations: observations whose deletion has a (large) impact on the results
Sample size is important consideration, larger samples are generally preferred for more reliable results.
can be due to errors, exceptional situations, or combinations of characteristics.
Proper handling of influential observations involves data correction, deletion, or model modification
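One way to see influence is a leave-one-out refit: drop a suspect observation and compare slopes. A sketch with hypothetical data where the last point is a deliberate outlier:

```python
# Influence sketch: refit without one observation and compare slopes
# (hypothetical data; the last point is a deliberate outlier).
def slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

x = [1, 2, 3, 4, 10]
y = [1, 2, 3, 4, 30]               # (10, 30) pulls the fitted line upward
slope_full = slope(x, y)           # 3.4 with the outlier
slope_loo = slope(x[:-1], y[:-1])  # 1.0 without it -> clearly influential
```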
- Assumptions
Linear relationship.
Constant variance of the error term
Uncorrelated error terms
Independence of the error term
No perfect multicollinearity
Normality of the error term
Durbin-Watson test: autocorrelation of residuals
Heteroscedastic error term
If violated, standard errors are biased
Violations of normality assumption may be less of a concern for large sample sizes
If violated => use the generalized least squares (GLS) estimator or non-parametric methods such as bootstrapping
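The Durbin-Watson statistic mentioned above can be sketched directly from a residual series; values near 2 suggest no autocorrelation, values toward 0 or 4 suggest positive or negative autocorrelation (hypothetical residuals):

```python
# Durbin-Watson sketch: DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).
def durbin_watson(e):
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(et ** 2 for et in e)
    return num / den

e = [1.0, -1.0, 1.0, -1.0]  # alternating residuals: negative autocorrelation
dw = durbin_watson(e)        # well above 2 here
```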
Testing fit & significance
Multicollinearity
(Adjusted) coefficient of determination (R², adjusted R²)
R2: proportion of variance in DV can be explained by IV
Adjusted R2: useful for model comparison.
F test: fit of the model (H0: R² = 0, i.e. β1 = β2 = ... = 0)
Test single effects/coefficients
t test, H0: βj = 0. If H0 is rejected => xj has a significant effect on the DV
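A sketch of the t statistic for a single slope in simple regression, on hypothetical data; |t| is compared against the t(n-2) critical value (about 3.18 at the 5% level for n = 5):

```python
# t-test sketch for one slope: t = b / SE(b),
# with SE(b) = sqrt(s^2 / Sxx) and s^2 = SSE / (n - 2).
import math

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se_b = math.sqrt(sse / (n - 2) / sxx)
t = b / se_b  # large |t| -> reject H0: slope = 0
```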
Correlation among IVs increases the standard errors of the coefficient estimates
Assess
Insignificant coefficients or coefficients with unexpected signs?
VIF should be <10 (or <5)
Deal
dropping one variable from a highly correlated set
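With only two predictors, the VIF reduces to 1/(1 - r²), where r is their correlation; a sketch with hypothetical, nearly collinear data:

```python
# VIF sketch for two predictors: VIF = 1 / (1 - R^2), where R^2 comes from
# regressing one predictor on the other (equals corr^2 with two IVs).
def vif_two(x1, x2):
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    v1 = sum((a - m1) ** 2 for a in x1)
    v2 = sum((b - m2) ** 2 for b in x2)
    r2 = cov * cov / (v1 * v2)
    return 1 / (1 - r2)

x1 = [1, 2, 3, 4]
x2 = [1.1, 1.9, 3.2, 3.8]  # nearly collinear with x1
vif = vif_two(x1, x2)       # far above the <10 rule of thumb
```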
assess the performance and generalizability
Cross-validation
splitting the data into an estimation sample (50-90% of the total) and a validation sample, estimating the model on the estimation sample, then assessing agreement and performance on the validation sample.
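A holdout split can be sketched in a few lines; 70% for estimation is a hypothetical choice within the 50-90% range above:

```python
# Holdout validation sketch: shuffle, split 70/30, fit on the estimation
# sample, then check performance on the validation sample.
import random

indices = list(range(100))   # stand-in for observation indices
random.seed(0)               # reproducible split
random.shuffle(indices)
cut = int(0.7 * len(indices))
estimation = indices[:cut]
validation = indices[cut:]
```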
Moderating effect
Dummy variable
Nominal and ordinal independent variables can be dichotomized using dummy variables; one category is chosen as the reference group for comparison.
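A dummy-coding sketch for a hypothetical 3-level nominal variable:

```python
# Dummy-coding sketch: a 3-level nominal IV becomes two 0/1 dummies,
# with one level ("low" here) as the reference category.
def dummy_code(values, levels, reference):
    cols = [lv for lv in levels if lv != reference]
    return [[1 if v == lv else 0 for lv in cols] for v in values]

x = ["low", "mid", "high", "low"]
coded = dummy_code(x, ["low", "mid", "high"], reference="low")
# reference rows are all zeros; "mid" and "high" get their own indicator
```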
Building the Interaction Term:
Moderator: a variable that affects the direction and strength of the IV-DV relationship
Metric moderator: Create the interaction term as the product of centered independent and centered moderator variables.
Dichotomous moderator: Create the interaction term as the product of the independent and moderator variables.
Include all simple effects
If the independent variable and moderator correlate, include quadratic terms to differentiate moderating and quadratic effects
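The metric-moderator case above can be sketched as centering both variables and multiplying (hypothetical data):

```python
# Interaction-term sketch for a metric moderator: center IV and moderator,
# then take their product.
def center(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

x = [1.0, 2.0, 3.0, 4.0]   # independent variable
z = [2.0, 2.0, 4.0, 4.0]   # metric moderator
xc, zc = center(x), center(z)
interaction = [a * b for a, b in zip(xc, zc)]
```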
OLS regression model
Used to quantify causality between regressors and regressand.
Types of causalities: linear and non-linear relationships.
OLS: minimizes the squared deviations of all points from the regression line
Assumptions: see slides
Model evaluation
Analysis of Variance & Correlation:
Goodness-of-fit evaluation through variance analysis.
Explained deviation should be significantly larger than non-explained deviation.
Determination coefficient (r2):
Proportion of explained deviation compared to overall deviation.
higher = better explanatory power of regression model
Analysis of variance significance test:
Compares explained deviation with non-explained deviation.
H0: explained deviation = 0.
Parameter Testing
Assessment of individual coefficients (α and β).
Hypotheses for coefficients: α = 0 and β = 0.
Significance testing to determine if coefficients are valuable.
Multiple regression model
Absence of (multi)collinearity
Adjusted coefficient of determination (adjusted r²)
Explaining variables X are mutually independent
If violated => very high variances of the β parameter estimates
Non-adjusted determination coefficient focuses on deviations only but does not take into account how many explanatory variables have been used
Adjusted coefficient takes the model size into account => Superior quality criterion in multiple linear regression models
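The penalty for model size can be sketched directly from the adjusted-R² formula, with hypothetical n and k:

```python
# Adjusted R² sketch: penalize R² for the number of explanatory variables k
# relative to the sample size n.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same raw R², more predictors -> lower adjusted value
a_small = adjusted_r2(0.90, n=30, k=2)
a_large = adjusted_r2(0.90, n=30, k=10)
```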