Quantitative Methods (Regression)
A dependent variable…
- A dependent variable measures an outcome of a study (also called a response variable).
- An independent variable explains changes in the dependent variable based
on a theory (also called an explanatory variable).
- A regression model can include 1 or more independent variables of any type:
numerical (e.g. number of workers) or
categorical (e.g. gender).
- A linear regression model always has 1 numerical dependent variable.
- A regression line is a straight line that describes how a dependent variable
y changes as an independent variable x changes.
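As a minimal sketch of how such a line can be obtained, the block below fits y = a + b·x under the usual least-squares criterion. The notes do not name the fitting criterion, so least squares is an assumption here, and the data (number of workers versus output) are invented purely for illustration.

```python
# Hypothetical data (not from the notes): x = number of workers, y = output.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

The slope is the estimated change in y for a one-unit increase in x; the intercept is the fitted value of y at x = 0.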
Association is not causation
Regression is strongly affected by outliers
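A small illustration of this sensitivity, using invented data and assuming a least-squares fit: five points lie exactly on y = x, and a single extreme observation is then added.

```python
def ls_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data: five points exactly on the line y = x.
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]
slope_clean = ls_slope(xs, ys)

# Add one outlier at (6, -10) and refit.
slope_outlier = ls_slope(xs + [6], ys + [-10])

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_outlier:.2f}")
```

In this made-up example one observation out of six is enough to flip the sign of the slope.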
R²: a measure of goodness of fit
R² is the percentage of the variation of the dependent variable that is explained
by the model: R² is always between 0 and 1 (100%).
R² is often used as a measure for the quality of the model: “if R² is large, then
the model is good”. However: we will emphasize later that other aspects (such
as multicollinearity or omitted variables) are even more important to evaluate
the quality of a model.
In a regression with one independent variable, R² is the square of the correlation r.
– If r is close to 0 (weak relation) then R² is close to 0.
– If r is close to -1 or close to +1 (strong relation) then R² is close to 1.
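The relation between R² and r can be checked numerically for a regression with a single independent variable. The sketch below uses invented data and computes R² as 1 minus the ratio of unexplained to total variation, then compares it with r².

```python
import math

# Hypothetical data for a regression with one independent variable.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)  # total variation of y
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Correlation coefficient r.
r = sxy / math.sqrt(sxx * syy)

# Least-squares fit and the variation it leaves unexplained.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# R² = share of the variation of y explained by the model.
r_squared = 1 - ss_res / syy

print(f"r = {r:.4f}, r squared = {r ** 2:.4f}, R squared = {r_squared:.4f}")
```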
Topics in regression analysis
- Inference in regression analysis
- Significance tests
- (a) t-tests are used to test the value of one coefficient. They are important because they test whether the effect of X on Y is significant.
- (b) F-tests are used to test the values of several coefficients at the same time. The most common applications of F-tests are:
– Testing whether all the slope coefficients are zero (implying a worthless model). This tests the quality of the model as a whole.
– Testing the significance of a categorical variable. Categorical variables are included via dummy coding, so several coefficients must be tested together.
– Testing for seasonal patterns in time series (see lesson 14).
- Interaction terms
- Omitted variable bias
- Detecting problems in regression analysis
- Logistic regression
- Time Series Analysis
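To make the significance tests in the list above concrete, here is a rough sketch for the simplest case: one independent variable, invented data, and the standard textbook formulas for the t-statistic of the slope and the F-statistic of the whole model. The critical value 2.306 is the two-sided 5% value of the t-distribution with 8 degrees of freedom taken from a standard t-table; in practice statistical software reports a p-value instead.

```python
import math

# Hypothetical data: 10 observations, one independent variable.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

n = len(xs)
k = 1  # number of independent variables
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
ss_tot = sum((y - mean_y) ** 2 for y in ys)

slope = sxy / sxx
intercept = mean_y - slope * mean_x
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# t-test: is the single slope coefficient different from zero?
residual_variance = ss_res / (n - k - 1)
se_slope = math.sqrt(residual_variance / sxx)
t_stat = slope / se_slope

T_CRIT = 2.306  # two-sided 5% critical value, t-distribution with 8 df (table value)
significant = abs(t_stat) > T_CRIT

# F-test: are all slope coefficients zero at the same time?
f_stat = ((ss_tot - ss_res) / k) / (ss_res / (n - k - 1))

print(f"t = {t_stat:.1f}, significant at 5%: {significant}")
print(f"F = {f_stat:.1f}")
```

With only one independent variable the two tests carry the same information, since F equals t². They differ once the model has several coefficients to test jointly.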
Omitted variables are relevant independent variables that you do not include in the regression model, e.g. because you forgot the variable or because you have no data about it.
Interaction terms are used to model a situation in which the effect of X1 on Y depends on the value of X2.
Collinearity (or multicollinearity) is the problem resulting from too much correlation between the independent variables in a regression model.
Heteroscedasticity is the situation where the variance of the error term is not constant.
In logistic regression we model the probability p that the dependent dummy variable Y takes the value 1 for observation i.
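As an illustration, the standard logistic model with one independent variable writes this probability as p = 1 / (1 + e^-(b0 + b1·x)). The sketch below evaluates that formula; the coefficients b0 and b1 are made-up values, not estimates from any data, and the function name is hypothetical.

```python
import math


def logistic_probability(x, b0, b1):
    """P(Y = 1) for a value x under a logistic model with coefficients b0, b1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


# Hypothetical coefficients, chosen only for illustration.
B0, B1 = -4.0, 0.1

x_values = [20, 40, 60]
probabilities = [logistic_probability(x, B0, B1) for x in x_values]

for x, p in zip(x_values, probabilities):
    print(f"x = {x}: p = {p:.3f}")
```

Whatever the value of x, the result stays strictly between 0 and 1, which is why this form is used for a dummy dependent variable instead of a straight line.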
- The mean x̄ and standard deviation s are calculated from actual observations.
- The mean μ (Greek letter ‘mu’) and standard deviation σ (‘sigma’) are calculated from idealized observations.
The entire group that we want information about is called the population.
A sample is a part of the population that we examine in order to gather information.
- A statistic: a number that describes a sample (e.g. x̄, s, …)
- A parameter: a number that describes the population (e.g. μ, σ, ...)
The standard normal distribution is the normal distribution N(0,1) with mean 0 and standard deviation 1.
- If a variable x has a normal distribution N(μ, σ), then the standardized variable Z = (x − μ)/σ has a standard normal distribution.
- A standardized value is often called a z-score.
Interpretation: the number of standard deviations above the mean.
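Standardizing and looking up a value can be sketched as follows, with Python's `statistics.NormalDist` standing in for the printed table. The numbers (μ = 100, σ = 15, x = 130) are hypothetical. The last lines show the reverse direction: start from an area, find the z-score, then unstandardize.

```python
from statistics import NormalDist

# Hypothetical population values, chosen for illustration.
mu, sigma = 100.0, 15.0
x = 130.0

# Standardize: how many standard deviations is x above the mean?
z = (x - mu) / sigma

# "Use the table": the area to the LEFT of z under N(0, 1).
standard_normal = NormalDist(0, 1)
area_left = standard_normal.cdf(z)
area_right = 1 - area_left  # area to the right follows by subtraction

print(f"z = {z:.2f}, area left = {area_left:.4f}, area right = {area_right:.4f}")

# Inverse calculation: which x has 90% of the distribution to its left?
z_90 = standard_normal.inv_cdf(0.90)  # table lookup in reverse
x_90 = mu + z_90 * sigma              # unstandardize

print(f"z for 90% = {z_90:.4f}, x = {x_90:.2f}")
```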
NB: The table always gives values to the left.
Normal calculations
Step 1. Standardize
Step 2. Use the table
Inverse Normal calculations
Step 1. Use the table
Step 2. Unstandardize
The shape of the distribution of x̄, the mean of a random sample
- If the population has a normal distribution, then the sample mean x̄ is also exactly normal.
- But even if the population distribution is not normal, x̄ will be approximately normal if the sample size is large (e.g. a uniform distribution with a sample size > 30).
This famous fact is called the Central Limit Theorem.
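A quick simulation can make this plausible. The sketch below draws many samples of size 40 from a uniform distribution on [0, 1] (clearly not normal) and looks at the resulting sample means. The sample size, number of samples, and random seed are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

SAMPLE_SIZE = 40    # observations per sample
NUM_SAMPLES = 5000  # number of samples drawn

# Draw many samples from a uniform population and keep each sample mean.
sample_means = [
    sum(random.random() for _ in range(SAMPLE_SIZE)) / SAMPLE_SIZE
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Theory: uniform(0, 1) has mean 0.5 and sd sqrt(1/12),
# so the sample mean has sd sqrt(1/12) / sqrt(n).
theoretical_sd = math.sqrt(1 / 12) / math.sqrt(SAMPLE_SIZE)

# For a normal distribution about 68% of values lie within 1 sd of the mean.
share_within_1sd = sum(
    abs(m - mean_of_means) <= sd_of_means for m in sample_means
) / NUM_SAMPLES

print(f"mean of sample means: {mean_of_means:.4f}")
print(f"sd of sample means:   {sd_of_means:.4f} (theory: {theoretical_sd:.4f})")
print(f"share within 1 sd:    {share_within_1sd:.3f}")
```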
Data does not allow for meaningful
Data does allow for meaningful
Relation between a categorical and a numerical variable:
e.g. gender (male/female) and wage
• Graph: side-by-side boxplot
• Statistical numbers: compare means
Relation between two numerical variables:
e.g. age and wage
• Graph: scatterplot
• Statistical numbers: correlation coefficient
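Both numerical summaries can be computed in a few lines. The sketch below uses four invented observations of gender, age and wage; it compares mean wages by gender and computes the correlation coefficient between age and wage.

```python
import math

# Hypothetical observations: (gender, age, wage). Not real data.
records = [
    ("male", 25, 2000.0),
    ("female", 35, 2600.0),
    ("male", 45, 3100.0),
    ("female", 55, 3900.0),
]

# Categorical vs numerical: compare the mean wage per gender.
wages_by_gender = {}
for gender, _age, wage in records:
    wages_by_gender.setdefault(gender, []).append(wage)
mean_wage = {g: sum(w) / len(w) for g, w in wages_by_gender.items()}

# Numerical vs numerical: correlation coefficient between age and wage.
ages = [age for _gender, age, _wage in records]
wages = [wage for _gender, _age, wage in records]
n = len(records)
mean_age = sum(ages) / n
mean_w = sum(wages) / n
sxy = sum((a - mean_age) * (w - mean_w) for a, w in zip(ages, wages))
sxx = sum((a - mean_age) ** 2 for a in ages)
syy = sum((w - mean_w) ** 2 for w in wages)
r = sxy / math.sqrt(sxx * syy)

print(mean_wage)
print(f"correlation between age and wage: r = {r:.4f}")
```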