Chp. 14 - Correlation & Regression
The Characteristics of a Relationship
Correlation
is a statistical technique that is used to measure and describe the relationship between two variables.
Usually the two variables in a correlational study are simply observed as they exist naturally in the environment—there is no attempt to control or manipulate the variables.
The pairs of scores can be listed in a table, or they can be presented graphically in a scatter plot, which allows you to see any patterns or trends that exist in the data.
x - value : independent variable
y - value : dependent variable
The sign of the correlation, positive or negative, describes the direction of the relationship.
In a positive correlation, the two variables tend to change in the same direction.
In a negative correlation, the two variables tend to go in opposite directions.
The Form of the Relationship.
The most common is straight line relationships or linear form.
The Strength or Consistency of the Relationship
The degree of relationship is measured by the numerical value of the correlation.
A perfect correlation is always identified by a correlation of 1.00 or −1.00 and indicates a perfectly consistent relationship. For a correlation of 1.00 (or −1.00), each change in X is accompanied by a perfectly predictable change in Y.
At the other extreme, a correlation of 0 indicates no consistency at all. For a correlation of 0, the data points are scattered randomly with no clear trend. Intermediate values between 0 and 1 indicate the degree of consistency.
A line sketched around the data points is called an envelope. It helps you see the overall trend in the data. Football shape = correlation around 0.7. Fatter than a football = correlation closer to 0. Narrower shape = correlation closer to 1.00.
The Pearson Correlation:
measures the degree and the direction of the linear relationship between two variables.
The Sum of Products of Deviations
We use sum of products, or SP, to measure the amount of covariability between two variables.
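The SP calculation can be sketched in a few lines of Python. This is only an illustration: the scores are made up, the function names are hypothetical, and the final step uses the standard definitional formula r = SP / √(SS_X · SS_Y), which these notes do not spell out.

```python
import math

def sum_of_products(xs, ys):
    """SP: sum of (X - M_X) * (Y - M_Y) over all pairs of scores."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

def sum_of_squares(values):
    """SS: sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def pearson_r(xs, ys):
    """Pearson r = SP / sqrt(SS_X * SS_Y)."""
    return sum_of_products(xs, ys) / math.sqrt(sum_of_squares(xs) * sum_of_squares(ys))

# Made-up scores, for illustration only.
xs = [0, 10, 4, 8, 8]
ys = [2, 6, 2, 4, 6]
print(sum_of_products(xs, ys), pearson_r(xs, ys))
```

For these scores SP is 28, SS_X is 64, and SS_Y is 16, so r = 28 / 32 = 0.875.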
Correlation & the Pattern of Data Points
The correlation is perfectly consistent with the pattern formed by the data points.
Positive sign = line slopes up to the right; High value for the correlation (near 1.00) indicates points are tightly clustered close to the line.
Because the Pearson correlation describes the pattern formed by the data points, any factor that does not change the pattern also does not change the correlation.
In summary, adding a constant to (or subtracting a constant from) each X and/or Y value does not change the pattern of data points and does not change the correlation. Also, multiplying (or dividing) each X or each Y value by a positive constant does not change the pattern and does not change the value of the correlation.
Multiplying by a negative constant, however, produces a mirror image of the pattern and, therefore, changes the sign of the correlation.
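A quick numerical check of these rules, as a sketch with made-up scores (the helper `pearson_r` and the constants 5, 3, and −2 are arbitrary choices for illustration):

```python
import math

def pearson_r(xs, ys):
    """Pearson r computed as SP / sqrt(SS_X * SS_Y)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ssx * ssy)

xs = [0, 10, 4, 8, 8]   # made-up scores
ys = [2, 6, 2, 4, 6]

r = pearson_r(xs, ys)
r_shifted = pearson_r([x + 5 for x in xs], ys)    # add a constant to every X
r_scaled = pearson_r([3 * x for x in xs], ys)     # multiply X by a positive constant
r_flipped = pearson_r([-2 * x for x in xs], ys)   # multiply X by a negative constant

print(r, r_shifted, r_scaled, r_flipped)
```

The first three values agree, and the last has the same magnitude with the opposite sign.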
The Pearson Correlation & z-Scores
The Pearson correlation measures the relationship between an individual’s location in the X distribution and his or her location in the Y distribution.
Z-scores identify the exact location of each individual score within a distribution. With this in mind, each X value can be transformed into a z-score, z_X, using the mean and standard deviation for the set of Xs. Similarly, each Y score can be transformed into z_Y. If the X and Y values are viewed as a sample, the transformation is completed using the sample formula for z, z = (X − M)/s, and the correlation is r = Σ(z_X · z_Y)/(n − 1).
If the X and Y values form a complete population, the z-scores are computed using the population formula, z = (X − μ)/σ, and the correlation is ρ = Σ(z_X · z_Y)/N.
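The z-score view can be sketched as follows, assuming the usual formulas r = Σ(z_X · z_Y)/(n − 1) for a sample and ρ = Σ(z_X · z_Y)/N for a population. The scores are made up and the function names are hypothetical.

```python
import statistics

def r_from_sample_z(xs, ys):
    """Sample version: z-scores use s (n - 1 in the denominator)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

def rho_from_population_z(xs, ys):
    """Population version: z-scores use sigma (N in the denominator)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / len(xs)

xs = [0, 10, 4, 8, 8]   # made-up scores
ys = [2, 6, 2, 4, 6]
r_sample = r_from_sample_z(xs, ys)
r_population = rho_from_population_z(xs, ys)
print(r_sample, r_population)
```

Both routes give the same numerical value for a given set of scores; only the bookkeeping (s with n − 1 versus σ with N) differs.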
Using & Interpreting the Pearson Correlation
Where & Why Correlations Are Used
Prediction:
If two variables are known to be related in some systematic way, it is possible to use one of the variables to make accurate predictions about the other.
Validity:
One common way to demonstrate the validity of a test is to show that its scores correlate with other established measures of the same thing.
Reliability:
A measurement procedure is considered reliable to the extent that it produces stable, consistent measurements. A reliable measurement procedure will produce the same (or nearly the same) scores when the same individuals are measured twice under the same conditions. When reliability is high, the correlation between two measurements should be strong and positive.
Theory Verification:
When a theory makes a specific prediction about the relationship between two variables, the prediction can be tested by determining the correlation between the two variables.
Interpreting Correlations
Correlation simply describes a relationship between two variables. It does not explain why the two variables are related. Specifically, a correlation should not and cannot be interpreted as proof of a cause-and-effect relationship between the two variables.
The value of a correlation can be affected greatly by the range of scores represented in the data.
One or two extreme data points, often called outliers, can have a dramatic effect on the value of a correlation.
A correlation should not be interpreted as a proportion. To describe how accurately one variable predicts the other, you must square the correlation. Thus, a correlation of r = 0.50 means that one variable partially predicts the other, but the predictable portion is only r² = 0.25 (or 25%) of the total variability.
Correlation & Causation
One of the most common errors in interpreting correlations is to assume that a correlation necessarily implies a cause-and-effect relationship between the two variables.
Although there may be a causal relationship, the simple existence of a correlation does not prove it.
To establish a cause-and-effect relationship, it is necessary to conduct a true experiment (see The Experimental Method) in which one variable is manipulated by a researcher and other variables are rigorously controlled.
Correlation & Restricted Range
Whenever a correlation is computed from scores that do not represent the full range of possible values, you should be cautious in interpreting the correlation.
The correlation within this restricted range could be completely different from the correlation that would be obtained from a full range of scores.
Outlier:
An individual with X and/or Y values that are substantially different (larger or smaller) from the values obtained for the other individuals in the data set.
The data point of a single outlier can have a dramatic influence on the value obtained for the correlation.
The problem of outliers is a good reason for looking at a scatter plot instead of simply basing your interpretation on the numerical value of the correlation. If you only “go by the numbers,” you might overlook the fact that one extreme data point inflated the size of the correlation.
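The effect of a single outlier can be seen in a small sketch. The data are invented for illustration, and `pearson_r` is a hypothetical helper using the standard SP / √(SS_X · SS_Y) formula.

```python
import math

def pearson_r(xs, ys):
    """Pearson r computed as SP / sqrt(SS_X * SS_Y)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ssx * ssy)

# Five invented points with almost no linear trend.
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 4, 2, 3]
r_without = pearson_r(xs, ys)

# The same points plus one extreme individual at (14, 12).
r_with = pearson_r(xs + [14], ys + [12])

print(round(r_without, 2), round(r_with, 2))
```

Without the extreme point the correlation is close to zero; with it, the correlation rises above 0.9, even though the other five points are unchanged.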
Correlation & the Strength of the Relationship
A correlation measures the degree of relationship between two variables on a scale from 0 to 1.00. Although this number provides a measure of the degree of relationship, the squared correlation provides a better measure of the strength of the relationship.
One of the common uses of correlation is for prediction.
In general, the squared correlation measures the gain in accuracy that is obtained from using the correlation for prediction. The squared correlation, r², measures the proportion of variability in the data that is explained by the relationship between X and Y.
The value r² is called the coefficient of determination because it measures the proportion of variability in one variable that can be determined from the relationship with the other variable. A correlation of r = 0.80 (or −0.80), for example, means that r² = 0.64 (or 64%) of the variability in the Y scores can be predicted from the relationship with X.
Hypothesis Tests with the Pearson Correlation
The Hypotheses
The basic question for this hypothesis test is whether a correlation exists in the population.
The null hypothesis is “No. There is no correlation in the population,” or “The population correlation is zero.”
The alternative hypothesis is “Yes. There is a real, nonzero correlation in the population.”
Samples are not expected to be identical to the populations from which they come; there will be some discrepancy (sampling error) between a sample statistic and the corresponding population parameter.
Specifically, you should always expect some error between a sample correlation and the population correlation it represents.
The purpose of the hypothesis test is to decide between two interpretations:
There is no correlation in the population (ρ = 0), and the sample value is the result of sampling error. Remember, a sample is not expected to be identical to the population. There always is some error between a sample statistic and the corresponding population parameter. This is the situation specified by H0.
The nonzero sample correlation accurately represents a real, nonzero correlation in the population. This is the alternative stated in H1.
The correlation from the sample will help to determine which of these two interpretations is more likely. A sample correlation near zero supports the conclusion that the population correlation is also zero. A sample correlation that is substantially different from zero supports the conclusion that there is a real, nonzero correlation in the population.
The Hypothesis Test
The hypothesis test evaluating the significance of a correlation can be conducted using either a t statistic or an F-ratio.
The Pearson correlation is generally computed for sample data. As with most sample statistics, however, a sample correlation is often used to answer questions about the corresponding population correlation.
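The t-statistic version of the test can be sketched as below, assuming the standard formula t = r / √((1 − r²)/(n − 2)) with df = n − 2. The values r = 0.875 and n = 5 are made up, the function name is hypothetical, and the critical value is copied from a standard t table rather than computed.

```python
import math

def t_for_correlation(r, n):
    """t statistic for H0: rho = 0, evaluated with df = n - 2."""
    df = n - 2
    t = r / math.sqrt((1 - r ** 2) / df)
    return t, df

# Illustrative values only: a sample correlation of 0.875 from n = 5 pairs.
t, df = t_for_correlation(0.875, 5)

# Two-tailed critical value for alpha = .05 and df = 3, from a standard t table.
T_CRITICAL = 3.182
reject_h0 = abs(t) > T_CRITICAL

print(round(t, 3), df, reject_h0)
```

Here t is about 3.13 with df = 3, which falls just short of the critical value, so even a large sample correlation is not significant with so few pairs. The equivalent F-ratio is simply t² with df = 1, n − 2.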
Alternatives to the Pearson Correlation
Correlations for ranked or dichotomous data and for relationships that are not linear
The Spearman Correlation
When the Pearson correlation formula is used with data from an ordinal scale (ranks), the result is called the Spearman correlation.
The Spearman correlation is used to measure the relationship between X and Y when both variables are measured on ordinal scales.
The Spearman correlation can be used as a valuable alternative to the Pearson correlation, even when the original raw scores are on an interval or a ratio scale.
The Spearman correlation can be used to measure the degree to which a relationship is consistently one-directional, independent of its form.
When there is a consistently one-directional relationship between two variables, the relationship is said to be monotonic. Thus, the Spearman correlation measures the degree of monotonic relationship between two variables.
Two situations in which the Spearman correlation is used:
Spearman is used when the original data are ordinal; that is, when the X and Y values are ranks. In this case, you simply apply the Pearson correlation formula to the set of ranks.
The Spearman correlation is used when a researcher wants to measure the degree to which the relationship between X and Y is consistently one directional, independent of the specific form of the relationship.
Ranking Tied Scores
Whenever two scores have exactly the same value, their ranks should also be the same.
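One common convention for tied scores, assumed here because the note does not state it, is to give each tied score the mean of the rank positions the tied scores occupy. A minimal sketch with a hypothetical helper `rank_with_ties`:

```python
def rank_with_ties(scores):
    """Rank scores from smallest (rank 1) upward; tied scores share the mean
    of the rank positions they occupy."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend `end` to cover every score tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = ((pos + 1) + (end + 1)) / 2   # positions are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

# Made-up scores with a two-way tie and a three-way tie.
ranks = rank_with_ties([3, 3, 5, 6, 6, 6, 12])
print(ranks)
```

The two 3s occupy positions 1 and 2 and both receive rank 1.5; the three 6s occupy positions 4, 5, and 6 and all receive rank 5.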
Special Formula for the Spearman Correlation
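Instead of using the Pearson formula after ranking the data, you can put the ranks directly into a simplified formula: r_s = 1 − 6ΣD² / (n(n² − 1)), where D is the difference between the X rank and the Y rank for each individual and n is the number of pairs.
However, note that this special formula should be used only after the scores have been converted to ranks and when there are no ties among the ranks.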
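As a sketch, the special formula r_s = 1 − 6ΣD² / (n(n² − 1)) can be checked against the Pearson formula applied to the same ranks. The ranks are made up and contain no ties; both function names are hypothetical.

```python
import math

def spearman_special(rank_x, rank_y):
    """r_s = 1 - 6 * sum(D^2) / (n * (n^2 - 1)); valid only for untied ranks."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

def pearson_r(xs, ys):
    """Pearson r computed as SP / sqrt(SS_X * SS_Y)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ssx * ssy)

rank_x = [1, 2, 3, 4, 5]   # made-up ranks, no ties
rank_y = [3, 1, 2, 5, 4]

rs_special = spearman_special(rank_x, rank_y)
rs_pearson = pearson_r(rank_x, rank_y)
print(rs_special, rs_pearson)
```

Here ΣD² = 8, so r_s = 1 − 48/120 = 0.6, and the Pearson formula on the ranks gives the same value.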
The Point-Biserial Correlation & Measuring Effect Size with r²
The point-biserial correlation is used to measure the relationship between two variables in situations in which one variable consists of regular, numerical scores, but the second variable has only two values.
A variable with only two values is called a dichotomous variable or a binomial variable. Examples: success vs. failure; first-born vs. later-born child; older than 30 years vs. younger than 30 years.
To compute the point-biserial correlation, the dichotomous variable is first converted to numerical values by assigning a value of zero (0) to one category and a value of one (1) to the other category.
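A minimal sketch of the coding step, using invented scores and the ordinary Pearson formula (the helper `pearson_r` and the variable names are hypothetical):

```python
import math

def pearson_r(xs, ys):
    """Pearson r computed as SP / sqrt(SS_X * SS_Y)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ssx * ssy)

# Invented data: numerical scores and a two-category variable coded 0 / 1.
scores = [1, 2, 3, 4, 5, 6]
group = [0, 0, 0, 1, 1, 1]

r_pb = pearson_r(scores, group)
effect_size = r_pb ** 2          # r squared as a measure of effect size

# Swapping which category gets 0 and which gets 1 only flips the sign.
r_swapped = pearson_r(scores, [1 - g for g in group])

print(round(r_pb, 3), round(effect_size, 3), round(r_swapped, 3))
```

Because the 0/1 assignment is arbitrary, the sign of a point-biserial correlation carries no meaning on its own; r² is unaffected by the choice.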
The Phi-Coefficient
When both variables (X and Y) measured for each individual are dichotomous, the correlation between the two variables is called the phi-coefficient.
Convert each of the dichotomous variables to numerical values by assigning a 0 to one category and a 1 to the other category for each of the variables.
Use the regular Pearson formula with the converted scores.
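The two steps above can be sketched as follows, with invented 0/1 codes for eight individuals (the helper `pearson_r` is hypothetical):

```python
import math

def pearson_r(xs, ys):
    """Pearson r computed as SP / sqrt(SS_X * SS_Y)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ssx * ssy)

# Invented data: both variables are dichotomous and coded 0 / 1.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 1, 1, 1]

phi = pearson_r(x, y)
print(phi)
```

For these codes SP = 1 and SS_X = SS_Y = 2, so phi = 0.5.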
Intro to Linear Equations & Regression
Linear Equations
Slope:
determines how much the Y variable changes when X increases by one point. It is represented by the value b in the linear equation Y = bX + a, where a is the Y-intercept.
Regression
The statistical technique for finding the best-fitting straight line for a set of data is called regression, and the resulting straight line is called the regression line.
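A sketch of finding the regression line, assuming the standard least-squares solution b = SP / SS_X and a = M_Y − b·M_X (these formulas are not written out in the notes). The scores are made up and the function name is hypothetical.

```python
def regression_line(xs, ys):
    """Return (b, a) for the least-squares line Y-hat = b * X + a."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    b = sp / ssx          # slope
    a = my - b * mx       # Y-intercept
    return b, a

xs = [0, 10, 4, 8, 8]   # made-up scores
ys = [2, 6, 2, 4, 6]

b, a = regression_line(xs, ys)
predicted_at_6 = b * 6 + a   # predicted Y for X = 6
print(b, a, predicted_at_6)
```

For these scores b = 28/64 = 0.4375 and a = 1.375. The line passes through the point (M_X, M_Y) = (6, 4), so the predicted Y at X = 6 is 4.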
The Standard Error of Estimate
The standard error of estimate gives a measure of the standard distance between the predicted Y values on the regression line and the actual Y values in the data.
If the correlation is near 1.00 (or −1.00), the data points are clustered close to the line, and the standard error of estimate is small. As the correlation gets nearer to zero, the data points become more widely scattered, the line provides less accurate predictions, and the standard error of estimate grows larger.
The regression equation simply describes the best-fitting line and is used for making predictions. However, the correlation and the standard error of estimate indicate how accurate these predictions will be.
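The standard error of estimate can be sketched directly from the residuals, assuming the usual formula √(SS_residual / (n − 2)). The scores are made up and the function name is hypothetical.

```python
import math

def standard_error_of_estimate(xs, ys):
    """sqrt(SS_residual / df), where df = n - 2 and the residuals are the
    distances between the actual Y values and the regression line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    b = sp / ssx
    a = my - b * mx
    ss_residual = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(ss_residual / (n - 2))

xs = [0, 10, 4, 8, 8]   # made-up scores
ys = [2, 6, 2, 4, 6]

se = standard_error_of_estimate(xs, ys)
print(round(se, 3))
```

For these scores SS_residual = 3.75 with df = 3, so the standard error of estimate is √1.25, about 1.12.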
Analysis of Regression: The Significance of the Regression Equation
The process of testing the significance of a regression equation is called analysis of regression and is very similar to the analysis of variance (ANOVA).
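A sketch of the F-ratio for analysis of regression, assuming the usual partition SS_regression = r²·SS_Y (df = 1) and SS_residual = (1 − r²)·SS_Y (df = n − 2). The scores are made up and the function name is hypothetical; the resulting F would still need to be compared with a critical value from an F table.

```python
def analysis_of_regression(xs, ys):
    """Return (F, df_regression, df_residual) for the regression of Y on X."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    r_squared = sp ** 2 / (ssx * ssy)

    ss_regression = r_squared * ssy        # variability predicted by the line
    ss_residual = (1 - r_squared) * ssy    # variability left unpredicted
    df_regression, df_residual = 1, n - 2

    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return ms_regression / ms_residual, df_regression, df_residual

xs = [0, 10, 4, 8, 8]   # made-up scores
ys = [2, 6, 2, 4, 6]

f_ratio, df1, df2 = analysis_of_regression(xs, ys)
print(round(f_ratio, 2), df1, df2)
```

For these scores MS_regression = 12.25 and MS_residual = 1.25, giving F(1, 3) = 9.8.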