Correlation and Regression - Coggle Diagram
Correlation and Regression
Scatter plot
: A graphical display that shows the relationship between two variables by plotting individual data points on a two-dimensional graph, with one variable on the X-axis and the other on the Y-axis.
Positive correlation
: A relationship between two variables in which they tend to change in the same direction—as one variable increases, the other also increases, and as one decreases, the other also decreases.
Negative correlation
: A relationship between two variables in which they tend to change in opposite directions—as one variable increases, the other decreases, and vice versa.
Correlation
: A statistical measure that describes the strength and direction of the relationship between two variables. It quantifies how changes in one variable are associated with changes in another variable.
Linear relationship
: A relationship between two variables that can be represented by a straight line. When graphed, the data points cluster around a line rather than a curve.
Pearson correlation
: A statistical measure (denoted as r) that quantifies the strength and direction of the linear relationship between two continuous variables. Values range from –1 to +1, with values closer to the extremes indicating stronger relationships.
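As a sketch, r can be computed directly from the sum of products (SP) and the sums of squares for X and Y; the scores below are invented purely for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r = SP / sqrt(SSX * SSY), built from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # sum of products
    ssx = sum((x - mx) ** 2 for x in xs)                    # SS for X
    ssy = sum((y - my) ** 2 for y in ys)                    # SS for Y
    return sp / sqrt(ssx * ssy)

X = [1, 2, 3, 4, 5]        # illustrative scores only
Y = [1, 3, 2, 5, 4]
r = pearson_r(X, Y)        # 0.8 for these scores
```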
Envelope
: The elliptical shape formed by the cluster of data points in a scatter plot. A narrow, elongated envelope indicates a strong correlation, while a circular envelope indicates a weak or zero correlation.
Perfect correlation
: A correlation of exactly +1 or –1, indicating that all data points fall precisely on a straight line. In a perfect correlation, knowing the value of one variable allows you to predict the other with complete accuracy.
Sum of products (SP)
: A measure of the co-variability between two variables, calculated by summing the products of the deviations of each variable from their respective means. SP indicates whether two variables vary together (positive SP) or in opposite directions (negative SP).
Outliers
: Extreme data points that are substantially different from the rest of the data. Outliers can have a disproportionate influence on correlation and regression analyses, either inflating or deflating the correlation coefficient.
SP definitional formula
: The conceptual formula for calculating SP: SP = Σ(X – MX)(Y – MY). This formula multiplies each X deviation by its corresponding Y deviation and sums the products.
SP computational formula
: An algebraically equivalent formula that is often easier to calculate: SP = ΣXY – (ΣX)(ΣY)/n. This formula uses raw scores rather than deviations from the mean.
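A quick check, on invented numbers, that the definitional and computational formulas give the same SP:

```python
def sp_definitional(xs, ys):
    # SP = sum of (X deviation) * (Y deviation)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

def sp_computational(xs, ys):
    # SP = sum(XY) - sum(X) * sum(Y) / n, using raw scores only
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

X = [1, 2, 3, 4, 5]    # illustrative scores only
Y = [1, 3, 2, 5, 4]
# both formulas give SP = 8 for these scores
```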
Restricted range
: A limitation in data collection where the range of scores for one or both variables is narrower than the full possible range. Restricted range typically reduces the observed correlation, making it an underestimate of the true relationship.
Coefficient of determination
: The squared correlation coefficient (r²) that indicates the proportion of variance in one variable that is explained or predicted by its relationship with another variable. It ranges from 0 to 1, with higher values indicating more explained variance.
Monotonic relationship
: A relationship between two variables in which one variable consistently increases (or consistently decreases) as the other increases, but not necessarily at a constant rate. Unlike linear relationships, monotonic relationships can include curves as long as the direction never reverses.
Correlation matrix
: A table that displays the correlation coefficients between multiple variables. Each cell shows the correlation between the variable in its row and the variable in its column, allowing for easy comparison of relationships among several variables.
Spearman correlation
: A non-parametric measure of correlation that assesses the relationship between two variables using ranked data rather than raw scores. It is appropriate when data are ordinal or when the relationship is monotonic but not necessarily linear.
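One way to sketch this: compute a plain Pearson r on the ranks. The ranking helper below assumes no tied scores (ties would need averaged ranks), and the data are invented to show a monotonic-but-curved case where Spearman reaches 1 while Pearson does not.

```python
from math import sqrt

def simple_ranks(values):
    # Assign ranks 1..n by sorted position; assumes no tied scores.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return sp / sqrt(ssx * ssy)

def spearman_rs(xs, ys):
    # Spearman correlation = Pearson correlation computed on the ranks
    return pearson_r(simple_ranks(xs), simple_ranks(ys))

X = [1, 2, 3, 4, 5]
Y = [1, 4, 9, 16, 25]    # monotonic but curved (Y = X**2)
# spearman_rs(X, Y) is exactly 1.0, while pearson_r(X, Y) is about 0.98
```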
Dichotomous variable or binomial variable
: A variable that has only two possible values or categories, such as male/female, yes/no, or pass/fail. These variables are often coded numerically (0 and 1) for statistical analysis.
Phi-coefficient
: A correlation coefficient used when both variables are dichotomous. It measures the association between two binary variables and ranges from –1 to +1.
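A minimal sketch of phi from a 2×2 frequency table; the counts are invented. Computing Pearson r on the same data with 0/1 codes gives the identical value.

```python
from math import sqrt

def phi(a, b, c, d):
    """Phi for a 2x2 table of counts:
              Y=0  Y=1
        X=0    a    b
        X=1    c    d
    """
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

# The 0/1-coded data X = [0,0,0,1,1,1], Y = [0,0,1,0,1,1] give the table
# a=2, b=1, c=1, d=2, so phi = (2*2 - 1*1) / sqrt(3*3*3*3) = 3/9 = 1/3.
```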
Point-biserial correlation
: A correlation coefficient used when one variable is continuous and the other is dichotomous (having only two categories). It is mathematically equivalent to the Pearson correlation when the dichotomous variable is coded as 0 and 1.
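A sketch of that equivalence: coding the two groups as 0 and 1 and running an ordinary Pearson computation yields the point-biserial correlation. The group codes and scores below are invented.

```python
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return sp / sqrt(ssx * ssy)

group = [0, 0, 0, 1, 1, 1]   # dichotomous variable coded 0/1
score = [2, 3, 4, 6, 7, 8]   # continuous variable (invented scores)
r_pb = pearson_r(group, score)   # the point-biserial correlation, about 0.93
```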
Linear equation
: A mathematical equation that describes a straight line, typically written as Y = bX + a, where b is the slope and a is the Y-intercept. This equation allows prediction of Y values from corresponding X values.
Slope
: The steepness of a line, represented by b in the regression equation. It indicates how much Y changes for each one-unit increase in X. A positive slope means Y increases as X increases; a negative slope means Y decreases as X increases.
Y-intercept
: The point where the regression line crosses the Y-axis, represented by a in the regression equation. It indicates the predicted value of Y when X equals zero.
Regression
: A statistical technique used to describe and predict the relationship between variables. It finds the best-fitting line through a set of data points, allowing prediction of one variable based on another.
Regression line
: The straight line that best fits the data points in a scatter plot. It represents the predicted relationship between X and Y and is positioned to minimize prediction errors.
Regression equation for Y
: The equation used to predict Y values from X values: Ŷ = bX + a, where Ŷ is the predicted Y value, b is the slope (SP/SSX), and a is the Y-intercept (MY – bMX).
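Putting the slope and intercept formulas together, on invented scores:

```python
def regression_line(xs, ys):
    """Return (b, a) for Y-hat = b*X + a, with b = SP/SSX and a = MY - b*MX."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # sum of products
    ssx = sum((x - mx) ** 2 for x in xs)                    # SS for X
    b = sp / ssx
    a = my - b * mx
    return b, a

X = [1, 2, 3, 4, 5]           # illustrative scores only
Y = [1, 3, 2, 5, 4]
b, a = regression_line(X, Y)  # b = 0.8, a = 0.6 (up to rounding) here
y_hat = [b * x + a for x in X]   # predicted Y for each X
```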
Analysis of regression
: A statistical procedure that evaluates the significance of a regression equation by comparing predicted variability to unpredicted variability. It uses an F-ratio to test whether the regression equation accounts for a significant portion of the variance in the dependent variable, determining if the relationship between X and Y is statistically meaningful.
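A sketch of that F-ratio for simple regression, where df = 1 for the regression and df = n – 2 for the residual; the data are invented, and the resulting F would be compared against a critical value from the F distribution.

```python
def regression_F(xs, ys):
    """F = MS_regression / MS_residual for a simple (one-predictor) regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    b = sp / ssx
    a = my - b * mx
    # unpredicted variability: squared errors around the regression line
    ss_residual = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
    # predicted variability: what the line accounts for
    ss_regression = ssy - ss_residual
    df_regression, df_residual = 1, n - 2
    F = (ss_regression / df_regression) / (ss_residual / df_residual)
    return F, df_regression, df_residual

F, df1, df2 = regression_F([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])
# F is about 5.33 with df = (1, 3) for these scores
```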
Least-squared-error solution
: The mathematical method used to find the regression line by minimizing the sum of the squared differences (residuals) between the actual Y values and the predicted Y values. This ensures the best possible fit to the data.
Standard error of estimate
: A measure of the average distance between the actual Y values and the predicted Y values (Ŷ) from the regression line. It quantifies the typical size of prediction errors and is calculated as the square root of the mean squared residual.
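A sketch on invented scores; note this version averages SSresidual over df = n – 2, the convention in most introductory texts for simple regression.

```python
from math import sqrt

def standard_error_of_estimate(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    b = sp / ssx
    a = my - b * mx
    ss_residual = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
    return sqrt(ss_residual / (n - 2))   # df = n - 2 in simple regression

see = standard_error_of_estimate([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])
# about 1.10 for these scores: the typical size of a prediction error
```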
Predicted variability
: The portion of the total variability in Y that is explained by the regression equation, also called SSregression. It represents how much of the variance in Y can be accounted for by its relationship with X.
Unpredicted variability
: The portion of the total variability in Y that is not explained by the regression equation, also called SSresidual. It represents the variance in Y that remains after accounting for the relationship with X, reflecting prediction errors.
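The two components above partition the total SS for Y: SSregression = r²·SSY and SSresidual = (1 – r²)·SSY, which also ties the coefficient of determination to regression. A numeric check on invented scores:

```python
from math import sqrt

X = [1, 2, 3, 4, 5]    # illustrative scores only
Y = [1, 3, 2, 5, 4]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sp = sum((x - mx) * (y - my) for x, y in zip(X, Y))
ssx = sum((x - mx) ** 2 for x in X)
ssy = sum((y - my) ** 2 for y in Y)
r = sp / sqrt(ssx * ssy)

ss_regression = r ** 2 * ssy          # predicted variability
ss_residual = (1 - r ** 2) * ssy      # unpredicted variability
# the two pieces add back up to the total SSY
```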
Standardized form of the regression equation
: The regression equation expressed using z-scores rather than raw scores: ẑY = β(zX). In simple regression, the standardized coefficient β equals the Pearson correlation coefficient r.
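A sketch of that equivalence on invented scores: predicting with ẑY = r·zX and converting the standardized predictions back to raw scores reproduces the ordinary regression predictions Ŷ = bX + a.

```python
from math import sqrt

X = [1, 2, 3, 4, 5]    # illustrative scores only
Y = [1, 3, 2, 5, 4]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sdx = sqrt(sum((x - mx) ** 2 for x in X) / n)   # population SD of X
sdy = sqrt(sum((y - my) ** 2 for y in Y) / n)   # population SD of Y
sp = sum((x - mx) * (y - my) for x, y in zip(X, Y))
r = sp / (n * sdx * sdy)                        # Pearson r (0.8 here)

# standardized prediction: z-hat_Y = beta * z_X, with beta = r
z_hat = [r * (x - mx) / sdx for x in X]
# converting back to raw scores matches Y-hat = b*X + a, since b = r*sdy/sdx
y_hat_from_z = [my + z * sdy for z in z_hat]
```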