Welcome to Correlation and Regression!
Hello future statistician! This chapter is where mathematics meets the real world. We move beyond just describing data and start asking the exciting question: "How do two different things relate to each other, and can we use one to predict the other?"
You will learn the tools necessary to analyze relationships, such as how the number of hours studied affects exam scores, or how temperature affects ice cream sales. Don't worry if this seems tricky at first; we will break down every calculation and concept step-by-step!
I. Visualizing Relationships: Scatter Diagrams
The first step in analyzing the relationship between two variables is to draw a picture. This picture is called a Scatter Diagram.
What is a Scatter Diagram?
A scatter diagram plots pairs of data points \((x, y)\) onto a standard Cartesian graph.
- The Independent Variable (\(x\)) is usually plotted on the horizontal axis. This variable is the one we think might influence the other. (Think of it as the cause or the input.)
- The Dependent Variable (\(y\)) is plotted on the vertical axis. This is the variable whose value depends on \(x\). (Think of it as the effect or the output.)
Types of Correlation
By looking at the pattern the dots form, we can describe the correlation, which is the strength and direction of the linear relationship between the variables.
1. Positive Correlation
- As \(x\) increases, \(y\) tends to increase.
- The points generally slope upwards from left to right.
- Example: The more hours you spend running, the greater the distance you cover.
2. Negative Correlation
- As \(x\) increases, \(y\) tends to decrease.
- The points generally slope downwards from left to right.
- Example: The older a car is, the lower its resale value.
3. Zero or No Correlation
- There is no clear pattern or relationship between \(x\) and \(y\).
- The points are scattered randomly.
- Example: A person’s height and the number of pets they own.
We use a scatter diagram to determine the Direction (Positive or Negative) and the Strength (How tightly grouped the points are) of the relationship.
II. Quantifying Correlation: The PMCC (\(r\))
Our visual judgment of a scatter diagram is subjective. To get an objective, numerical measure of linear correlation, we use the Product Moment Correlation Coefficient, usually denoted by \(r\).
The Product Moment Correlation Coefficient (\(r\))
The PMCC is a value that measures the strength and direction of linear correlation.
Key Properties of \(r\)
- The value of \(r\) must always be between \(-1\) and \(+1\), inclusive: \(-1 \le r \le 1\).
- A value of \(r = +1\) means Perfect Positive Linear Correlation (all points lie exactly on a straight line sloping up).
- A value of \(r = -1\) means Perfect Negative Linear Correlation (all points lie exactly on a straight line sloping down).
- A value of \(r = 0\) means No Linear Correlation.
Interpretation: Strength of Correlation
How do we describe values between 0 and 1 (or 0 and -1)?
- Strong Correlation: \(r\) is close to \(-1\) or \(+1\) (e.g., \(r = 0.9\) or \(r = -0.85\)). The points are very close to a straight line.
- Moderate Correlation: \(r\) is around the middle of the range (e.g., \(r = 0.5\) or \(r = -0.4\)).
- Weak Correlation: \(r\) is close to 0 (e.g., \(r = 0.1\) or \(r = -0.2\)). The points are widely scattered.
Memory Aid: Think of \(r\) as your relationship status meter. 1 is "Perfect Match," -1 is "Perfect Opposites," and 0 is "Stranger."
!!! Important Caution !!!
The PMCC only measures linear relationships. If the data forms a strong curve (a non-linear relationship), \(r\) might be close to zero, even though there is a very strong relationship present! You must always look at the scatter diagram first.
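This caution can be demonstrated numerically (not something the S1 syllabus requires, but a useful sanity check). In the sketch below, every point lies exactly on the parabola \(y = x^2\), a perfect non-linear relationship, yet the PMCC comes out as zero:

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient: r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# A perfect non-linear relationship: every point lies on y = x^2 ...
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]   # [4, 1, 0, 1, 4]

# ... yet the *linear* correlation coefficient is zero.
print(pmcc(xs, ys))         # → 0.0
```

A scatter diagram of these points would reveal the U-shape instantly, which is exactly why you should plot the data before trusting \(r\).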
The PMCC is typically calculated using summary statistics like \(S_{xx}\), \(S_{yy}\), and \(S_{xy}\). These values are usually given to you in exam questions, or you use your calculator's statistical functions to find \(r\). The full formula is: \[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \]
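As a quick worked sketch of the formula (the summary values here are invented for illustration, in the style an exam question might supply them):

```python
from math import sqrt

# Hypothetical summary statistics, as an exam question might provide.
s_xx = 40.0
s_yy = 90.0
s_xy = 54.0

# r = Sxy / sqrt(Sxx * Syy)
r = s_xy / sqrt(s_xx * s_yy)
print(r)   # → 0.9, a strong positive linear correlation
```

Note that \(r\) inherits its sign from \(S_{xy}\), since \(\sqrt{S_{xx} S_{yy}}\) is always positive.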
III. Linear Regression: Finding the Line of Best Fit
If we establish that the correlation is strong and linear, we can define a straight line that best describes the relationship. This is called the Regression Line. Its primary purpose is prediction.
The Regression Line \(y\) on \(x\)
In the S1 curriculum, we focus on the line that allows us to predict the dependent variable \(y\) from the independent variable \(x\). The standard equation form is:
\[ \mathbf{y = a + bx} \]
- \(y\): The predicted value of the dependent variable.
- \(x\): The value of the independent variable used for prediction.
- \(b\): The gradient (or slope) of the line. This tells us how much \(y\) changes for every 1 unit increase in \(x\).
- \(a\): The \(y\)-intercept. This is the predicted value of \(y\) when \(x=0\).
Step-by-Step Calculation of \(a\) and \(b\)
We need the same summary statistics we used for \(r\): \(S_{xx}\) and \(S_{xy}\).
Step 1: Calculate the Gradient (\(b\))
The formula for the gradient \(b\) (the regression coefficient) is: \[ b = \frac{S_{xy}}{S_{xx}} \]
Note: The sign of \(b\) must match the sign of \(r\). If \(r\) is positive, \(b\) must be positive (positive correlation).
Step 2: Calculate the Y-Intercept (\(a\))
The regression line must always pass through the mean point \((\bar{x}, \bar{y})\). We use this fact to find \(a\): \[ \bar{y} = a + b\bar{x} \] Rearranging this gives: \[ \mathbf{a = \bar{y} - b\bar{x}} \]
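Putting the two steps together (with invented summary statistics, purely for illustration):

```python
# Hypothetical summary statistics for a small data set.
s_xy = 54.0
s_xx = 40.0
x_bar = 5.0
y_bar = 12.0

# Step 1: gradient b = Sxy / Sxx
b = s_xy / s_xx          # 1.35

# Step 2: intercept a = y_bar - b * x_bar
# (the regression line must pass through the mean point (x_bar, y_bar))
a = y_bar - b * x_bar    # 5.25

print(f"y = {a} + {b}x")   # → y = 5.25 + 1.35x
```

Notice the order matters: you cannot find \(a\) until you know \(b\).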
Common Mistake: Students often confuse \(b\) with the formula for \(r\). Remember, \(b\) only involves \(S_{xy}\) and \(S_{xx}\), while \(r\) uses \(S_{yy}\) too.
Interpreting \(a\) and \(b\)
It is crucial to be able to interpret the meaning of your calculated coefficients in context:
- Interpretation of \(b\): "For every 1 unit increase in [variable \(x\)], the [variable \(y\)] is predicted to increase/decrease by \(|b|\) units."
- Interpretation of \(a\): "The predicted value of [variable \(y\)] when [variable \(x\)] is zero is \(a\)." (Be careful: \(x=0\) might not make sense in the context, like predicting salary for 0 years of experience).
The equation \(y = a + bx\) is our tool for prediction. We calculate \(b\) first, and then use the means \((\bar{x}, \bar{y})\) to calculate \(a\).
IV. Reliability of Predictions (Interpolation vs. Extrapolation)
Once we have our regression line, we can use it to predict values of \(y\) for specific values of \(x\). But how reliable are these predictions?
1. Interpolation
This occurs when we use a value of \(x\) that lies within the range of the original data.
- Example: If our data set ranges from 10 hours studied to 50 hours studied, predicting the score for someone who studied 30 hours is interpolation.
- Reliability: Predictions using interpolation are generally reliable, provided the PMCC (\(r\)) is close to \(+1\) or \(-1\).
2. Extrapolation
This occurs when we use a value of \(x\) that lies outside the range of the original data (either much higher or much lower).
- Example: Using the data above to predict the score for someone who studied 100 hours (or 1 hour).
- Reliability: Predictions using extrapolation are generally unreliable (or risky). We cannot assume the relationship (the straight line) continues forever beyond the observed data range.
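One way to make this rule concrete is a small helper that labels each prediction according to whether \(x\) falls inside the observed range. The regression equation and data range below are hypothetical:

```python
def predict(x, a, b, x_min, x_max):
    """Predict y = a + b*x and flag whether this is interpolation
    (x within the observed data range) or extrapolation (outside it)."""
    y = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y, kind

# Hypothetical study-hours example: data observed from 10 to 50 hours.
a, b = 20.0, 1.2
print(predict(30, a, b, 10, 50))   # within range → generally reliable
print(predict(100, a, b, 10, 50))  # outside range → treat with caution
```

The arithmetic is identical in both calls; only the trustworthiness of the answer differs.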
Analogy: Interpolation is like guessing the temperature between 9am and 5pm when you have data points for those hours. Extrapolation is guessing the temperature at midnight based only on your 9am to 5pm readings—the relationship might completely change!
V. The Impact of Coding on Correlation and Regression
Sometimes, data values are very large or very small, making calculation difficult (although modern calculators handle this fine). We use coding (linear transformations) to simplify the numbers.
A typical coding relationship looks like: \(p = \frac{x - c}{d}\) or \(p = ax + b\).
1. Effect on Correlation (PMCC)
If the variables \(x\) and \(y\) are each transformed linearly (e.g., \(x' = ax + b\) and \(y' = cy + d\)), the magnitude of the PMCC is unchanged.
Rule: The correlation coefficient \(r\) between \(x\) and \(y\) is the same as the correlation coefficient \(r\) between the coded variables \(x'\) and \(y'\), provided the scaling factors (\(a\) and \(c\)) are both positive or both negative.
In S1 curriculum terms: Unless you are specifically told one scaling factor is negative (which reverses the direction of the relationship), you assume:
Coding does not change the magnitude of \(r\). \(r_{xy} = r_{x'y'}\).
2. Effect on Regression Coefficients (\(a\) and \(b\))
The regression line does change when data is coded.
If the regression line for coded data is \(y' = A + Bx'\), you must use the coding relationship to find the original line \(y = a + bx\).
Example Scenario:
Suppose we used the coding: \(x' = 2x - 5\) and \(y' = \frac{y}{10}\).
We find the regression line \(y' = 1.5 + 4x'\).
Step-by-Step Decoding:
- Substitute the coding definitions into the coded equation: \[ \frac{y}{10} = 1.5 + 4(2x - 5) \]
- Simplify the right-hand side (RHS): \[ \frac{y}{10} = 1.5 + 8x - 20 \] \[ \frac{y}{10} = 8x - 18.5 \]
- Multiply through by the scale factor (10) to isolate \(y\): \[ y = 10(8x - 18.5) \] \[ \mathbf{y = 80x - 185} \]
This is the original regression equation (\(a = -185\), \(b = 80\)).
- PMCC (\(r\)): Stays the same (magnitude and sign).
- Regression (\(a\) and \(b\)): Changes. Must be decoded to return to original variables.
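You can sanity-check a decoding like the example above numerically: for any \(x\), applying the coding, using the coded regression line, and then un-coding should give exactly the same answer as the decoded equation.

```python
# Verify the decoding from the worked example numerically.
# Coding: x' = 2x - 5,  y' = y / 10.  Coded line: y' = 1.5 + 4x'.
# Decoded line claimed: y = 80x - 185.

def y_via_coding(x):
    x_coded = 2 * x - 5           # apply the x-coding
    y_coded = 1.5 + 4 * x_coded   # use the coded regression line
    return 10 * y_coded           # undo the y-coding (y = 10 * y')

def y_decoded(x):
    return 80 * x - 185

for x in [0, 1, 2.5, 10]:
    assert y_via_coding(x) == y_decoded(x)
print("Decoded equation agrees with the coded line.")
```

If the two routes disagree for even one value of \(x\), the algebra in the decoding step has gone wrong somewhere.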
VI. Correlation vs. Causation: A Critical Distinction
This is one of the most important concepts in statistics, and examiners love to test your understanding of it!
Correlation Does Not Imply Causation
Just because two variables show a strong correlation (\(r\) is close to \(\pm 1\)), it does not necessarily mean that one causes the other.
Real-World Analogy:
Imagine a strong positive correlation is found between the number of ice creams sold and the number of crimes committed in a city over a year.
Does eating ice cream make people commit crimes? No!
The relationship is likely caused by a third variable, often called a confounding variable. In this case, the confounding variable is Temperature. Hot temperatures increase both ice cream sales and outdoor activity, which often correlates with higher crime rates.
When Can We Suggest Causation?
In Mathematics (S1), we generally cannot prove causation. We can only state that correlation exists.
However, if a strong correlation is found, and there is a logical, scientific reason or mechanism linking the two variables (e.g., hours studied and exam score), we can suggest that there might be a causal link.
Always remember: The presence of a strong \(r\) value is only evidence of an association, not proof of cause and effect.
Correlation tells us if two things move together (\(r\)). Regression tells us how they move together (\(y=a+bx\)). Always consider reliability (interpolation/extrapolation) and causation when interpreting your results!