Welcome to Regression and Correlation!
Hello future statisticians! This chapter, part of Unit S3, is incredibly important because it teaches us how to find and measure relationships between two variables. Think about how the amount of sunlight affects plant growth, or how study hours relate to exam scores.
By the end of these notes, you’ll be able to:
- Visually represent data relationships using scatter diagrams.
- Calculate and interpret the strength of these relationships using the Product Moment Correlation Coefficient (PMCC).
- Find the 'line of best fit' using the Least Squares Regression method, allowing you to make predictions.
Don't worry if this seems tricky at first—we'll break down the formulas and concepts into simple, manageable steps. Let’s dive in!
1. Visualizing Relationships: Scatter Diagrams
The first step in analyzing any bivariate data (data involving two variables) is to plot it on a scatter diagram.
1.1 Independent and Dependent Variables
When you plot two variables, you need to decide which one influences the other:
- Independent Variable (x): This is the variable you control, or the one that causes change. It goes on the horizontal axis.
- Dependent Variable (y): This is the variable that changes in response to the independent variable. It goes on the vertical axis.
Example: If we study how temperature (x) affects ice cream sales (y), temperature is the independent variable.
1.2 Types of Linear Correlation
When we look at a scatter diagram, we are looking for the direction and strength of the relationship (correlation).
- Positive Correlation: As x increases, y generally increases. The points move upwards from left to right.
- Negative Correlation: As x increases, y generally decreases. The points move downwards from left to right.
- No Correlation: The points are randomly scattered, showing no clear relationship between x and y.
Quick Review: Correlation vs. Causation
Correlation means two variables move together. Causation means one variable directly causes the change in the other. Just because two things are correlated doesn't mean one causes the other!
Did you know? In summer, ice cream sales and crime rates both increase. They are correlated, but ice cream doesn't cause crime (the underlying cause is the warm weather!).
2. Measuring Correlation: The PMCC (\(r\))
While scatter diagrams show us the relationship visually, we need a precise mathematical measure. This is the job of the Product Moment Correlation Coefficient (PMCC), denoted by \(r\).
2.1 What is PMCC?
The PMCC measures the strength and direction of the linear relationship between two variables.
2.2 Interpreting the Value of \(r\)
The value of \(r\) always lies between -1 and 1:
$$ -1 \le r \le 1 $$

| PMCC (\(r\)) Value | Interpretation |
|---|---|
| \(r = 1\) | Perfect Positive Correlation (All points lie exactly on an upward-sloping straight line.) |
| \(r\) near +1 (e.g., 0.8 to 0.99) | Strong Positive Correlation |
| \(r\) near 0.5 | Moderate Positive Correlation |
| \(r \approx 0\) | No Linear Correlation |
| \(r\) near -0.5 | Moderate Negative Correlation |
| \(r\) near -1 (e.g., -0.99 to -0.8) | Strong Negative Correlation |
| \(r = -1\) | Perfect Negative Correlation (All points lie exactly on a downward-sloping straight line.) |
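The bands in the table above are a common convention rather than a strict rule, but writing them out as a small helper can make the interpretation habit stick. This is a minimal sketch in Python; the function name `describe_r` and the exact thresholds are illustrative choices, not part of any specification:

```python
def describe_r(r):
    """Give a rough verbal interpretation of a PMCC value.

    The cut-offs (0.7 for 'strong', 0.4 for 'moderate') are a common
    convention for illustration, not an official definition.
    """
    if not -1 <= r <= 1:
        raise ValueError("PMCC must lie between -1 and 1")
    direction = "positive" if r > 0 else "negative" if r < 0 else "no"
    size = abs(r)
    if size == 1:
        strength = "perfect"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.4:
        strength = "moderate"
    elif size > 0:
        strength = "weak"
    else:
        return "no linear correlation"
    return f"{strength} {direction} correlation"

print(describe_r(0.85))   # strong positive correlation
print(describe_r(-0.5))   # moderate negative correlation
```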
2.3 Calculating \(r\) (The Formula Components)
While you will often use your calculator to find \(r\), it is crucial to understand the building blocks of the calculation. These are the sums of squares and products:
- \(S_{xx}\) (Sum of squares of x): Measures the spread of the x data.
- \(S_{yy}\) (Sum of squares of y): Measures the spread of the y data.
- \(S_{xy}\) (Sum of products): Measures how x and y vary together. This is the key component that determines the sign (+ or -) of the correlation.
The formula for PMCC is:
$$ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} $$

You will rarely need to calculate \(S_{xx}\), \(S_{yy}\) and \(S_{xy}\) by hand from the raw data—your Edexcel formula booklet provides the definitions based on sums, and your calculator handles the bulk of the work. But you MUST be able to use these three values to calculate \(r\).
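To see where the three sums come from, here is a minimal sketch in plain Python that computes \(S_{xx}\), \(S_{yy}\), \(S_{xy}\) and \(r\) from raw data. The data values are invented purely for illustration; in an exam these sums are usually given or read from your calculator:

```python
# Illustrative data (invented for this sketch).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sums of squares and products, using the deviation-from-mean form.
S_xx = sum((x - mean_x) ** 2 for x in xs)                         # spread of x
S_yy = sum((y - mean_y) ** 2 for y in ys)                         # spread of y
S_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))   # joint variation

# PMCC: r = S_xy / sqrt(S_xx * S_yy)
r = S_xy / (S_xx * S_yy) ** 0.5
print(round(r, 4))  # prints 0.7746
```

Note that only \(S_{xy}\) can be negative, which is why it alone determines the sign of \(r\).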
Key Takeaway for PMCC
PMCC (\(r\)) only measures linear relationships. If the data forms a perfect curve (non-linear relationship), \(r\) might be close to zero, misleading you into thinking there is no relationship when a strong non-linear one exists!
3. Finding the Line of Best Fit: Linear Regression
If we determine that a linear relationship exists (i.e., \(r\) is close to 1 or -1), we can find the Regression Line. This line is used to model the relationship and make predictions.
3.1 The Least Squares Principle
When drawing a line of best fit, we want the line that minimizes the total error between the actual data points and the line itself. The regression line found in S3 is called the Least Squares Regression Line. It minimizes the sum of the squares of the vertical distances (residuals) from the points to the line.
3.2 The Regression Equation
The standard equation for the regression line of y on x is:
$$ y = a + bx $$

Where:
- \(y\) is the dependent variable (the one you are predicting).
- \(x\) is the independent variable (the one you are using to predict).
- \(b\) is the gradient (slope) of the line.
- \(a\) is the y-intercept.
3.3 Calculating the Coefficients (\(a\) and \(b\))
To find the line, we use \(S_{xx}, S_{xy}\), and the means of the data (\(\bar{x}\) and \(\bar{y}\)).
Step 1: Calculate \(b\) (The Gradient)
The gradient \(b\) tells us how much \(y\) is expected to change for every one-unit increase in \(x\).
$$ b = \frac{S_{xy}}{S_{xx}} $$

Memory Aid: The gradient \(b\) is determined by how x and y vary together (\(S_{xy}\)) relative to the spread of x (\(S_{xx}\)).
Step 2: Calculate \(a\) (The Y-Intercept)
The least squares regression line always passes through the mean point \((\bar{x}, \bar{y})\). We use this fact and our calculated \(b\) to find \(a\).
$$ a = \bar{y} - b\bar{x} $$

Step-by-Step Example Process:
- Calculate the means \(\bar{x}\) and \(\bar{y}\). (Often provided, or easy via calculator).
- Calculate \(S_{xx}\) and \(S_{xy}\). (Usually provided in the question).
- Use Step 1 formula to find \(b\).
- Use Step 2 formula to find \(a\).
- Write the final equation \(y = a + bx\).
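The five steps above can be sketched directly in code. A minimal illustration in Python, using small invented data so every intermediate value is easy to check by hand:

```python
# Invented data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Step 1: the means.
mean_x = sum(xs) / n  # 3.0
mean_y = sum(ys) / n  # 4.0

# Step 2: S_xx and S_xy (often provided in the question).
S_xx = sum((x - mean_x) ** 2 for x in xs)                         # 10.0
S_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))   # 6.0

# Step 3: gradient b = S_xy / S_xx.
b = S_xy / S_xx  # 0.6

# Step 4: intercept a = mean_y - b * mean_x
# (the line always passes through the mean point).
a = mean_y - b * mean_x  # 2.2

# Step 5: write out the final equation.
print(f"y = {a:.1f} + {b:.1f}x")  # prints: y = 2.2 + 0.6x
```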
3.4 Why the Distinction Matters: y on x vs. x on y
A very common trap in Further Maths is confusing which variable predicts which. The regression line of \(y\) on \(x\) is NOT the same as the regression line of \(x\) on \(y\).
- Regression of y on x: \(y = a + bx\). Used when x is the independent variable and we want to predict y. (Minimizes vertical errors).
- Regression of x on y: \(x = c + dy\). Used when y is the independent variable and we want to predict x. (Minimizes horizontal errors).
If you are asked to predict the time spent studying (\(x\)) based on the score received (\(y\)), you must use the \(x\) on \(y\) line.
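A quick numerical check makes this distinction concrete. This sketch uses the same style of invented data as before, with the standard x-on-y formulas \(d = S_{xy}/S_{yy}\) and \(c = \bar{x} - d\bar{y}\):

```python
# Invented data: the two regression lines for it are genuinely different.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

S_xx = sum((x - mean_x) ** 2 for x in xs)                         # 10.0
S_yy = sum((y - mean_y) ** 2 for y in ys)                         # 6.0
S_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))   # 6.0

# y on x: y = a + bx (minimises vertical errors).
b = S_xy / S_xx          # 0.6
a = mean_y - b * mean_x  # 2.2

# x on y: x = c + dy (minimises horizontal errors).
d = S_xy / S_yy          # 1.0
c = mean_x - d * mean_y  # -1.0

# Rearranging x = -1 + y gives y = x + 1: gradient 1, not 0.6.
# The two lines coincide only when the correlation is perfect (|r| = 1).
```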
Crucial Note for Struggling Students
Always identify the dependent variable first! If you are predicting HEIGHT based on AGE, then HEIGHT is \(y\) and AGE is \(x\). Use \(y = a + bx\). The formula for \(b\) always has \(S_{xx}\) in the denominator when you are predicting \(y\) from \(x\).
4. Using the Regression Line: Prediction and Limitations
Once you have the equation \(y = a + bx\), you can use it to estimate values.
4.1 Interpolation (Good Prediction)
Interpolation is making a prediction for the dependent variable \(y\) using an \(x\) value that falls within the range of the original data used to create the line.
Example: If the original data used ages 5 to 15, predicting the height of a 10-year-old is interpolation. This is generally reliable.
4.2 Extrapolation (Bad Prediction)
Extrapolation is making a prediction for \(y\) based on an \(x\) value that falls outside the range of the original data.
Example: Using data from ages 5-15 to predict the height of a 40-year-old.
Warning! Extrapolation is dangerous because you are assuming the linear trend continues indefinitely, which is often false in real life. Relationships frequently break down or change form outside the observed range.
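The interpolation/extrapolation check is easy to automate. A small sketch (the helper name `predict` and the example line \(y = 2.2 + 0.6x\) fitted from \(x\)-values 1 to 5 are illustrative, not from any real data set):

```python
def predict(x, a, b, x_min, x_max):
    """Return the regression estimate a + b*x together with a flag
    saying whether x lies inside the original data range
    (interpolation) or outside it (extrapolation)."""
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return a + b * x, kind

# Illustrative line y = 2.2 + 0.6x, fitted from x-values between 1 and 5:
print(predict(3, 2.2, 0.6, 1, 5))   # within range: interpolation
print(predict(40, 2.2, 0.6, 1, 5))  # outside range: extrapolation, unreliable!
```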
4.3 Reliability and Suitability
The reliability of any prediction depends on two things:
- Strength of Correlation (\(|r|\)): The closer \(|r|\) is to 1, the better the data fits the line, and the more reliable the prediction.
- Scope of Data (Interpolation vs. Extrapolation): Interpolation is usually reliable; extrapolation is usually unreliable.
Analogy: Imagine predicting your speed on a 10 km journey. If you based the prediction on data from the first 5 km (like interpolation), it is probably good. If you used that same data to predict your speed over a 300 km journey (like extrapolation), it is probably wrong, because the road conditions will change!
Chapter Summary Checklist
- Can I interpret PMCC (\(r\)) in terms of strength and direction?
- Do I know that correlation does not imply causation?
- Can I calculate the coefficients \(a\) and \(b\) given \(S_{xx}\) and \(S_{xy}\)?
- Do I know the difference between the line of \(y\) on \(x\) and \(x\) on \(y\)?
- Can I identify if a prediction requires interpolation or extrapolation?
If you answered yes to all, you are ready for exam questions!
Good luck with your studying! You've successfully grasped the core concepts of statistical relationships.