Correlation and Regression: Study Notes (Unit S1: Statistics 1)

Hello there! Welcome to the exciting world of Correlation and Regression. This chapter is all about understanding relationships between two different measurements, like how the amount of time you spend studying affects your exam scores. Don’t worry if statistics sometimes feels overwhelming—we are going to break down these concepts step-by-step!

What will we learn? We will learn how to visually represent relationships using diagrams, how to measure the strength of those relationships using a special number called the PMCC, and finally, how to create a mathematical line to make predictions. This skill is vital for modeling real-world data!

1. Introduction to Bivariate Data and Scatter Diagrams

What is Bivariate Data?

Bivariate Data simply means data that involves two variables. We look at pairs of observations for the same subjects.

  • Example: Measuring the height (Variable 1) and weight (Variable 2) of a group of students.

Explanatory vs. Response Variables

When we analyze relationships, we often assume one variable might influence the other.

1. Explanatory Variable (Independent, \(x\)): This is the variable we think might explain or cause changes in the other. It goes on the horizontal axis (the x-axis).

2. Response Variable (Dependent, \(y\)): This is the variable we are measuring or trying to predict. Its value depends on the explanatory variable. It goes on the vertical axis (the y-axis).

The Scatter Diagram

A Scatter Diagram is the first step in analyzing bivariate data. It plots the paired data points \((x, y)\) onto a graph.

Key Takeaway: By looking at the pattern of the dots, we can immediately estimate the type and strength of the relationship.

Interpreting Patterns in Scatter Diagrams

We look for three main features: direction, form, and strength. In S1, we mainly focus on linear relationships.

1. Positive Correlation: As \(x\) increases, \(y\) tends to increase. The dots go up and to the right.
2. Negative Correlation: As \(x\) increases, \(y\) tends to decrease. The dots go down and to the right.
3. No Correlation: There is no obvious pattern; the dots are scattered randomly.

Quick Review: The pattern tells us about the correlation. If the points form a tight, straight line, the correlation is strong.
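The direction of a scatter pattern can also be checked numerically: if larger \(x\) values tend to pair with larger \(y\) values, the products of deviations from the means are mostly positive. A minimal Python sketch (the data here is invented for illustration):

```python
# Hypothetical bivariate data: hours studied (x) and exam score (y).
x = [1, 2, 3, 4, 5, 6]
y = [48, 55, 59, 66, 70, 79]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sum of products of deviations from the means: positive for an
# up-and-right (positive) pattern, negative for a down-and-right one.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

direction = "positive" if s_xy > 0 else "negative" if s_xy < 0 else "none"
print(direction)
```

This quantity \(\sum (x - \bar{x})(y - \bar{y})\) reappears later as \(S_{xy}\), one of the summary statistics used for both the PMCC and the regression line.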


2. Measuring Correlation: The Product Moment Correlation Coefficient (\(r\))

What is Correlation?

Correlation measures the strength and direction of the linear relationship between two variables.

The Product Moment Correlation Coefficient (PMCC)

To get a precise, numerical measure of correlation, we use the PMCC, often denoted by the letter \(r\). Your calculator will often find this value for you, but you must understand what it means!

Properties of \(r\)

The PMCC, \(r\), always falls within the range of \(-1\) to \(+1\):

$$ -1 \le r \le 1 $$

1. If \(r = +1\): Perfect positive linear correlation. All points lie exactly on a straight line sloping upwards.
2. If \(r = -1\): Perfect negative linear correlation. All points lie exactly on a straight line sloping downwards.
3. If \(r = 0\): No linear correlation (note: a non-linear relationship may still exist).

Interpreting the Value of \(r\)

The closer \(|r|\) (the size of \(r\), ignoring its sign) is to 1, the stronger the linear relationship.

  • Strong Positive: \(r\) is close to +1 (e.g., \(r = 0.9\))
  • Moderate Positive: \(r\) is around 0.5 to 0.8
  • Weak Positive: \(r\) is close to 0 but positive (e.g., \(r = 0.2\))
  • Strong Negative: \(r\) is close to -1 (e.g., \(r = -0.9\))

Memory Aid: Think of \(r\) as a speedometer for relationships. 1 means full speed ahead (perfect match); 0 means stalled (no match). The sign just tells you which direction the relationship is going (up or down).
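Although your calculator usually produces \(r\) directly, it can be computed from the summary statistics introduced later in these notes via the formula-booklet relation \(r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}\). A short Python sketch, using made-up data:

```python
from math import sqrt

# Hypothetical paired data.
x = [2, 4, 5, 7, 8, 10]
y = [12, 15, 17, 20, 24, 28]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(xi**2 for xi in x)
sum_y2 = sum(yi**2 for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Formula-booklet S-values.
s_xx = sum_x2 - sum_x**2 / n
s_yy = sum_y2 - sum_y**2 / n
s_xy = sum_xy - sum_x * sum_y / n

# PMCC: r = S_xy / sqrt(S_xx * S_yy), always between -1 and +1.
r = s_xy / sqrt(s_xx * s_yy)
print(r)  # close to +1 here: strong positive correlation
```

Because this data rises steadily, \(r\) comes out close to \(+1\); swapping in randomly scattered \(y\) values would drive it towards 0.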

Did You Know? Correlation vs. Causation

A very important concept in statistics is the difference between correlation and causation.

Correlation means two variables move together.
Causation means one variable causes the change in the other.

Example: Ice cream sales and crime rates might show a strong positive correlation (\(r\) close to 1). Does eating ice cream cause crime? No! A lurking variable (high temperature/summer) causes both to increase.
Key Rule: Correlation does NOT imply causation.


3. Linear Regression: Finding the Line of Best Fit

The Purpose of Regression

If we establish a strong linear correlation, we want to create an equation that summarizes this relationship. This equation is called the Linear Regression Line, or the line of best fit. We use it to make predictions.

In S1, we focus on the regression line of \(y\) on \(x\). This line is used to predict the value of the response variable \(y\), given a specific value of the explanatory variable \(x\).

The Least Squares Regression Line

We use a method called Least Squares. This method finds the line that minimizes the sum of the squared vertical distances (called residuals) from every data point to the line. This gives us the "best" possible fit.

The equation for the line is:

$$ \hat{y} = a + bx $$

Where:

  • \(\hat{y}\) (read as "y-hat") is the predicted value of \(y\).
  • \(a\) is the y-intercept.
  • \(b\) is the gradient (slope) of the line.

Step-by-Step Calculation of \(a\) and \(b\)

To calculate \(a\) and \(b\), we first need to find three crucial summary statistics, often denoted by \(S_{xx}\), \(S_{yy}\), and \(S_{xy}\). These are closely related to the variance of each variable and the covariance between them.

Step 1: Calculate the S-Values (Summary Statistics)

The formulas for these S-values are provided in your formula booklet (or often calculated by your calculator). We use them based on the sums of \(x\), \(y\), \(x^2\), \(y^2\), and \(xy\).

$$ S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} $$

$$ S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} $$

$$ S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} $$

(Note: \(n\) is the number of data pairs.)

Step 2: Calculate the Gradient (\(b\))

The gradient \(b\) depends on how \(x\) and \(y\) vary together relative to how \(x\) varies by itself:

$$ b = \frac{S_{xy}}{S_{xx}} $$

Step 3: Calculate the Y-intercept (\(a\))

The regression line always passes through the mean point \((\bar{x}, \bar{y})\). We use this fact to find \(a\). (Remember: \(\bar{x} = \frac{\sum x}{n}\) and \(\bar{y} = \frac{\sum y}{n}\)).

$$ a = \bar{y} - b\bar{x} $$
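Putting Steps 1 to 3 together, here is a short Python sketch of the whole calculation (the data is invented for illustration):

```python
# Hypothetical data: hours studied (x) and exam score (y).
x = [1, 2, 3, 4, 5]
y = [52, 55, 61, 64, 68]

n = len(x)

# Step 1: the S-values from the formula booklet.
s_xx = sum(xi**2 for xi in x) - sum(x)**2 / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

# Step 2: the gradient b = S_xy / S_xx.
b = s_xy / s_xx

# Step 3: the intercept, using the fact that the line
# passes through the mean point (x-bar, y-bar).
x_bar, y_bar = sum(x) / n, sum(y) / n
a = y_bar - b * x_bar

print(f"y-hat = {a} + {b}x")
```

For this data \(S_{xx} = 10\) and \(S_{xy} = 41\), giving \(b = 4.1\) and \(a = 47.7\), so the fitted line is \(\hat{y} = 47.7 + 4.1x\).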

Common Mistake Alert! Always use \(S_{xx}\) in the denominator for calculating \(b\). If you accidentally use \(S_{yy}\), you are calculating the gradient for the line of \(x\) on \(y\), which is usually incorrect for S1 questions asking for the line of best fit for prediction!


4. Interpretation and Limitations

Interpreting the Gradient (\(b\))

The gradient \(b\) tells us the predicted change in the response variable \(y\) for every one-unit increase in the explanatory variable \(x\).

Example: If \(x\) is "hours spent studying" and \(y\) is "exam score", and \(b = 4.5\), then we interpret this as: "For every extra hour of study, the predicted exam score increases by 4.5 points."

Interpreting the Y-intercept (\(a\))

The y-intercept \(a\) is the predicted value of \(y\) when \(x=0\).

Caution: This interpretation is only sensible if it makes sense for \(x\) to be zero in a real-world context. If \(x\) is "Adult height" and the minimum height in your data set is 150 cm, saying that \(y\) (weight) is \(a\) when height is 0 cm is meaningless! Always check if \(x=0\) is within the data range.

Using the Line: Interpolation vs. Extrapolation

Once you have your equation \(\hat{y} = a + bx\), you can use it to make predictions.

1. Interpolation (Safe Prediction): This is making a prediction for an \(x\) value that lies within the range of the original data used to create the line. These predictions are usually reliable.

2. Extrapolation (Dangerous Prediction): This is making a prediction for an \(x\) value that lies outside the range of the original data.

Why is extrapolation dangerous? We assume the linear relationship continues forever, but in reality, the relationship might curve, flatten out, or change completely once we leave the data boundaries. You must always warn against extrapolation in the exam!
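One way to make the interpolation/extrapolation distinction concrete is to have a prediction helper refuse any \(x\) outside the range of the original data. A hypothetical sketch (the fitted values and data range are invented):

```python
def predict(x_value, a, b, x_min, x_max):
    """Predict y-hat = a + b * x, refusing to extrapolate.

    x_min and x_max are the smallest and largest x values in the
    original data used to fit the regression line.
    """
    if not (x_min <= x_value <= x_max):
        raise ValueError(
            f"x = {x_value} is outside the data range "
            f"[{x_min}, {x_max}]: extrapolation is unreliable."
        )
    return a + b * x_value

# Line fitted on data whose x values ran from 1 to 5.
a, b = 47.7, 4.1
print(predict(3, a, b, x_min=1, x_max=5))   # interpolation: fine
# predict(10, a, b, x_min=1, x_max=5)       # would raise: extrapolation
```

In an exam you would not write code, of course, but the same check applies on paper: before substituting into \(\hat{y} = a + bx\), confirm the \(x\) value lies inside the original data range.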

Key Takeaway: Regression is a powerful predictive tool, but its accuracy relies heavily on the strength of the correlation (\(r\)) and the avoidance of extrapolation.

UNIT S1 SUMMARY CHECKLIST: Correlation and Regression

  • Can I draw and interpret a scatter diagram?
  • Can I state and interpret the properties of the PMCC, \(r\)? (Range \(-1\) to \(+1\))
  • Do I know the relationship between correlation and causation? (They are different!)
  • Can I define and calculate the components \(S_{xx}\) and \(S_{xy}\)?
  • Can I calculate the regression line of \(y\) on \(x\): \(\hat{y} = a + bx\)?
  • Can I interpret the values of \(a\) and \(b\) in context?
  • Do I understand the risks of extrapolation?