Hello IGCSE Statistician! Studying Scatter Diagrams (C10.7 / E10.7)

Welcome to one of the most visual and practical topics in Statistics! In this chapter, we learn how to look at data and see if two different things are related. For example, does the amount of time you spend sleeping affect your exam score? Or does the price of ice cream change depending on the outside temperature?

Scatter diagrams are powerful tools because they let us visually determine the relationship (or lack thereof) between two variables. Don't worry if graphs aren't your favourite—this type of plotting is very straightforward!


Section 1: Drawing and Interpreting a Scatter Diagram

What is a Scatter Diagram?

A scatter diagram (or scatter graph) is a way to display bivariate data. Bivariate data simply means data involving two variables, like height and weight, or age and income.

We use a standard coordinate grid (like the ones you use for plotting linear graphs) to show these pairs of data.

Step-by-Step: How to Draw a Scatter Diagram
  1. Choose Your Axes: Decide which variable goes on which axis.
    • The Independent Variable (the one that causes the change, or the one you control, like time spent revising) goes on the horizontal axis (\(x\)-axis).
    • The Dependent Variable (the one that is affected, like your exam score) goes on the vertical axis (\(y\)-axis).
  2. Scale and Label: Label your axes clearly and choose a suitable scale so that all your data points fit nicely on the grid.
  3. Plot the Points: For every pair of data values, plot a single point on the graph.
    • Important Rule: Mark each plotted point clearly, typically as a small cross (\(\times\)), as specified in the syllabus notes. Avoid large dots, as they make it hard to read values accurately.

Analogy: Think of measuring the height and shoe size of everyone in your class. Each person is one pair of data, represented by one cross (\(\times\)) on the diagram.
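
The plotting steps above can be sketched in a few lines of Python. This is only a rough text-based illustration: the data values are hypothetical (hours revised and exam score for six students), and the helper name `text_scatter` is made up for this example. In the exam you plot by hand on a printed grid.

```python
# Hypothetical data: (hours revised, exam score) for six students.
data = [(1, 35), (2, 50), (3, 45), (4, 60), (5, 70), (6, 85)]

def text_scatter(points, x_max, y_max, y_step):
    """Return a rough text scatter diagram with one 'x' per data pair."""
    rows = []
    # Work from the top of the y-axis downwards so larger values appear higher.
    for y_low in range(y_max - y_step, -1, -y_step):
        cells = []
        for x in range(1, x_max + 1):
            # A cell is marked if a data pair has this x value and a y value
            # inside the band [y_low, y_low + y_step).
            hit = any(px == x and y_low <= py < y_low + y_step
                      for px, py in points)
            cells.append("x" if hit else ".")
        rows.append(f"{y_low:>3} | " + " ".join(cells))
    # Horizontal axis labels (the independent variable).
    rows.append("      " + " ".join(str(x) for x in range(1, x_max + 1)))
    return "\n".join(rows)

diagram = text_scatter(data, x_max=6, y_max=100, y_step=10)
print(diagram)
```

Each student contributes exactly one cross, with the independent variable (hours) running along the bottom and the dependent variable (score) running up the side.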

Key Takeaway (Section 1)

A scatter diagram plots bivariate data using small crosses (\(\times\)). The independent variable usually goes on the \(x\)-axis.


Section 2: Understanding Correlation

Once you have plotted your points, the fun begins! The shape formed by the points tells us about the relationship between the two variables. This relationship is called correlation.

Types of Correlation

There are three main types of correlation you must be able to recognise and describe:

1. Positive Correlation

  • Definition: As one variable increases, the other variable also tends to increase.
  • Visual Pattern: The points cluster roughly along a line sloping upwards (from the bottom left to the top right).
  • Example: The more hours you study, the higher your exam grade generally is.

2. Negative Correlation

  • Definition: As one variable increases, the other variable tends to decrease.
  • Visual Pattern: The points cluster roughly along a line sloping downwards (from the top left to the bottom right).
  • Example: The older a car is, the lower its resale value usually is.

3. Zero (or No) Correlation

  • Definition: There is no clear relationship between the two variables.
  • Visual Pattern: The points are randomly scattered all over the graph, showing no trend or direction.
  • Example: A student's hair colour compared to their mathematics grade.

The Strength of Correlation

We also need to describe how *strong* the relationship is. This refers to how closely the points cluster around the imaginary line of best fit.

  • Strong Correlation: The points are very close to forming a perfect straight line.
  • Weak Correlation: The points are spread out but still show a general direction (upwards or downwards).
  • Zero Correlation: No direction at all.

Memory Tip: Think of a straight road. If the points are the cars:
- Strong correlation: All cars are perfectly in their lanes.
- Weak correlation: Cars are mostly on the road, but some are drifting into the hard shoulder.
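
Direction and strength can also be expressed as a single number. The sketch below goes beyond what these notes require: it calculates the correlation coefficient \(r\), which lies between \(-1\) and \(1\). The data sets are hypothetical, and the cut-off values 0.8 and 0.3 used to label "strong" and "weak" are assumptions chosen for illustration, not syllabus rules.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r for paired data (between -1 and 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def describe(r, strong=0.8, weak=0.3):
    """Turn r into words. The cut-offs are illustrative assumptions."""
    if abs(r) < weak:
        return "zero (no) correlation"
    strength = "strong" if abs(r) >= strong else "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"

# Hypothetical data: hours revised against exam score.
hours = [1, 2, 3, 4, 5, 6]
scores = [35, 50, 45, 60, 70, 85]

# Hypothetical data: car age (years) against resale value (thousands).
ages = [1, 2, 3, 4, 5]
values = [18, 15, 13, 10, 8]

print(describe(pearson_r(hours, scores)))
print(describe(pearson_r(ages, values)))
```

A value of \(r\) close to \(1\) or \(-1\) means the points lie close to a straight line, and a value near \(0\) means there is no linear pattern.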

Common Pitfall

Do not confuse correlation with causation! Just because two variables are correlated doesn't mean one *causes* the other.
Example: In summer, ice cream sales and shark attacks both tend to rise. The two are positively correlated, but ice cream doesn't *cause* shark attacks. Both are linked to a third factor: hot weather.

Key Takeaway (Section 2)

Correlation describes the relationship between variables: Positive (both increase), Negative (one increases, one decreases), or Zero (no pattern). It can be strong or weak.


Section 3: The Line of Best Fit (LOBF)

If the points show a linear correlation (positive or negative, strong or weak), we draw a straight line that best represents the trend. This is called the Line of Best Fit (LOBF). If there is zero correlation, there is no trend to represent, so no line is drawn.

The LOBF is used for making predictions.

Drawing the Line of Best Fit (By Eye)

The line of best fit must be a single straight line drawn with a ruler. This is a skill tested in the exam, and there are specific rules to follow to ensure your line is accurate:

1. Follow the Trend: The line must clearly follow the pattern of the points (sloping up for positive, down for negative).

2. Balance the Points: Make sure there is a roughly even distribution of points above and below the line along its entire length. If you have 10 points, aim for 5 above and 5 below (a 4/6 split is also acceptable).

3. Extend Fully: The line should extend across the full data set. It should not stop halfway through your points.
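
The balance rule can be checked by counting. The sketch below is an illustration only: the data and the two candidate lines are hypothetical, and the helper name `balance` is made up for this example. It counts how many points lie above, below, and exactly on a line \(y = mx + c\).

```python
# Hypothetical data: (hours revised, exam score) for six students.
data = [(1, 35), (2, 50), (3, 45), (4, 60), (5, 70), (6, 85)]

def balance(points, gradient, intercept):
    """Count points above, below, and exactly on the line y = gradient*x + intercept."""
    above = sum(1 for x, y in points if y > gradient * x + intercept)
    below = sum(1 for x, y in points if y < gradient * x + intercept)
    on_line = len(points) - above - below
    return above, below, on_line

# A sensible candidate line: the points are split roughly evenly.
print(balance(data, 9, 25))   # (3, 2, 1)

# A poor candidate line: it sits above every point.
print(balance(data, 9, 40))   # (0, 6, 0)
```

The second line has the right slope but is drawn too high, so every point falls below it. Counting alone does not check that the balance holds along the whole length of the line, so you still need to judge that by eye.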

Did you know? Technically, the line of best fit should pass through the mean point \((\bar{x}, \bar{y})\) (the mean of the \(x\)-coordinates and the mean of the \(y\)-coordinates). While you usually draw it by eye, keeping the balance ensures it passes close to this central point!
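
As a worked illustration with hypothetical data (hours revised 1, 2, 3, 4, 5, 6 and exam scores 35, 50, 45, 60, 70, 85), the mean point is found by averaging each variable separately:

$$\bar{x} = \frac{1+2+3+4+5+6}{6} = 3.5, \qquad \bar{y} = \frac{35+50+45+60+70+85}{6} = 57.5$$

For this data, a well-drawn line of best fit would pass through, or very close to, the point \((3.5, 57.5)\).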

Using the Line of Best Fit for Predictions

Once the LOBF is drawn, you can use it to estimate values you don't have data for. This is called interpolation or extrapolation, depending on whether the value you are estimating lies inside or outside the range of your data.

1. Interpolation

This is when you use your line to make predictions within the range of your original data points.

  • Example: If your scatter diagram shows the height of 10- to 15-year-olds, interpolating means estimating the height of a 12-year-old.
  • Reliability: Interpolation is usually reliable because the line is based on existing data in that area.

2. Extrapolation

This is when you extend your line (if necessary) to make predictions outside the range of your original data points.

  • Example: Using the data of 10- to 15-year-olds to predict the height of a 25-year-old.
  • Reliability: Extrapolation is often unreliable! The relationship might change drastically outside the range you measured (e.g., people stop growing taller eventually).
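
The difference between the two comes down to a simple range check. The sketch below uses the ages from the examples above. The function name `prediction_type` is made up for illustration.

```python
def prediction_type(known_x, new_x):
    """Classify a prediction by whether new_x lies inside the range of the data."""
    if min(known_x) <= new_x <= max(known_x):
        return "interpolation"
    return "extrapolation"

# Ages (in years) of the people in the original data set.
ages = [10, 11, 12, 13, 14, 15]

print(prediction_type(ages, 12))   # interpolation: 12 is between 10 and 15
print(prediction_type(ages, 25))   # extrapolation: 25 is outside the data range
```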

Quick Review: Line of Best Fit Rules

1. Ruled Line? Yes, always straight.
2. Balanced? Yes, roughly equal points above and below.
3. Full Coverage? Yes, extends across the entire plot range.

Key Takeaway (Section 3)

The LOBF is a straight ruled line that balances the data points. Use it for reliable predictions (interpolation) within the data range, but be cautious of unreliable predictions (extrapolation) outside the range.


Section 4: Extended Content (E10.7.4) – Linear Regression

For Extended students (and often in practical applications), drawing the LOBF by eye can be subjective. To get the most accurate line possible, we use a calculated method called Linear Regression.

Finding the Equation of Linear Regression using a Graphic Display Calculator (GDC)

The GDC helps you find the exact equation of the line of best fit. This equation is usually given in the form:

$$y = ax + b$$

where \(a\) is the gradient and \(b\) is the \(y\)-intercept.

Using your GDC, you input all the paired data points (\(x\), \(y\)). The calculator then finds the values of \(a\) and \(b\) that minimise the sum of the squared vertical distances between the data points and the line (the least-squares method). This produces the statistically optimal line of best fit.
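
For the curious, the sketch below shows roughly what happens inside the calculator, assuming it uses the standard least-squares formulas. You are not expected to do this by hand. The data is hypothetical and the function name `linear_regression` is chosen for this example.

```python
def linear_regression(xs, ys):
    """Least-squares values of a (gradient) and b (y-intercept) for y = ax + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Choosing b this way forces the line through the mean point.
    b = mean_y - a * mean_x
    return a, b

# Hypothetical data: hours revised against exam score.
hours = [1, 2, 3, 4, 5, 6]
scores = [35, 50, 45, 60, 70, 85]

a, b = linear_regression(hours, scores)
print(f"y = {a:.2f}x + {b:.2f}")
```

For this data the gradient is about 9.29 and the intercept is 25, and the resulting line passes through the mean point \((3.5, 57.5)\).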

Why use the Equation?

Once you have the equation \(y = ax + b\), you can:

  • Predict accurately: Instead of reading a value off a hand-drawn line, you can substitute an \(x\) value into the equation to calculate the predicted \(y\) value directly.
  • Interpret the slope (\(a\)): If \(a = 3\), it means that for every 1 unit increase in \(x\), the predicted value of \(y\) increases by 3 units.
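
As a worked illustration, suppose (hypothetically) that the GDC gives the equation \(y = 3x + 10\). Substituting \(x = 4\) and then \(x = 5\) gives:

$$y = 3(4) + 10 = 22, \qquad y = 3(5) + 10 = 25$$

Increasing \(x\) by 1 unit (from 4 to 5) raises the predicted \(y\) value by 3 units (from 22 to 25), which matches the gradient \(a = 3\).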

Note: You are expected to use your GDC to find this equation and then use it to make predictions. You are not required to manually calculate the linear regression formula.

Key Takeaway (Section 4)

Extended students must use the GDC to find the accurate equation of the linear regression line (\(y = ax + b\)), which is the statistically best line of best fit, and use it for predictions.