👋 Welcome to the World of Scatter Diagrams!

Hi there! This chapter is all about spotting relationships in data. Don't worry if statistics sometimes feels abstract: we are essentially learning to be detectives who look for patterns! Scatter diagrams are one of the most visual and straightforward tools in your statistics toolkit.

What you will learn:

  • How to plot data onto a scatter diagram.
  • How to recognize and describe the relationship (correlation) between two variables.
  • How to draw and use the line of best fit to make predictions.

1. Plotting and Interpreting Scatter Diagrams (C10.7.1)

A scatter diagram (or scatter plot) is a graph used to show the relationship between two variables. We deal with what is called bivariate data (data involving two variables).

The Axes: Independent and Dependent Variables

When you plot a scatter diagram, you need to decide which variable goes on which axis:

  • Independent Variable (x-axis): This is the variable you think might influence the other. We often control or measure this first. (Example: Time spent studying.)
  • Dependent Variable (y-axis): This is the variable that changes in response to the independent variable. (Example: Exam score achieved.)

Each pair of measurements (e.g., one student's study time and their score) becomes a single point plotted on the graph.
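
As a small illustration, the pairing of measurements into points can be sketched in Python. The study times and scores below are made-up values, not data from this chapter, and the names `hours`, `scores`, and `points` are assumptions chosen for the example.

```python
# Hypothetical data: study time in hours (x) and exam score (y) for five students.
hours = [1, 2, 3, 4, 5]
scores = [12, 16, 17, 21, 24]

# Each student contributes one (x, y) pair, which becomes one cross on the diagram.
points = list(zip(hours, scores))

for x, y in points:
    print(f"Plot a cross at ({x}, {y})")
```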

⚠️ How to Draw Scatter Diagrams (Syllabus Requirement)

The syllabus requires specific plotting conventions:

1. Plotting Points: The plotted points should be clearly marked, for example as small crosses (x).

2. Scale and Labels: Ensure your axes are labelled clearly with the variable name and scale.

Quick Review: Think of a scatter diagram as showing all your individual data points at once, allowing you to visually see if they form a cloud or a clear trend.


2. Understanding Correlation (C10.7.2)

Correlation describes the relationship, or link, between the two variables shown on a scatter diagram. It tells you *how* the variables are connected—if they increase together, if one decreases as the other increases, or if there is no connection at all.

Types of Correlation

There are three main types of linear correlation you need to recognize:

1. Positive Correlation

  • Description: As the independent variable (x) increases, the dependent variable (y) also increases. The points generally trend upwards from left to right.
  • Real-World Analogy: The hotter the weather, the more ice cream sales there are. (Both increase.)

2. Negative Correlation

  • Description: As the independent variable (x) increases, the dependent variable (y) decreases. The points generally trend downwards from left to right.
  • Real-World Analogy: The older a car gets, the lower its value becomes. (One increases, the other decreases.)

3. Zero (or No) Correlation

  • Description: There is no apparent relationship between the two variables. The points are scattered randomly across the diagram, forming a shapeless cloud.
  • Real-World Analogy: The relationship between a person's height and their favourite colour. (No connection.)

Strength of Correlation

We also describe correlation by its strength: Strong, Moderate, or Weak. Perfect correlation is the extreme case.

  • Perfect: All points lie exactly on a straight line (rare in real data).
  • Strong: The points lie very close to a straight line.
  • Moderate: The points show a clear trend but are spread out more loosely around the potential line.
  • Weak: A trend is barely detectable; the points are widely scattered, but a general direction (positive or negative) can still be suggested.

Did you know?

In statistics, correlation does not necessarily mean causation. Just because two things happen together (like high ice cream sales and high crime rates in summer), it doesn't mean one causes the other (it's often a third factor, like temperature, causing both).

Important Syllabus Note: You are required only to describe correlation (positive, negative, zero, and its strength). The numerical value (the coefficient of correlation) is not required in this syllabus.

Key Takeaway: Correlation is about direction and closeness. Positive is up, Negative is down, Zero is a cloud. The tighter the points, the stronger the correlation.


3. The Line of Best Fit (LOBF) (C10.7.3)

The line of best fit is a single straight line drawn through the middle of the scatter plot to summarize the relationship between the variables. It helps us make reasonable predictions.

Rules for Drawing the Line of Best Fit (by eye)

Drawing the LOBF accurately is crucial for getting marks. It must be drawn using a single ruled line that satisfies these conditions:

  1. It must pass through the Mean Point.
  2. It should extend across the full data set.
  3. There should be a roughly even distribution of points either side of the line over its entire length.

Step 1: Calculate the Mean Point \((\bar{x}, \bar{y})\)

To draw the line "by eye" as accurately as possible, make it pass through the mean point (also called the centroid).

  • Calculate the mean of all x values: \(\bar{x} = \frac{\sum x}{n}\)
  • Calculate the mean of all y values: \(\bar{y} = \frac{\sum y}{n}\)
  • The mean point is \((\bar{x}, \bar{y})\). Plot this point clearly (often using a circle or a different symbol).

Analogy: The mean point is like the 'centre of balance' for your data cloud. Your ruler must pivot through this exact point.
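
The two mean calculations above can be sketched in Python as a quick check of your working. The data is made up for illustration, and the function name `mean_point` is an assumption, not something defined by the syllabus.

```python
def mean_point(xs, ys):
    """Return the mean point (x-bar, y-bar) of paired data."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and the same length")
    n = len(xs)
    # x-bar = (sum of x) / n and y-bar = (sum of y) / n
    return sum(xs) / n, sum(ys) / n


# Hypothetical data: hours studied (x) and exam score (y).
hours = [1, 2, 3, 4, 5]
scores = [12, 16, 17, 21, 24]

x_bar, y_bar = mean_point(hours, scores)
print(f"Mean point: ({x_bar}, {y_bar})")
```

For this made-up data, \(\sum x = 15\) and \(\sum y = 90\) with \(n = 5\), so the mean point is \((3, 18)\).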

Step 2: Position and Draw the Line

  • Place your ruler on the graph so that it passes through the mean point \((\bar{x}, \bar{y})\).
  • Adjust the angle of the ruler until you have roughly the same number of data points above the line as below it, all the way along its length.
  • Ensure the line extends from the minimum x-value shown on the graph to the maximum x-value shown (or across the full grid).
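
The "same number above as below" check can also be sketched in code. This is only an illustration of the idea, assuming a trial line with a gradient you have chosen that passes through the mean point. The function name `count_sides` and the data are made up for the example.

```python
def count_sides(xs, ys, gradient):
    """Count points above, below, and on a trial line through the mean point."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    above = below = on = 0
    for x, y in zip(xs, ys):
        # Height of the trial line at this x-value.
        line_y = y_bar + gradient * (x - x_bar)
        if y > line_y:
            above += 1
        elif y < line_y:
            below += 1
        else:
            on += 1
    return above, below, on


# Hypothetical data: hours studied (x) and exam score (y).
hours = [1, 2, 3, 4, 5]
scores = [12, 16, 17, 21, 24]

print(count_sides(hours, scores, 3))  # trial line with gradient 3
print(count_sides(hours, scores, 0))  # flat trial line
```

With gradient 3 the counts are 1 above, 1 below, and 3 on the line. The flat line gives 2 above and 3 below, which looks balanced by count alone, but every point below it is on the left and every point above it is on the right. That is why the points need to be evenly spread along the whole length of the line, not just in total.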

Using the Line of Best Fit for Predictions

Once drawn, the LOBF allows you to estimate values you haven't measured:

  • Interpolation: Making a prediction within the range of the original data points. This is generally considered reliable.
  • Extrapolation: Making a prediction outside the range of the original data points (i.e., extending the line). This is less reliable because you assume the trend continues beyond the measured data.
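
A minimal sketch of the difference: a prediction is interpolation when the x-value lies between the smallest and largest measured x-values, and extrapolation otherwise. The function name `prediction_type` and the data are made up for illustration.

```python
def prediction_type(x_new, xs):
    """Classify a prediction at x_new relative to the measured x-values."""
    if min(xs) <= x_new <= max(xs):
        return "interpolation"  # inside the data range: generally reliable
    return "extrapolation"      # outside the data range: less reliable


# Hypothetical measured study times, in hours.
hours = [1, 2, 3, 4, 5]

print(prediction_type(3.5, hours))
print(prediction_type(8, hours))
```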

❌ Common Mistake to Avoid

Do not simply draw a line connecting the first and last plotted point! That is highly unlikely to represent the true trend of the data. The line should follow the overall trend of *all* the points, not just two of them.

Key Takeaway: The Line of Best Fit is an educated guess about the trend, centered precisely around the mean of the data.


4. Extended Content: The Equation of Linear Regression (E10.7.4)

Extended students must know how to use a graphic display calculator (GDC) to find a mathematically precise line of best fit, known as the Linear Regression Equation.

Drawing the LOBF "by eye" gives a good estimate, but the linear regression equation gives one uniquely defined line: the line that minimizes the sum of the squared vertical distances between the line and the data points (the least-squares line).

The Linear Regression Equation

The equation found by your GDC will typically be in the form of a straight line:
$$\mathbf{y = mx + c} \quad \text{or} \quad \mathbf{y = ax + b}$$

Where:

  • m (or a) is the gradient (slope) of the line, representing the rate of change.
  • c (or b) is the y-intercept.

Using the Graphic Display Calculator (GDC)

Your GDC has built-in statistical functions to perform linear regression:

  1. Enter Data: Input your paired data values (x and y) into the statistics lists (L1 and L2).
  2. Select Regression: Choose the appropriate two-variable statistics calculation or "Linear Regression" mode (often labeled a + bx or mx + b).
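  3. Read Results: The calculator will provide the values for the gradient and the intercept. Check which form your calculator uses: in the form mx + b, m is the gradient and b is the intercept, but in the form a + bx, a is the intercept and b is the gradient.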
  4. Write the Equation: Substitute these values into the linear equation format.

Example: If your calculator gives you \(m = 2.5\) and \(c = 10\), the equation of the line of best fit is \(\mathbf{y = 2.5x + 10}\).
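
The GDC does this calculation for you, but it can help to see roughly what happens behind the screen. The sketch below assumes the calculator uses the standard least-squares formulas, \(m = S_{xy} / S_{xx}\) and \(c = \bar{y} - m\bar{x}\). These formulas are not given in this chapter, which only asks you to use the GDC. The data and the function name `linear_regression` are made up for illustration.

```python
def linear_regression(xs, ys):
    """Return (m, c) for the least-squares line y = mx + c."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired values")
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # S_xy: how x and y vary together; S_xx: how much x varies on its own.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x-values are equal, so no gradient can be found")
    m = s_xy / s_xx
    c = y_bar - m * x_bar  # forces the line through the mean point
    return m, c


# Hypothetical data: hours studied (x) and exam score (y).
hours = [1, 2, 3, 4, 5]
scores = [12, 16, 17, 21, 24]

m, c = linear_regression(hours, scores)
print(f"y = {m:.3g}x + {c:.3g}")
```

For this made-up data the result is approximately \(y = 2.9x + 9.3\). Substituting \(x = 3\) gives \(y = 18\), so the regression line passes through the mean point \((3, 18)\), the same point used when drawing the line by eye.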

Using the Equation for Prediction

Once you have the equation, you can make predictions more accurately than reading off a graph:

Example: If the equation is \(y = 2.5x + 10\), and you want to predict the score (y) for a student who studied for \(x = 5\) hours:
$$y = 2.5(5) + 10$$
$$y = 12.5 + 10$$
$$y = 22.5$$

💡 Tip for Using the Regression Equation

Round your final answer (usually to 3 significant figures, unless told otherwise) only at the end of the calculation. Keep the unrounded gradient and intercept from the GDC while you calculate; round them only when the question asks you to state the equation itself to a given accuracy.
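
To see why early rounding matters, here is a minimal sketch comparing the two approaches. The gradient and intercept values are hypothetical GDC outputs, and the helper names `round_sig` and `predict` are assumptions made for this example.

```python
from math import floor, log10


def round_sig(value, sig=3):
    """Round value to `sig` significant figures."""
    if value == 0:
        return 0.0
    # Number of decimal places needed to keep `sig` significant figures.
    digits = sig - 1 - floor(log10(abs(value)))
    return round(value, digits)


def predict(m, c, x):
    """Use the regression equation y = mx + c."""
    return m * x + c


# Hypothetical unrounded values read from a GDC.
m_full, c_full = 2.4638, 10.372
x = 25

# Recommended: calculate with full values, round only the final answer.
late = round_sig(predict(m_full, c_full, x))

# Not recommended: round the gradient and intercept before calculating.
early = round_sig(predict(round_sig(m_full), round_sig(c_full), x))

print(late, early)
```

With these made-up values, rounding at the end gives 72.0, while rounding the gradient and intercept first gives 71.9. The difference is small, but it can be enough to lose an accuracy mark.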

Key Takeaway (Extended): The linear regression equation is the mathematical version of the line of best fit, found quickly and precisely using the GDC.


📝 Quick Review Box: Scatter Diagrams

  • Purpose: Show the relationship (correlation) between two variables.
  • Drawing (Core/Extended): Plot points as small crosses (x).
  • Correlation: Described by Direction (Positive, Negative, Zero) and Strength (Weak, Moderate, Strong).
  • Line of Best Fit (LOBF): Must be a single ruled line passing through the mean point \((\bar{x}, \bar{y})\) with an even distribution of points on both sides.
  • Linear Regression (Extended Only): Use the GDC to find the precise equation of the LOBF (e.g., \(y = mx + c\)) for accurate prediction.

Keep practising drawing the LOBF accurately—it's often a high-scoring practical skill in the exam! You've got this!