📊 Interpreting Statistical Data: Your IGCSE Study Guide

Hey there, future statistician! Welcome to the chapter on Interpreting Statistical Data. This might sound intimidating, but statistics is simply the art of making sense of numbers. Data is everywhere—from tracking how many hours you sleep to analyzing exam scores globally—and knowing how to read and interpret it is one of the most useful skills you'll learn in mathematics.

In this section, we will learn how to organize raw data, calculate key values that summarize it (like averages), and use diagrams to see trends and relationships. We've broken down the steps to make sure you feel confident dealing with any type of data question!


1. Classifying and Tabulating Data

1.1 Types of Data: Discrete vs. Continuous (C10.3 / E10.3)

The first step in statistics is understanding what kind of numbers you are dealing with. Data falls mainly into two categories:

  • Discrete Data: This data can only take specific, fixed values, usually resulting from counting.
    Example: The number of students in a class (you can't have 25.5 students). The number of goals scored in a match.
  • Continuous Data: This data can take any value within a given range, usually resulting from measuring.
    Example: Height, weight, temperature, or time. If a ruler is accurate enough, a person's height could be 1.75 meters, 1.753 meters, or 1.7538 meters.

Quick Tip: If you have to count it, it's discrete. If you have to measure it, it's continuous.

1.2 Organizing Data (C10.1 / E10.1)

When you collect raw data, it’s usually messy. We use tables to make it organized and easier to read.

  • Tally Tables: Used to count the frequency of each item in a set. Remember, every fifth tally mark is drawn diagonally across the previous four, so the marks can be counted in groups of five.
  • Two-Way Tables: These tables are fantastic for showing the relationship between two different variables.
    Example: Showing the relationship between gender (male/female) and favorite subject (Math/Science).
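
If you like checking your tables with a computer, here is a minimal Python sketch of a tally table and a two-way table. The survey responses are made up for illustration and are not from any real data set.

```python
from collections import Counter

# Hypothetical survey responses: (gender, favorite subject) pairs.
responses = [
    ("male", "Math"), ("female", "Science"), ("female", "Math"),
    ("male", "Science"), ("female", "Math"), ("male", "Math"),
]

# Tally table: frequency of each favorite subject.
subject_tally = Counter(subject for _, subject in responses)

# Two-way table: frequency of each (gender, subject) combination.
two_way = Counter(responses)

print("Tally table")
for subject, freq in sorted(subject_tally.items()):
    print(f"{subject:<8} {freq}")

print("Two-way table")
subjects = sorted(subject_tally)
genders = sorted({gender for gender, _ in responses})
print(f"{'':<8}" + "".join(f"{s:>9}" for s in subjects))
for g in genders:
    print(f"{g:<8}" + "".join(f"{two_way[(g, s)]:>9}" for s in subjects))
```

Each cell of the two-way table counts the responses that match both its row and its column, so the cells should add up to the total number of responses.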

Key Takeaway: Before interpreting data, know if it's discrete (counted) or continuous (measured), and organize it using tally or two-way tables.


2. Statistical Charts and Diagrams (C10.6 / E10.6)

Visualizing data helps us quickly spot trends and comparisons. You must be able to both draw and interpret these diagrams.

2.1 Common Diagrams

  • Bar Charts: Used for discrete or categorical data.
    • Bars must be the same width.
    • There must be gaps between the bars (unlike histograms, which are not required in this syllabus).
    • Composite (Stacked) Bar Charts: Show sub-categories stacked within the main bar.
    • Dual (Side-by-Side) Bar Charts: Show two sets of data next to each other for easy comparison (e.g., comparing boys' scores vs. girls' scores).
  • Pie Charts: Used to show proportions or percentages of a whole.
    • The total angle in a circle is \(360^{\circ}\).
    • To find the angle for a category:
      \[\text{Angle} = \frac{\text{Frequency of Category}}{\text{Total Frequency}} \times 360^{\circ}\]
  • Pictograms: Use images or symbols to represent data. You absolutely must include a Key to show what each symbol represents.
  • Stem-and-Leaf Diagrams: A quick way to show the shape of the data distribution while keeping the raw data values.
    • Data must be ordered (from smallest to largest leaf).
    • You must include a Key (e.g., \(2|5 = 25\)).
  • Simple Frequency Distributions: Basic tables listing categories/values and their frequencies.
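
The pie chart angle formula above can be checked with a short Python sketch. The function name `pie_angles` and the sports data are hypothetical, chosen only to show the arithmetic.

```python
def pie_angles(frequencies):
    """Return the pie-chart angle (in degrees) for each category."""
    total = sum(frequencies.values())
    # Angle = frequency / total frequency x 360 degrees
    return {category: freq * 360 / total for category, freq in frequencies.items()}

# Hypothetical data: favorite sport of 40 students.
sports = {"Football": 20, "Tennis": 10, "Swimming": 6, "Other": 4}
angles = pie_angles(sports)

for category, angle in angles.items():
    print(f"{category}: {angle:.0f} degrees")
```

A quick check worth doing by hand as well: the angles should always add up to \(360^{\circ}\).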

2.2 Drawing Inferences and Restrictions (C10.2 / E10.2)

Interpreting data means drawing a conclusion or making an inference based on the numbers you see.

The Golden Rule: Appreciate Restrictions!

Just because you have data doesn't mean your conclusions are always perfect. You must recognize the limitations:

  • Sample Size: If you surveyed only 10 students, you cannot confidently draw conclusions about the entire school. The sample is too small.
  • Bias: If you only surveyed people outside a gym about fitness habits, your data will be biased toward fit people.
  • Correlation vs. Causation: Just because two things happen together (correlation) doesn't mean one causes the other (causation).

Did you know? Comparing two data sets usually requires comparing both their average (to see where the middle is) AND their range/spread (to see how consistent the data is).

Key Takeaway: Diagrams help visualize proportions and trends. Always state conclusions carefully, remembering that small or biased data limits reliability.


3. Measures of Central Tendency (Averages) (C10.4 / E10.4)

Averages (or measures of central tendency) tell you the typical or center value of a data set. You need to know three main types, plus their purposes.

3.1 Calculating Averages for Individual Data (Core & Extended)

This applies when data is given in a simple list or a basic frequency table (not grouped).

  1. Mode: The value that occurs most often (the most frequent).
    Purpose: The only average that can be used for non-numerical (categorical) data, like favorite colours.
    Example: Data set: 1, 3, 3, 5, 6, 6, 6. Mode = 6.
  2. Median: The middle value when the data is arranged in order (ascending or descending).
    Step-by-Step:
    1. Order the data.
    2. Find the position of the median using the formula: \(\frac{n+1}{2}\)th value, where \(n\) is the number of data points.
    3. If \(n\) is odd, the median is a single middle value. If \(n\) is even, the median is the average of the two middle values.

    Purpose: Less affected by extreme outliers, making it a reliable measure for things like house prices or salaries.
  3. Mean: The sum of all values divided by the number of values (\(n\)).
    \[\text{Mean} = \bar{x} = \frac{\sum x}{n}\]
    Purpose: Uses every value in the data set and is the most commonly used average, but it can be distorted by extreme outliers.
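
As a rough check on the three averages, here is a small Python sketch using the standard `statistics` module. It reuses the data set from the mode example; the second list is a made-up example with an even number of values.

```python
from statistics import mean, median, multimode

scores = [1, 3, 3, 5, 6, 6, 6]   # data set from the mode example

modes = multimode(scores)        # list of the most frequent value(s)
middle = median(scores)          # middle value of the ordered data
average = mean(scores)           # sum of the values divided by n

# Position of the median in the ordered list: the (n + 1) / 2 th value.
n = len(scores)
median_position = (n + 1) / 2

# Hypothetical even-sized data set: the median is the mean of the two middle values.
even_scores = [2, 4, 7, 9]
even_median = median(even_scores)

print(modes, middle, round(average, 2), median_position, even_median)
```

Here the median sits at position 4 of the ordered list, and the even-sized list has its median halfway between 4 and 7.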

3.2 Using a Graphic Display Calculator (GDC) (C10.5 / E10.5)

Your GDC can quickly find the mean, median, and quartiles for discrete data. Make sure you know how to enter the data (especially if you are using a frequency list) and select the correct statistical calculation mode.

3.3 Estimating the Mean for Grouped Data (Extended Only: E10.4, E10.5)

For Extended students, you may encounter data organized into frequency groups (e.g., \(5 < \text{height} \leq 10\)). Since you don't know the exact values, you must estimate the mean.

Step-by-Step Estimation:

  1. Find the Midpoint (\(m\)) of each class interval. (This is the estimate for every value in that group.)
  2. Multiply the Midpoint by the Frequency (\(f\)): Calculate \(f \times m\) for each group.
  3. Sum the \(f \times m\) column (\(\sum fm\)) and sum the frequency column (\(\sum f\)).
  4. Calculate the Estimated Mean: \[\text{Estimated Mean} = \frac{\sum fm}{\sum f}\]

Important Note for Extended: You also need to be able to identify the Modal Class, which is simply the class interval with the highest frequency.
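
The estimation steps above can be sketched in Python as follows. The class intervals and frequencies are invented for illustration, and the modal class line assumes a single class has the highest frequency.

```python
# Hypothetical grouped data: (lower bound, upper bound, frequency).
groups = [(0, 5, 4), (5, 10, 10), (10, 15, 6)]

sum_fm = 0.0   # running total of f x m
sum_f = 0      # running total of f
for lower, upper, f in groups:
    m = (lower + upper) / 2    # Step 1: midpoint of the class interval
    sum_fm += f * m            # Steps 2 and 3: add f x m to the total
    sum_f += f

estimated_mean = sum_fm / sum_f                      # Step 4
modal_class = max(groups, key=lambda g: g[2])[:2]    # class with the highest frequency

print(estimated_mean, modal_class)
```

With these numbers the midpoints are 2.5, 7.5 and 12.5, so \(\sum fm = 10 + 75 + 75 = 160\) and \(\sum f = 20\), giving an estimated mean of 8.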

Key Takeaway: Mean uses all values, Median finds the center point (good for outliers), and Mode is the most common. For grouped data (Extended), always use the midpoint to estimate the mean.


4. Measures of Dispersion (Spread) (C10.4 / E10.4)

Measures of dispersion tell you how spread out or varied the data is. A small spread means the data is consistent; a large spread means it is varied.

4.1 Range and Quartiles

  1. Range: The simplest measure of spread.
    \[\text{Range} = \text{Maximum Value} - \text{Minimum Value}\]
    Purpose: Quick measure, but highly sensitive to extreme outliers.
  2. Quartiles: Values that divide the ordered data into four equal parts (quarters).
    • Lower Quartile (\(Q_1\)): The value that is \(\frac{1}{4}\) (or 25%) of the way through the data.
    • Median (\(Q_2\)): The value that is \(\frac{1}{2}\) (or 50%) of the way through the data.
    • Upper Quartile (\(Q_3\)): The value that is \(\frac{3}{4}\) (or 75%) of the way through the data.

    Note: The method for finding the position of the quartiles is similar to the median. For \(Q_1\), use \(\frac{1}{4} (n+1)\)th value, and for \(Q_3\), use \(\frac{3}{4} (n+1)\)th value.

  3. Interquartile Range (IQR): Measures the spread of the middle 50% of the data.
    \[\text{IQR} = Q_3 - Q_1\]
    Purpose: Excellent measure of consistency because it ignores the extreme values (outliers) at the very ends of the data set.
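
Here is a minimal Python sketch of the range, quartiles and IQR using the \((n+1)\) position rule from the note above. The helper `value_at_position` is a hypothetical name, and interpolating when a position is not a whole number is one common convention; your textbook or calculator may use a slightly different one. The sketch assumes at least 3 data values.

```python
def value_at_position(ordered, position):
    """Return the value at a 1-based position in an ordered list,
    interpolating between neighbours if the position is not a whole number."""
    whole = int(position)
    fraction = position - whole
    below = ordered[whole - 1]          # convert 1-based position to 0-based index
    if fraction == 0:
        return below
    above = ordered[whole]
    return below + fraction * (above - below)

data = sorted([7, 3, 9, 12, 5, 15, 10])   # hypothetical data set, n = 7
n = len(data)

q1 = value_at_position(data, (n + 1) / 4)        # lower quartile
q2 = value_at_position(data, (n + 1) / 2)        # median
q3 = value_at_position(data, 3 * (n + 1) / 4)    # upper quartile

data_range = data[-1] - data[0]   # maximum - minimum
iqr = q3 - q1                     # spread of the middle 50%

print(q1, q2, q3, data_range, iqr)
```

For this ordered list (3, 5, 7, 9, 10, 12, 15) the quartiles sit at positions 2, 4 and 6.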

Comparing Data Sets:

When asked to compare two data sets (e.g., scores from Class A and Class B), you must comment on:

  1. Central Tendency: Compare the Mean or Median. (E.g., "Class B has a higher mean score (45 vs 40), so they generally performed better.")
  2. Dispersion: Compare the Range or IQR. (E.g., "Class A has a smaller IQR (5 vs 12), so their scores were more consistent.")

Key Takeaway: The IQR is usually a more reliable measure of spread than the range because it shows how consistent the middle 50% of the data is, without being distorted by extreme minimum or maximum values.


5. Scatter Diagrams and Correlation (C10.7 / E10.7)

A scatter diagram shows the relationship (or correlation) between two variables, typically plotted on an x-y graph.

5.1 Drawing and Interpreting Scatter Diagrams

When plotting points, mark them clearly, usually as small crosses (\(\times\)).

  • Independent Variable: Plotted on the x-axis (the variable that doesn't depend on the other).
  • Dependent Variable: Plotted on the y-axis (the variable that might be affected by the other).

5.2 Understanding Correlation

Correlation describes the type of relationship seen in the data:

Correlation Type | Description | Graph Appearance
--- | --- | ---
Positive Correlation | As the independent variable (x) increases, the dependent variable (y) also increases. | Points generally slope upward from left to right.
Negative Correlation | As the independent variable (x) increases, the dependent variable (y) decreases. | Points generally slope downward from left to right.
Zero / No Correlation | There is no clear relationship between the variables. | Points are scattered randomly everywhere.

Important: The term coefficient of correlation is not required in this syllabus.

5.3 The Line of Best Fit (LOBF)

The LOBF is a single straight line drawn through the middle of the scatter points to summarize the trend. You must draw this line by eye, following these rules:

  1. It must be a single ruled line (use a ruler!).
  2. It must extend across the full data set.
  3. It should pass through the mean point (the point calculated using the mean of the x-values and the mean of the y-values).
  4. There should be a roughly even distribution of points either side of the line over its entire length.

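Using the Line: Once drawn, you can use the LOBF to make predictions about values not in the data set (this is called interpolation if within the data range, or extrapolation if outside). Treat extrapolated predictions with caution, because the trend may not continue beyond the data you collected.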

5.4 Linear Regression (Extended Only: E10.7.4)

For Extended students, you must use your Graphic Display Calculator (GDC) to find the equation of the straight line that best fits the data (the linear regression equation). This is usually given in the form \(y = ax + b\) or \(y = mx + c\).
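
To see roughly what the GDC is doing, here is a Python sketch that finds the mean point and a regression line in the form \(y = mx + c\). It assumes the least-squares method, which is the standard approach but is not named in this guide, and the revision data is made up.

```python
# Hypothetical paired data: hours of revision (x) and test score (y).
xs = [1, 2, 3, 4, 5]
ys = [52, 55, 61, 64, 68]

n = len(xs)
mean_x = sum(xs) / n    # x-coordinate of the mean point
mean_y = sum(ys) / n    # y-coordinate of the mean point

# Least-squares gradient (m) and intercept (c) for y = mx + c.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
m = s_xy / s_xx
c = mean_y - m * mean_x

# Interpolation: predict the score for 3.5 hours (inside the data range).
predicted = m * 3.5 + c

print(f"Mean point: ({mean_x}, {mean_y})")
print(f"y = {m:.1f}x + {c:.1f}")
```

Because \(c\) is calculated from the mean point, the regression line always passes through it, which matches rule 3 for drawing the LOBF by eye.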

Key Takeaway: Scatter diagrams show relationships. Use LOBF (passing through the mean point) to estimate trends. The closer the points are to the line, the stronger the correlation.


6. Cumulative Frequency (Extended Only: E10.8)

Cumulative frequency is used for grouped continuous data and helps us quickly find the median and quartiles graphically.

6.1 Cumulative Frequency Tables

Cumulative frequency means "running total." You create a column by adding the frequencies as you go down the list.

Plotting the Points:

  • Crucially, cumulative frequency is always plotted against the upper boundary of the class interval.
  • Plotted points should be clearly marked (e.g., small crosses, \(\times\)).
  • The points are joined with a smooth curve (an ogive).

Example: If the first class is \(10 \leq t < 20\) with a frequency of 5, the cumulative frequency (CF) of 5 is plotted against the upper boundary, which is \(t = 20\).
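
A running total is easy to sketch in Python with `itertools.accumulate`. The class intervals below are hypothetical, with a first class chosen to match the example above.

```python
from itertools import accumulate

# Hypothetical grouped data: (lower bound, upper bound, frequency).
classes = [(10, 20, 5), (20, 30, 8), (30, 40, 4), (40, 50, 3)]

upper_boundaries = [upper for _, upper, _ in classes]
cumulative = list(accumulate(f for _, _, f in classes))   # running total of frequencies

# Points to plot: (upper class boundary, cumulative frequency).
points = list(zip(upper_boundaries, cumulative))

for upper, cf in points:
    print(f"t < {upper}: cumulative frequency {cf}")
```

The last cumulative frequency should equal the total frequency, which is a useful check on your addition.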

6.2 Estimating Values from the Diagram

The total frequency (\(N\)) is the final cumulative frequency, which is the highest point reached by the curve.

  • Median (\(Q_2\)): Found at \(\frac{1}{2} N\) (50% of the total frequency).
  • Lower Quartile (\(Q_1\)): Found at \(\frac{1}{4} N\) (25% of the total frequency).
  • Upper Quartile (\(Q_3\)): Found at \(\frac{3}{4} N\) (75% of the total frequency).

You draw a horizontal line from the required CF value to the curve, and then drop vertically down to read the estimated data value on the x-axis.

Interquartile Range (IQR): As always, you can estimate the IQR by calculating \(Q_3 - Q_1\).

Percentiles: You can also estimate percentiles. For example, the 80th percentile is found by reading across from \(0.80 \times N\) on the cumulative frequency axis.
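
The reading-off process can be imitated in Python, as a rough sketch only. It joins the plotted points with straight lines instead of a smooth curve, so its answers will differ slightly from readings taken from a hand-drawn graph. The function name `estimate_from_cf` and the points are hypothetical, and the first point assumes the curve starts at a cumulative frequency of 0 at the lower boundary of the first class.

```python
def estimate_from_cf(points, target_cf):
    """Estimate the data value at a given cumulative frequency by
    interpolating along straight lines between the plotted points."""
    for (x0, cf0), (x1, cf1) in zip(points, points[1:]):
        if cf1 > cf0 and cf0 <= target_cf <= cf1:
            return x0 + (target_cf - cf0) / (cf1 - cf0) * (x1 - x0)
    raise ValueError("target_cf is outside the range of the curve")

# Hypothetical (class boundary, cumulative frequency) points.
points = [(10, 0), (20, 5), (30, 13), (40, 17), (50, 20)]
N = points[-1][1]    # total frequency = final cumulative frequency

median_est = estimate_from_cf(points, N / 2)       # read across from 1/2 N
q1_est = estimate_from_cf(points, N / 4)           # read across from 1/4 N
q3_est = estimate_from_cf(points, 3 * N / 4)       # read across from 3/4 N
iqr_est = q3_est - q1_est
p80_est = estimate_from_cf(points, 0.80 * N)       # 80th percentile

print(median_est, q1_est, q3_est, iqr_est, p80_est)
```

Each call does the same job as drawing a horizontal line to the curve and dropping down to the x-axis.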

Key Takeaway: Cumulative frequency is a running total plotted against the upper class boundary. Use the curve to estimate the median and quartiles, giving you a quick visual summary of the data distribution.