Statistics Chapter 1: Averages and Measures of Spread
Welcome to the chapter on Averages and Measures of Spread! In Statistics, we collect massive amounts of data, but raw numbers are messy. This chapter teaches you how to summarize that data using just a few powerful numbers.
Think of it like reading a book review instead of the whole book:
- Averages (Mean, Median, Mode) tell you the "typical" or central value of the data (the main plot).
- Measures of Spread (Range, IQR) tell you how spread out the data is (how varied the characters or events are).
Section 1: Measures of Central Tendency (Averages)
The three main types of averages (also called measures of central tendency) are the Mode, Median, and Mean.
1.1 The Mode (Most Frequent)
The Mode is the value that appears most often in a data set.
It is the easiest to find and can be used for any type of data (even non-numerical data like favourite colours).
Finding the Mode: Look for the highest frequency (count).
Example: Data set: 5, 8, 8, 10, 12.
The mode is 8.
Important Note:
- A data set can have more than one mode (bimodal, trimodal, etc.).
- If all values appear only once, there is no mode.
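The rules above can be sketched in a few lines of Python. This is only an illustration: the function name `modes` is made up for this example, and returning an empty list to mean "no mode" is an assumption of the sketch.

```python
from collections import Counter

def modes(data):
    """Return every mode in a list; an empty list means 'no mode'."""
    if not data:
        return []
    counts = Counter(data)            # value -> frequency
    highest = max(counts.values())    # the highest frequency
    # If every value appears only once, we treat the data as having no mode.
    if highest == 1:
        return []
    return sorted(value for value, count in counts.items() if count == highest)

print(modes([5, 8, 8, 10, 12]))   # the example from the notes
print(modes([1, 1, 2, 2, 3]))     # a bimodal data set
print(modes([4, 6, 9]))           # all values appear once
```

Because `Counter` only counts, the same function also works for non-numerical data such as a list of favourite colours.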
1.2 The Median (Middle Value)
The Median is the middle value when the data is arranged in order of magnitude (size).
Step-by-Step Guide for Individual Data:
- Order the data (from smallest to largest).
- Count the total number of data points, \(n\).
- Find the position of the median using the formula: Position \( = \frac{n + 1}{2} \).
- Locate the value in that position.
Case 1: Odd number of data points (n is odd)
Example: Data set: 12, 5, 10, 8, 15. (\(n=5\))
1. Ordered data: 5, 8, 10, 12, 15.
2. Position: \(\frac{5 + 1}{2} = 3\).
3. The median is the 3rd value: 10.
Case 2: Even number of data points (n is even)
Example: Data set: 5, 8, 10, 12. (\(n=4\))
1. Position: \(\frac{4 + 1}{2} = 2.5\). This means the median lies between the 2nd and 3rd values (8 and 10).
2. Calculate the median by finding the average of these two middle values:
Median \( = \frac{8 + 10}{2} = \mathbf{9}\).
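Both cases follow from the same position formula, as the short Python sketch below tries to show. The function name `median` is chosen for this example only; Python's built-in `statistics.median` does the same job.

```python
def median(data):
    """Median via the position formula (n + 1) / 2."""
    ordered = sorted(data)             # step 1: order the data
    n = len(ordered)                   # step 2: count the data points
    position = (n + 1) / 2             # step 3: 1-based position of the median
    if position.is_integer():          # n is odd: a single middle value
        return ordered[int(position) - 1]
    below = ordered[int(position) - 1] # n is even: the value just before the position
    above = ordered[int(position)]     # ... and the value just after it
    return (below + above) / 2

print(median([12, 5, 10, 8, 15]))  # Case 1 from the notes
print(median([5, 8, 10, 12]))      # Case 2 from the notes
```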
1.3 The Mean (The Calculated Average)
The Mean is the sum of all values divided by the number of values. It is the most common average, representing the mathematical center of the data.
Formula for Mean (\(\bar{x}\)):
$$ \bar{x} = \frac{\text{Sum of all values}}{\text{Number of values}} = \frac{\sum x}{n} $$
Example: Data set: 5, 8, 10, 12. (\(n=4\))
$$ \bar{x} = \frac{5 + 8 + 10 + 12}{4} = \frac{35}{4} = \mathbf{8.75} $$
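As a quick check of the formula, here is a minimal Python sketch. The function name `mean` is just a label for this example.

```python
def mean(data):
    """Sum of all values divided by the number of values."""
    return sum(data) / len(data)

print(mean([5, 8, 10, 12]))  # 35 divided by 4
```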
Quick Takeaway: Choosing the Right Average
Different averages are useful for different reasons:
- Mean: Uses all data points. Best when data is balanced, but easily skewed (pulled up or down) by outliers (extreme values).
- Median: Not affected by outliers. Best when data is highly skewed (e.g., house prices or salaries).
- Mode: Best for categorical data (e.g., which shoe size is most popular).
Memory Aid: Most Often (Mode), Middle (Median), Mathematical average (Mean).
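To see how an outlier pulls the mean but barely moves the median, the sketch below adds one extreme value to a small data set. The value 100 is invented purely for illustration.

```python
from statistics import mean, median

balanced = [5, 8, 10, 12]
with_outlier = balanced + [100]   # 100 is a made-up extreme value

# Mean and median of the balanced data are close together.
print(mean(balanced), median(balanced))

# After adding the outlier, the mean jumps but the median hardly changes.
print(mean(with_outlier), median(with_outlier))
```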
Section 2: Measures of Spread (Variability)
Measures of spread tell us how far apart the data values are from each other. Two data sets might have the same mean, but if one has a small spread and the other has a large spread, they tell very different stories!
2.1 The Range (Simple Spread)
The Range is the simplest measure of spread.
Formula:
$$ \text{Range} = \text{Largest value} - \text{Smallest value} $$
Example: Data set: 5, 8, 10, 12, 15.
Range \( = 15 - 5 = \mathbf{10}\).
Drawback: The Range is completely determined by the two extreme values, meaning it is highly sensitive to outliers.
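A small Python sketch makes the drawback visible. The function name `data_range` and the extra value 100 are assumptions made for this example.

```python
def data_range(data):
    """Largest value minus smallest value."""
    return max(data) - min(data)

scores = [5, 8, 10, 12, 15]
print(data_range(scores))           # the example from the notes

# One made-up extreme value is enough to change the range completely.
print(data_range(scores + [100]))
```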
2.2 Quartiles and the Interquartile Range (IQR)
To get a more robust measure of spread (one that is much less affected by outliers), we use quartiles. Quartiles divide the ordered data into four equal parts.
Quartiles:
- \(\mathbf{Q_1}\): The Lower Quartile (25% of the data is below this value). This is the median of the lower half.
- \(\mathbf{Q_2}\): The Median (50% of the data is below this value).
- \(\mathbf{Q_3}\): The Upper Quartile (75% of the data is below this value). This is the median of the upper half.
The Interquartile Range (IQR) measures the spread of the middle 50% of the data.
Formula for IQR:
$$ \text{IQR} = Q_3 - Q_1 $$
Finding Quartiles for Individual Data (Step-by-Step)
1. Order the data.
2. Find the Median (\(Q_2\)).
3. The data is now split into two halves (lower and upper).
4. \(Q_1\) is the median of the lower half.
5. \(Q_3\) is the median of the upper half.
Example 1: Data set (\(n=7\)): 2, 4, 6, 8, 10, 12, 14
- \(Q_2\) (Median): 8
- Lower half (excluding 8): 2, 4, 6. \(Q_1\) (middle of lower half) = 4.
- Upper half (excluding 8): 10, 12, 14. \(Q_3\) (middle of upper half) = 12.
- IQR \( = 12 - 4 = \mathbf{8}\).
Example 2: Data set (\(n=8\)): 1, 3, 5, 7, 9, 11, 13, 15
- \(Q_2\) (Median): Between 7 and 9. \(Q_2 = 8\).
- Lower half: 1, 3, 5, 7. \(Q_1\) (average of 3 and 5) = 4.
- Upper half: 9, 11, 13, 15. \(Q_3\) (average of 11 and 13) = 12.
- IQR \( = 12 - 4 = \mathbf{8}\).
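The "median of each half" method used in both examples can be sketched in Python as follows. The function names are invented for this illustration. Note that calculators and software libraries may use slightly different quartile conventions, so their answers will not always match this method.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in order."""
    n = len(ordered)
    mid = n // 2
    if n % 2:                                   # odd: one middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: average the middle two

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(data)                      # always order first
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # When n is odd, the median itself is left out of both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)

q1, q2, q3 = quartiles([2, 4, 6, 8, 10, 12, 14])   # Example 1
print(q1, q2, q3, "IQR =", q3 - q1)

q1, q2, q3 = quartiles([1, 3, 5, 7, 9, 11, 13, 15])  # Example 2
print(q1, q2, q3, "IQR =", q3 - q1)
```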
⚠ Common Mistake Alert ⚠
When finding quartiles, always make sure your data is ORDERED first. If you miss this step, all your quartile and median calculations will be wrong!
Section 3: Using Frequency Tables
When you have a large amount of discrete data, listing every value is impractical. We use frequency tables, where \(f\) is the frequency (how many times a value appears) and \(x\) is the data value.
3.1 Calculating Mean, Median, and Quartiles from a Frequency Table (Discrete Data)
1. Calculating the Mean:
Instead of adding up every single \(x\), we use the total frequency \(\sum f\) and the sum of (frequency times value) \(\sum fx\).
Formula for Mean from Frequency Table:
$$ \bar{x} = \frac{\sum fx}{\sum f} $$
Step-by-Step for Mean:
- Create a column for \(fx\) (multiply \(f\) by \(x\)).
- Sum the \(fx\) column (\(\sum fx\)).
- Sum the \(f\) column (\(\sum f\), this is \(n\)).
- Divide the sums.
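The four steps above can be sketched in Python. The frequency table used here is invented for illustration, and storing it as a dictionary (value mapped to frequency) is just one convenient choice.

```python
def mean_from_table(table):
    """Mean from a frequency table given as {value x: frequency f}."""
    sum_fx = sum(f * x for x, f in table.items())  # sum of the fx column
    sum_f = sum(table.values())                    # sum of the f column (this is n)
    return sum_fx / sum_f

# Made-up table: the value 1 appears 3 times, 2 appears 5 times, 3 appears 2 times.
table = {1: 3, 2: 5, 3: 2}
print(mean_from_table(table))   # (3 + 10 + 6) / 10
```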
2. Calculating the Median and Quartiles:
We still need to find the position: Position \( = \frac{n + 1}{2}\).
Use the cumulative frequency (running total of \(f\)) to locate the value in the table that holds that position.
Example: If \(\sum f = 50\).
Position of Median: \(\frac{50 + 1}{2} = 25.5\). We look for the value \(x\) that contains the 25th AND 26th data points. If they fall in different rows of the table, the median is the average of the two values.
Position of \(Q_1\): \(\frac{50 + 1}{4} = 12.75\). Rounding up, we look for the value \(x\) that contains the 13th data point.
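Here is a hedged Python sketch of the cumulative frequency lookup. The table is invented so that the total frequency is 50, matching the example; the helper name `value_at` is also made up. Rounding the \(Q_1\) position up to the 13th data point follows the example above.

```python
import math
from itertools import accumulate

def value_at(table, k):
    """Return the value x whose row holds the k-th ordered data point (1-based)."""
    values = sorted(table)
    running_totals = accumulate(table[x] for x in values)  # cumulative frequency
    for x, cf in zip(values, running_totals):
        if cf >= k:          # first row whose running total reaches position k
            return x
    raise ValueError("k is larger than the total frequency")

# Made-up table {value: frequency}; cumulative frequencies are 8, 20, 35, 45, 50.
table = {0: 8, 1: 12, 2: 15, 3: 10, 4: 5}
n = sum(table.values())

median_position = (n + 1) / 2                      # 25.5
below = value_at(table, math.floor(median_position))  # 25th data point
above = value_at(table, math.ceil(median_position))   # 26th data point
median = (below + above) / 2

q1 = value_at(table, math.ceil((n + 1) / 4))       # 13th data point

print(median, q1)
```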
Section 4: Working with Grouped Data (Extended Content)
Sometimes data is grouped into classes (e.g., 0-10, 10-20, etc.). When data is grouped, we lose the exact individual values. Therefore, we can only calculate estimates for the averages.
4.1 Estimating the Mean from Grouped Data
Since we don't know the exact values, we assume every data point in a class interval is located at the midpoint of that interval.
Step-by-Step for Estimated Mean:
- Find the midpoint (\(m\)) for each class interval.
$$ \text{Midpoint} = \frac{\text{Lower Boundary} + \text{Upper Boundary}}{2} $$
- Calculate the estimated total for each class: \(fm\) (frequency \(\times\) midpoint).
- Use the Mean Formula, replacing \(x\) with \(m\):
$$ \text{Estimated Mean} = \frac{\sum fm}{\sum f} $$
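The midpoint method can be sketched in Python as shown below. The class intervals and frequencies are invented for illustration, and each class is written as a (lower boundary, upper boundary) pair.

```python
def estimated_mean(classes, frequencies):
    """Estimate the mean of grouped data using class midpoints."""
    midpoints = [(lower + upper) / 2 for lower, upper in classes]
    sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))  # sum of f times m
    sum_f = sum(frequencies)
    return sum_fm / sum_f

# Made-up grouped data: classes 0-10, 10-20, 20-30.
classes = [(0, 10), (10, 20), (20, 30)]
frequencies = [4, 6, 10]

print(estimated_mean(classes, frequencies))   # (20 + 90 + 250) / 20
```

The result is only an estimate, because the real values inside each class are unknown.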
Did you know? Your Graphic Display Calculator (GDC) can calculate the mean directly from grouped frequency data by inputting the midpoints as the data values and the frequencies. (Syllabus C10.5/E10.5)
4.2 Identifying the Modal Class
When data is grouped, we cannot find the exact mode, but we can identify the Modal Class.
The modal class is simply the class interval that has the highest frequency (\(f\)).
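In code, finding the modal class is a search for the largest frequency. The sketch below uses invented classes and frequencies; if two classes tie, this particular version returns the first one it meets.

```python
def modal_class(classes, frequencies):
    """Return the class interval with the highest frequency."""
    best_index = max(range(len(frequencies)), key=lambda i: frequencies[i])
    return classes[best_index]

# Made-up grouped data: classes 0-10, 10-20, 20-30.
classes = [(0, 10), (10, 20), (20, 30)]
frequencies = [4, 6, 10]

print(modal_class(classes, frequencies))
```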
Section 5: Cumulative Frequency (Extended Content E10.8)
Cumulative frequency helps us find the median and quartiles quickly for grouped data.
5.1 What is Cumulative Frequency?
Cumulative Frequency (CF) is a running total of the frequencies. It tells you the total number of data values up to and including the end of a particular class interval.
Example: If the frequency of Class 1 is 10, and the frequency of Class 2 is 15, the CF for Class 2 is \(10 + 15 = 25\).
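A running total is easy to compute with Python's `itertools.accumulate`. This short sketch uses the two frequencies from the example above.

```python
from itertools import accumulate

frequencies = [10, 15]                      # Class 1 and Class 2
cumulative = list(accumulate(frequencies))  # running total of the frequencies

print(cumulative)
```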
5.2 Drawing and Interpreting the Cumulative Frequency Diagram (Ogive)
A cumulative frequency diagram is a graph that plots the CF against the data values.
Crucial Plotting Rule:
You must plot the Cumulative Frequency on the vertical axis against the Upper Boundary (or upper limit) of the class interval on the horizontal axis.
When joining the points, they should be joined with a smooth curve. (Remember to start the curve at the lower boundary of the first class with a CF of 0).
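The plotting rule can be sketched as a list of points to plot. The classes and frequencies below are invented, and the function name `cf_points` is just a label for this example.

```python
from itertools import accumulate

def cf_points(classes, frequencies):
    """Return (upper boundary, cumulative frequency) points for the CF diagram."""
    first_lower = classes[0][0]
    points = [(first_lower, 0)]             # start at the first lower boundary with CF 0
    for (_, upper), cf in zip(classes, accumulate(frequencies)):
        points.append((upper, cf))          # plot CF against the UPPER boundary
    return points

# Made-up grouped data: classes 0-10, 10-20, 20-30.
classes = [(0, 10), (10, 20), (20, 30)]
frequencies = [10, 15, 5]

print(cf_points(classes, frequencies))
```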
5.3 Estimating Measures of Averages and Spread from the CF Curve
If \(N\) is the total frequency, we can estimate key values by reading across and down from the graph:
1. The Median (\(Q_2\)):
Position: \(\frac{N}{2}\).
To find: Find \(\frac{N}{2}\) on the CF axis, read across to the curve, and then read down to the horizontal axis.
2. Quartiles (\(Q_1\) and \(Q_3\)):
Position of \(Q_1\): \(\frac{N}{4}\) (or 25% of \(N\)).
Position of \(Q_3\): \(\frac{3N}{4}\) (or 75% of \(N\)).
To find: Read across from these positions on the CF axis and down to the data axis.
3. Interquartile Range (IQR):
Calculate \( \text{IQR} = Q_3 - Q_1 \).
4. Percentiles:
A percentile represents the value below which a given percentage of observations in a group of observations falls.
Example: To find the 80th percentile, you find the position that is 80% of \(N\): \(0.80 \times N\). Read across and down from that value.
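Reading across and down can be imitated in Python. Important assumption: this sketch joins the plotted points with straight lines (linear interpolation) rather than a smooth hand-drawn curve, so its answers will differ slightly from a reading taken off your own graph. The points and the function name `read_from_cf` are invented for illustration.

```python
def read_from_cf(points, target_cf):
    """Estimate the data value at a given CF, using straight lines between points."""
    for (x0, cf0), (x1, cf1) in zip(points, points[1:]):
        if cf0 <= target_cf <= cf1:
            if cf1 == cf0:                   # flat segment: an empty class
                return x0
            fraction = (target_cf - cf0) / (cf1 - cf0)
            return x0 + fraction * (x1 - x0)
    raise ValueError("target_cf is outside the range of the diagram")

# Made-up (upper boundary, CF) points; the total frequency N is 30.
points = [(0, 0), (10, 10), (20, 25), (30, 30)]
N = points[-1][1]

median = read_from_cf(points, N / 2)
q1 = read_from_cf(points, N / 4)
q3 = read_from_cf(points, 3 * N / 4)
iqr = q3 - q1
p80 = read_from_cf(points, 0.80 * N)         # 80th percentile

print(median, q1, q3, iqr, p80)
```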
Key Takeaway: Cumulative Frequency
The CF curve allows us to estimate the middle values (Median, Quartiles, Percentiles) without using complex formulas for interpolation.
- Always plot CF against the Upper Boundary.
- The total size \(N\) is the last point on the CF axis.
- The IQR read from the CF curve is usually a more reliable measure of spread than the Range, because it is not affected by the extreme values at either end of the data.