Study Notes: Measures of Dispersion
Hello! Welcome to your study notes on Measures of Dispersion. Don't worry if that sounds complicated – it's just a fancy way of asking, "How spread out is the data?"
In this chapter, we'll learn how to describe and measure the 'spread' or 'consistency' of a set of numbers. This is super useful in real life, from comparing student test scores to analysing a basketball player's performance. Let's dive in!
What is Dispersion, Anyway?
Imagine two students, Alex and Ben, take five math quizzes. Their scores are:
Alex: 70, 72, 75, 73, 70
Ben: 50, 95, 60, 100, 55
If you calculate their average (mean) score, you'll find it's the same for both: 72. But are their performances really the same? Not at all!
Alex is very consistent. His scores are all clustered together. Ben's scores are all over the place – sometimes great, sometimes not so good. They are very spread out.
Dispersion is the measure of how spread out or scattered a set of data is. A low dispersion means the data points are close to each other (like Alex's scores). A high dispersion means they are far apart (like Ben's scores).
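To see this in numbers, here is a small Python sketch using Alex's and Ben's quiz scores from above. The variable names (alex, ben, alex_gap, ben_gap) are just illustrative labels chosen for this example.

```python
# Quiz scores taken from the example above.
alex = [70, 72, 75, 73, 70]
ben = [50, 95, 60, 100, 55]

# The mean is the total divided by the number of quizzes.
alex_mean = sum(alex) / len(alex)
ben_mean = sum(ben) / len(ben)

# The gap between the best and worst quiz gives a first feel for spread.
alex_gap = max(alex) - min(alex)
ben_gap = max(ben) - min(ben)

print(alex_mean, ben_mean)  # 72.0 72.0
print(alex_gap, ben_gap)    # 5 50
```

The means match, but Ben's best and worst quizzes are ten times further apart than Alex's.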
Key Takeaway
Dispersion tells us about the consistency or variability of data. It gives us a more complete picture than just looking at the average.
1. Simple Measures of Spread: Range and IQR
Let's start with the two simplest ways to measure spread.
Range
The range is the simplest measure of dispersion. It's just the difference between the highest and lowest values in a dataset.
Formula: Range = Maximum value - Minimum value
Step-by-Step Example:
Find the range of the scores: 12, 15, 7, 22, 18, 9
- Find the maximum value: 22
- Find the minimum value: 7
- Subtract: Range = 22 - 7 = 15
Strength: It's very easy to calculate!
Weakness: It depends only on the two most extreme values, so a single outlier can make it misleading. For example, in Ben's scores (50, 95, 60, 100, 55), the range is 100 - 50 = 50, but that number alone says nothing about how the other three scores are spread between 50 and 100.
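The range calculation can be sketched in a couple of lines of Python. The function name data_range is made up for this example, and the extra score of 100 is an invented outlier added only to show how sensitive the range is.

```python
def data_range(values):
    """Return the maximum value minus the minimum value."""
    return max(values) - min(values)

scores = [12, 15, 7, 22, 18, 9]
print(data_range(scores))  # 15

# Add one invented extreme score: the range jumps even though
# none of the original scores changed.
print(data_range(scores + [100]))  # 93
```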
Inter-Quartile Range (IQR)
The Inter-Quartile Range (IQR) is often a better measure of spread because it isn't affected by extreme outliers. It tells you the range of the middle 50% of the data.
To find the IQR, we first need to find the quartiles.
Quick Review: The Median
The median is the middle value of a dataset when it's arranged in order. It splits the data into two equal halves.
Quartiles work in a similar way, but they split the data into four equal parts.
- Lower Quartile (Q1): The median of the lower half of the data. (25% of the data is below it)
- Median (Q2): The median of the whole dataset. (50% of the data is below it)
- Upper Quartile (Q3): The median of the upper half of the data. (75% of the data is below it)
Formula: IQR = Upper Quartile (Q3) - Lower Quartile (Q1)
Step-by-Step Example (Odd number of data points):
Find the IQR of the data: 3, 6, 7, 10, 12, 15, 16
- Order the data: It's already in order! 3, 6, 7, 10, 12, 15, 16
- Find the Median (Q2): The middle number is 10.
- Find Q1: Look at the lower half of the data (the numbers before the median): 3, 6, 7. The middle number here is 6. So, Q1 = 6.
- Find Q3: Look at the upper half of the data (the numbers after the median): 12, 15, 16. The middle number here is 15. So, Q3 = 15.
- Calculate IQR: IQR = Q3 - Q1 = 15 - 6 = 9.
Step-by-Step Example (Even number of data points):
Find the IQR of the data: 2, 5, 6, 8, 11, 14, 16, 19
- Order the data: Already in order. 2, 5, 6, 8, 11, 14, 16, 19
- Find the Median (Q2): The middle is between 8 and 11. Median = (8 + 11) / 2 = 9.5.
- Find Q1: Look at the lower half: 2, 5, 6, 8. The middle is between 5 and 6. Q1 = (5 + 6) / 2 = 5.5.
- Find Q3: Look at the upper half: 11, 14, 16, 19. The middle is between 14 and 16. Q3 = (14 + 16) / 2 = 15.
- Calculate IQR: IQR = Q3 - Q1 = 15 - 5.5 = 9.5.
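Both worked examples can be checked with a short Python sketch of the "median of each half" method used in these notes. The function names quartiles and iqr are chosen for illustration. Other tools (for example statistics.quantiles in Python, or a spreadsheet) may use a different quartile convention and give slightly different answers for some datasets.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]              # values before the middle
    upper = data[half + (n % 2):]    # skip the median itself when n is odd
    return median(lower), median(data), median(upper)

def iqr(values):
    """Return Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1

print(iqr([3, 6, 7, 10, 12, 15, 16]))      # 9
print(iqr([2, 5, 6, 8, 11, 14, 16, 19]))   # 9.5
```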
Key Takeaway
Range gives a quick look at the total spread, but can be skewed by outliers. IQR measures the spread of the central 50% of the data and is more reliable when there are extreme values.
2. Visualising Dispersion: The Box-and-Whisker Diagram
A Box-and-Whisker Diagram (or box plot) is a fantastic way to see the dispersion of data at a glance. It is a visual representation of five key numbers:
The "Five-Number Summary":
- Minimum Value
- Lower Quartile (Q1)
- Median (Q2)
- Upper Quartile (Q3)
- Maximum Value
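As a rough sketch, the five numbers can be collected with a small Python helper. The name five_number_summary is made up for this example, and it assumes the same median-of-halves rule for quartiles that these notes use.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]              # lower half of the data
    upper = data[half + (n % 2):]    # upper half (median left out if n is odd)
    return (data[0], median(lower), median(data), median(upper), data[-1])

print(five_number_summary([3, 6, 7, 10, 12, 15, 16]))  # (3, 6, 10, 15, 16)
```

These five values are exactly the points you would mark on the scale before drawing the box and whiskers.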
How to read a Box Plot:
- The 'box' represents the middle 50% of the data (the IQR).
- The line inside the box is the median (Q2).
- The 'whiskers' extend from the box to the minimum and maximum values.
- A wider box means a larger IQR and more spread in the middle of the data.
- A shorter whisker means the data in that quarter is less spread out.
Comparing Distributions with Box Plots
This is where box plots really shine! Let's compare the test scores of Class A and Class B.
Imagine two box plots, one for Class A and one for Class B, drawn on the same scale.
- Comparing Medians: If Class B's median line is further to the right (higher value) than Class A's, it means the middle student in Class B scored higher than the middle student in Class A, so Class B typically performed better.
- Comparing Dispersion: If Class A's box is much narrower than Class B's box, it means the middle 50% of Class A's scores are more consistent (smaller IQR). If the distance from the tip of Class B's left whisker to the tip of its right whisker (the range) is much longer, it means their scores have a wider overall spread.
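The same comparison can be made with numbers instead of a picture. The sketch below uses invented scores for Class A and Class B (they are not from any real class) and a made-up helper called summarise that reports the median, IQR and range.

```python
from statistics import median

def summarise(values):
    """Return the median, IQR and range of a dataset as a dictionary."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])
    q3 = median(data[half + (n % 2):])
    return {"median": median(data), "iqr": q3 - q1, "range": data[-1] - data[0]}

# Invented scores, for illustration only.
class_a = [60, 62, 65, 66, 68, 70, 71]
class_b = [40, 55, 63, 72, 80, 88, 97]

a = summarise(class_a)
b = summarise(class_b)
print(a)  # {'median': 66, 'iqr': 8, 'range': 11}
print(b)  # {'median': 72, 'iqr': 33, 'range': 57}
```

With this made-up data, Class B has the higher median, but Class A has the smaller IQR and range, so Class A is the more consistent class.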
Key Takeaway
A Box-and-Whisker Diagram is a powerful visual tool. It shows the median, quartiles, and range all in one picture, making it easy to compare the spread of different datasets.
3. The Most Powerful Measure: Standard Deviation (σ)
Don't be scared by the name or the formula! The concept is simple. The standard deviation (SD) tells us roughly how far a typical data point is from the mean (the average) of the data.
A small SD means the data points are tightly clustered around the mean (high consistency).
A large SD means the data points are spread out over a wider range (low consistency).
Standard Deviation for Ungrouped Data
The formula for the population standard deviation is:
$$ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} $$
Let's break that down:
- $$ \sigma $$ (sigma) is the symbol for standard deviation.
- $$ \mu $$ (mu) is the symbol for the mean of the population.
- $$ x_i $$ represents each individual data value.
- $$ N $$ is the total number of data values.
- $$ \sum $$ (capital sigma, a different symbol from the lowercase $$ \sigma $$ above) means "sum up everything that follows".
And one more term: Variance is simply the standard deviation squared ($$\sigma^2$$). It's the value you get *before* you take the final square root.
Step-by-Step Calculation (Ungrouped Data):
Find the standard deviation of: 2, 4, 7, 8, 9
- Step 1: Find the mean ($$\mu$$).
$$ \mu = \frac{2+4+7+8+9}{5} = \frac{30}{5} = 6 $$
- Step 2: For each data point, subtract the mean and square the result.
$$(2 - 6)^2 = (-4)^2 = 16$$
$$(4 - 6)^2 = (-2)^2 = 4$$
$$(7 - 6)^2 = (1)^2 = 1$$
$$(8 - 6)^2 = (2)^2 = 4$$
$$(9 - 6)^2 = (3)^2 = 9$$
- Step 3: Find the mean of these squared differences (this is the Variance, $$\sigma^2$$).
$$ \text{Variance} = \sigma^2 = \frac{16+4+1+4+9}{5} = \frac{34}{5} = 6.8 $$
- Step 4: Take the square root to find the standard deviation ($$\sigma$$).
$$ \sigma = \sqrt{6.8} \approx 2.61 $$
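The four steps translate almost line by line into Python. This is a minimal sketch of the population formula; the function name population_sd is chosen for this example. Python's built-in statistics.pstdev should give the same result.

```python
from math import sqrt

def population_sd(values):
    """Population standard deviation, following Steps 1 to 4."""
    n = len(values)
    mu = sum(values) / n                              # Step 1: the mean
    squared_diffs = [(x - mu) ** 2 for x in values]   # Step 2: squared differences
    variance = sum(squared_diffs) / n                 # Step 3: the variance
    return sqrt(variance)                             # Step 4: the square root

data = [2, 4, 7, 8, 9]
print(round(population_sd(data), 2))  # 2.61
```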
Standard Deviation for Grouped Data
When data is in a frequency table, we use a slightly different formula. We use the class mark (mid-point) of each group as our 'x' value.
The formula is: $$ \sigma = \sqrt{\frac{\sum f_i(x_i - \mu)^2}{\sum f_i}} $$ where $$f_i$$ is the frequency of each class.
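The steps are similar, but each class counts $$f_i$$ times: multiply each class mark by its frequency when finding the mean, and multiply each squared difference by its frequency before adding them up. You can usually use the STAT mode on your calculator to find this much more quickly!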
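Here is a small Python sketch of the grouped formula. The frequency table is invented for illustration: three classes with class marks 5, 15 and 25 and frequencies 2, 5 and 3. The function name grouped_sd is also made up for this example.

```python
from math import sqrt

def grouped_sd(class_marks, frequencies):
    """Standard deviation from class marks x_i and frequencies f_i."""
    total = sum(frequencies)                                        # sum of f_i
    mu = sum(f * x for x, f in zip(class_marks, frequencies)) / total
    variance = sum(f * (x - mu) ** 2
                   for x, f in zip(class_marks, frequencies)) / total
    return sqrt(variance)

# Invented frequency table: class marks and how many values fall in each class.
class_marks = [5, 15, 25]
frequencies = [2, 5, 3]

print(grouped_sd(class_marks, frequencies))  # 7.0
```

For this table the mean is 16 and the variance is 490 / 10 = 49, so the standard deviation is 7.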
Key Takeaway
Standard deviation is the most detailed measure of spread because it uses every data value. It tells you roughly the typical distance from the mean. A low SD is "good" if you want consistency; a high SD means more variation.
4. Advanced Topics (Non-Foundation)
These concepts build on what we've learned and are incredibly useful for comparing data in more complex situations.
Standard Scores (z-scores)
How can you compare an apple and an orange? Or, more realistically, a great score on an easy test vs. a good score on a hard test? Use standard scores!
A z-score tells you exactly how many standard deviations a data point is away from the mean.
Formula: $$ z = \frac{x - \mu}{\sigma} $$
Example: You score 85 on a test where the mean ($$\mu$$) was 75 and the SD ($$\sigma$$) was 5. Your z-score is:
$$ z = \frac{85 - 75}{5} = \frac{10}{5} = 2 $$
This means your score was exactly 2 standard deviations above the average. A positive z-score is above the mean, a negative z-score is below the mean, and a z-score of 0 is exactly the mean.
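A z-score is a one-line calculation, sketched below in Python. The function name z_score is chosen for this example, and the second test (a score of 70 with mean 55 and SD 6) is invented to show how two different tests can be compared.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above the mean."""
    return (x - mu) / sigma

# The example from the notes: score 85, mean 75, SD 5.
print(z_score(85, 75, 5))  # 2.0

# An invented second test: score 70, mean 55, SD 6.
print(z_score(70, 55, 6))  # 2.5
```

Even though 70 is a lower raw score than 85, its z-score is higher, so it is the stronger result relative to the rest of the group.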
The Normal Distribution
Many things in the real world, like people's height or exam scores, tend to follow a pattern called the normal distribution. It looks like a symmetrical bell shape, often called a "bell curve".
In a normal distribution:
- The mean, median, and mode are all at the center.
- Most of the data is clustered around the mean.
- The further you get from the mean, the less data you find.
The standard deviation is key to understanding this. In a normal distribution, about 68% of the data falls within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3. (You are NOT required to memorize the exact percentages!)
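If you are curious, the share of data within a given number of standard deviations can be checked with Python's statistics.NormalDist (available from Python 3.8). This is only a sketch; the helper name share_within is made up for this example.

```python
from statistics import NormalDist

# A normal distribution with mean 0 and standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
```

The printed percentages grow quickly as k goes from 1 to 3, which is why values more than 3 standard deviations from the mean are so rare.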
Effects of Changing Data
What happens to our measures of dispersion if we change every piece of data in the same way?
Case 1: Adding a constant 'c' to every data value.
- Example: Add 10 to every score in a dataset.
- The whole dataset just shifts up. The spread does not change!
- Effect: The range, IQR, and standard deviation remain UNCHANGED.
Case 2: Multiplying every data value by a constant 'k'.
- Example: Double every score in a dataset (k=2).
- The data not only shifts, but it also stretches out. The spread increases. (If |k| is less than 1, the data is squeezed together instead and the spread decreases.)
- Effect: The original measures of spread are also multiplied by |k|.
- New Range = |k| × Old Range
- New IQR = |k| × Old IQR
- New Standard Deviation = |k| × Old Standard Deviation
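Both cases can be checked with a quick Python sketch. It reuses the dataset 2, 4, 7, 8, 9, adds c = 10, and multiplies by k = -2 (a negative value chosen on purpose to show why the rule uses |k|). The helper name data_range is made up for this example; statistics.pstdev is Python's built-in population standard deviation.

```python
from statistics import pstdev

def data_range(values):
    """Return the maximum value minus the minimum value."""
    return max(values) - min(values)

data = [2, 4, 7, 8, 9]
shifted = [x + 10 for x in data]   # Case 1: add c = 10
scaled = [x * -2 for x in data]    # Case 2: multiply by k = -2

# Range: unchanged by the shift, multiplied by |k| = 2 by the scaling.
print(data_range(data), data_range(shifted), data_range(scaled))  # 7 7 14

# Standard deviation follows the same pattern.
print(pstdev(data), pstdev(shifted), pstdev(scaled))
```

The last line should show the first two values equal and the third twice as large, even though k itself is negative.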
Key Takeaway
Standard scores help us make fair comparisons between different datasets. Data transformations have predictable effects: adding a constant changes nothing about the spread, while multiplying by a constant k scales the spread by |k|, the absolute value of that constant.
Did you know?
In finance, standard deviation is a key measure of risk. A stock with a high standard deviation in its price is considered volatile and risky, while one with a low SD is seen as more stable. Understanding dispersion can help you make smarter decisions!