AS & A Level Mathematics 9709 (P5) Study Notes: Representation of Data (5.1)
Welcome to the first chapter of Probability & Statistics 1! This section is all about turning raw, messy numbers into clear, insightful stories. Statistics starts here: if you can't visualize your data and understand its key features, you can't analyze it correctly.
Don't worry: this chapter uses visual aids and straightforward calculations. It’s the essential foundation for everything that comes next!
Section 1: Visualizing Data Distribution
1.1 Stem-and-Leaf Diagrams (Stemplots)
The stem-and-leaf diagram is a simple yet powerful way to organize data, especially smaller data sets.
- Advantage: It retains the original raw data values (unlike histograms or box plots).
- Structure: Data is split into a 'stem' (usually the first digit or digits) and a 'leaf' (usually the last digit).
- Crucial Requirement: Always include a Key to explain how the stem and leaf represent the actual number. (e.g., Key: 3 | 2 means 32).
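As a rough illustration of the structure described above, here is a minimal Python sketch that builds the rows of a stem-and-leaf diagram. It assumes two-digit whole numbers (stem = tens digit, leaf = units digit); the function name `stem_and_leaf` and the data are made up for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf diagram for two-digit whole numbers.

    Assumption: stem = tens digit, leaf = units digit (Key: 3 | 2 means 32).
    """
    rows = defaultdict(list)
    for v in sorted(values):          # sorting first keeps each row's leaves in order
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    # Include empty stems so gaps in the data stay visible.
    lines = []
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

scores = [32, 45, 38, 51, 47, 32, 59, 40]  # made-up data for illustration
diagram = stem_and_leaf(scores)
print("\n".join(diagram))
print("Key: 3 | 2 means 32")
```

Every original value can be read back from the rows, which is exactly the advantage noted above.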
Back-to-Back Stem-and-Leaf Diagrams
These are used specifically to compare two related data sets (e.g., test scores of Boys vs. Girls, or performance before vs. after a change).
- The common stem is placed in the middle.
- Leaves for one set extend to the right, and leaves for the other set extend to the left.
- Important Tip: When ordering the leaves on the left side, they must increase outwards from the stem.
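The layout rules above can be sketched in Python as follows. This is only an illustration under the same two-digit assumption (stem = tens digit); the function name `back_to_back` and both data sets are invented for the example.

```python
def back_to_back(left_values, right_values):
    """Return the rows of a back-to-back stem-and-leaf diagram (two-digit data)."""
    def leaves_by_stem(values):
        rows = {}
        for v in sorted(values):
            rows.setdefault(v // 10, []).append(v % 10)
        return rows

    left, right = leaves_by_stem(left_values), leaves_by_stem(right_values)
    all_stems = list(left) + list(right)
    stems = range(min(all_stems), max(all_stems) + 1)
    # Reverse each left-hand row so its leaves increase outwards from the stem.
    left_parts = [" ".join(str(d) for d in reversed(left.get(s, []))) for s in stems]
    right_parts = [" ".join(str(d) for d in right.get(s, [])) for s in stems]
    width = max(len(p) for p in left_parts)   # right-align the left-hand leaves
    return [
        f"{lp.rjust(width)} | {s} | {rp}".rstrip()
        for lp, s, rp in zip(left_parts, stems, right_parts)
    ]

boys = [42, 47, 51, 55, 58, 63]    # made-up scores
girls = [45, 52, 53, 61, 66, 69]   # made-up scores
rows = back_to_back(boys, girls)
print("\n".join(rows))
print("Key: 2 | 4 | 5 means 42 for boys and 45 for girls")
```

Note that the key for a back-to-back diagram has to explain both sides.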
Key Takeaway: Stem-and-leaf diagrams are best for comparison and when retaining the original data detail is important.
1.2 Box-and-Whisker Plots (Box Plots)
A box plot displays the five-number summary of a data set, giving an immediate view of its spread and skewness. Box plots are excellent for comparing distributions.
The five-number summary consists of:
- Minimum Value (Lowest whisker end)
- Lower Quartile (\(Q_1\)) (Start of the box)
- Median (\(Q_2\)) (Line inside the box)
- Upper Quartile (\(Q_3\)) (End of the box)
- Maximum Value (Highest whisker end)
Interpreting the Plot:
- The box represents the Interquartile Range (IQR), which contains the central 50% of the data. (IQR = \(Q_3 - Q_1\))
- Each of the four sections of the plot (the two whiskers and the two parts of the box) contains about 25% of the observations.
- If the median line is closer to \(Q_1\), the data is skewed positively (tail to the right). If it's closer to \(Q_3\), it's skewed negatively (tail to the left).
Quick Memory Aid: A box plot shows the quartiles and the extremes at a glance, but it does not show the individual data values.
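A minimal Python sketch of finding the five-number summary for a small ungrouped data set is given below. It assumes one common convention: \(Q_1\) and \(Q_3\) are the medians of the lower and upper halves, leaving out the middle value when \(n\) is odd. Other conventions exist and can give slightly different quartiles, so treat this as an illustration rather than the required method. The function name and data are made up.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    leaving out the middle value when n is odd. Other conventions exist.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]

marks = [12, 15, 17, 20, 22, 25, 28, 31, 40]  # made-up data
low, q1, q2, q3, high = five_number_summary(marks)
iqr = q3 - q1
print(low, q1, q2, q3, high)  # 12 16.0 22 29.5 40
print(iqr)                    # 13.5
```

Here the median (22) sits closer to \(Q_1\) (16) than to \(Q_3\) (29.5), which suggests a positive skew.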
1.3 Histograms (The Area Rule)
Histograms are used for continuous data (or discrete data grouped into intervals). They are arguably the topic in this chapter where students make the most mistakes.
The Golden Rule of Histograms:
The AREA of each bar represents (is proportional to) the FREQUENCY of that class, so the total area represents the total number of observations.
When the class intervals (width) are unequal, you must use Frequency Density for the height of the bar.
Formula:
\( \text{Frequency Density} = \frac{\text{Frequency}}{\text{Class Width}} \)
Example: If a class interval is 10-20 (width 10) and has a frequency of 50, the Frequency Density is \( \frac{50}{10} = 5 \). If the next class interval is 20-25 (width 5) and has a frequency of 30, the Frequency Density is \( \frac{30}{5} = 6 \).
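The calculation in the example above can be checked with a short Python sketch. The function name `frequency_densities` and the tuple layout (lower boundary, upper boundary, frequency) are choices made for this illustration.

```python
def frequency_densities(classes):
    """classes: list of (lower_boundary, upper_boundary, frequency)."""
    result = []
    for lower, upper, freq in classes:
        width = upper - lower
        result.append(freq / width)   # frequency density = frequency / class width
    return result

# The two classes from the example above.
classes = [(10, 20, 50), (20, 25, 30)]
densities = frequency_densities(classes)
print(densities)  # [5.0, 6.0]

# Check the area rule: height * width recovers each frequency.
areas = [fd * (upper - lower) for fd, (lower, upper, _) in zip(densities, classes)]
print(areas)  # [50.0, 30.0]
```

Although the second class has the smaller frequency, its bar is taller because the class is narrower.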
Did you know? If all class widths are equal, the frequency density is directly proportional to the frequency, so you can often just plot Frequency on the vertical axis (but using FD is always safer!).
Key Takeaway: For Histograms, think AREA = FREQUENCY. Always calculate Frequency Density if the class widths are not uniform.
1.4 Cumulative Frequency Graphs (Ogive)
A cumulative frequency graph (often called an Ogive) plots the running total of frequencies. It is essential for estimating positional measures.
- Calculation: You calculate the running total of frequency up to the end of each class interval.
- Plotting: Plot the Upper Class Boundary against the Cumulative Frequency. (Plotting against the upper boundary ensures you account for all data up to that point).
Estimating Positional Measures:
The total frequency \(n\) determines the position of the estimates:
- Median (\(Q_2\)): Locate \( \frac{n}{2} \) on the cumulative frequency axis, read across, and then down to the data axis.
- Lower Quartile (\(Q_1\)): Locate \( \frac{n}{4} \) (or 25% of \(n\)).
- Upper Quartile (\(Q_3\)): Locate \( \frac{3n}{4} \) (or 75% of \(n\)).
- Percentiles: The \(P\)th percentile is found at \( \frac{P \times n}{100} \) on the cumulative frequency axis.
- Proportions: You can estimate the number or proportion of the distribution above or below a certain value \(x\). (e.g., To find the number of people scoring above 60 marks, find the cumulative frequency at 60 and subtract this number from the total frequency \(n\).)
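The reading-off procedure above can be imitated numerically. The sketch below assumes the plotted points are joined with straight lines (a cumulative frequency polygon); a hand-drawn smooth curve would give slightly different estimates. The function names and the grouped data are invented for illustration.

```python
def cumulative_points(boundaries, frequencies):
    """Pair each class boundary with the cumulative frequency up to it.

    boundaries has one more entry than frequencies; the first point is
    (lowest boundary, 0).
    """
    points = [(boundaries[0], 0)]
    running = 0
    for upper, freq in zip(boundaries[1:], frequencies):
        running += freq
        points.append((upper, running))
    return points

def value_at(points, target_cf):
    """Data value at a given cumulative frequency, using straight-line segments."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target_cf <= c1 and c1 > c0:
            return x0 + (target_cf - c0) * (x1 - x0) / (c1 - c0)
    raise ValueError("target is outside the range of the graph")

boundaries = [0, 20, 40, 60, 80, 100]   # made-up class boundaries (marks)
frequencies = [5, 15, 30, 20, 10]       # made-up frequencies
points = cumulative_points(boundaries, frequencies)
n = points[-1][1]                       # total frequency, here 80

q1 = value_at(points, n / 4)            # 40.0
med = value_at(points, n / 2)           # about 53.3
q3 = value_at(points, 3 * n / 4)        # 70.0

# Number scoring above 60: total minus the cumulative frequency at 60.
above_60 = n - dict(points)[60]         # 80 - 50 = 30
print(q1, round(med, 1), q3, above_60)
```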
Key Takeaway: Cumulative frequency graphs deal with position and estimation, not direct data values. Always use upper class boundaries when plotting.
Section 2: Measures of Central Tendency (Averages)
Central tendency measures where the 'center' of the data lies. Choosing the right one depends on the nature of the data and the presence of extreme values (outliers).
2.1 Mode
The Mode is the value that occurs most frequently.
- Best Use: For qualitative or categorical data (e.g., favorite colours).
- Drawback: A data set can have no mode, or multiple modes (bimodal, multimodal).
2.2 Median
The Median is the middle value when the data is arranged in ascending order.
- Position: If there are \(n\) data points, the median is at position \( \frac{n+1}{2} \) in the ordered list. When \(n\) is even, this falls halfway between two values, so take their mean.
- Best Use: When the data contains outliers or is skewed, as the median is not affected by extreme values.
2.3 Mean (\(\bar{x}\))
The Mean is the arithmetic average. It is the most common measure of central tendency because it uses every piece of data.
Calculation Formulas:
a) Ungrouped Data:
\( \bar{x} = \frac{\sum x}{n} \)
b) Grouped Data: (Using class midpoints \(m\), or \(x\) as per the MF19 formula sheet, and frequency \(f\))
\( \bar{x} = \frac{\sum fx}{\sum f} \)
Best Use: For symmetrical distributions without severe outliers.
Key Takeaway: The Mean is sensitive to outliers; the Median is resistant to them. Always consider the data type when selecting an average.
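A quick way to see this difference is to add one extreme value to a small data set and recompute both averages. The sketch below uses Python's standard `statistics` module; the salary figures are made up.

```python
from statistics import mean, median

salaries = [21, 23, 24, 25, 27]     # made-up values, in thousands
with_outlier = salaries + [120]     # the same data with one extreme value added

print(mean(salaries), median(salaries))
print(mean(with_outlier), median(with_outlier))
```

The mean jumps from 24 to 40, while the median only moves from 24 to 24.5.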
Section 3: Measures of Variation (Spread)
Measures of variation tell us how spread out the data is. High variation means the values are widely scattered, so the data is less consistent.
3.1 Range and Interquartile Range (IQR)
- Range: Max value - Min value. (Simple, but heavily affected by outliers.)
- Interquartile Range (IQR): \( Q_3 - Q_1 \). (More robust than the range, as it only measures the spread of the central 50% of the data, ignoring extremes.)
Analogy: Think of the IQR as the "safe zone" where most of the predictable action happens.
3.2 Standard Deviation (\(\sigma\) or \(s\)) and Variance
The Standard Deviation (SD) measures the typical distance of any data point from the mean. It is the square root of the Variance.
In exams, you will mainly use the computational formulas, which work from the totals \( \sum x \) and \( \sum x^2 \) or their coded equivalents.
Formulas for Variance (MF19):
a) Ungrouped Data:
\( \text{Variance} = \frac{\sum x^2}{n} - \bar{x}^2 \)
\( \text{Standard Deviation} = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2} \)
b) Grouped Data:
\( \text{Variance} = \frac{\sum fx^2}{\sum f} - \bar{x}^2 \)
\( \text{Standard Deviation} = \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2} \)
(Note: For grouped data, \(x\) represents the midpoint of the class interval.)
Important Point: Examiners often provide \( \sum x \) and \( \sum x^2 \), or request you to calculate them. Be careful not to confuse \( (\sum x)^2 \) (sum of x, then squared) with \( \sum x^2 \) (x squared, then summed).
Common Mistake Alert: When dealing with grouped continuous data, ensure you use the correct class boundaries (e.g., if data is given as 10-19, the true boundaries are 9.5 to 19.5, and the midpoint is 14.5).
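The formulas above can be sketched in Python as follows. The function names and all data are invented for illustration. The grouped example assumes classes written as 10-19, 20-29 and 30-39, so the true boundaries are 9.5, 19.5, 29.5 and 39.5.

```python
from math import sqrt

def sd_from_totals(n, sum_x, sum_x2):
    """Mean and standard deviation from n, Σx and Σx² (ungrouped data)."""
    mean = sum_x / n
    variance = sum_x2 / n - mean ** 2
    return mean, sqrt(variance)

def grouped_mean_sd(classes):
    """classes: list of (lower_boundary, upper_boundary, frequency).

    Each class is represented by its midpoint, so the results are estimates.
    """
    total_f = sum(f for _, _, f in classes)
    sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    sum_fx2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in classes)
    mean = sum_fx / total_f
    variance = sum_fx2 / total_f - mean ** 2
    return mean, sqrt(variance)

# Ungrouped: x = 2, 4, 4, 6, 9 gives n = 5, Σx = 25, Σx² = 153.
mean_u, sd_u = sd_from_totals(5, 25, 153)
print(mean_u, round(sd_u, 3))   # 5.0 2.366

# Grouped: midpoints are 14.5, 24.5 and 34.5.
classes = [(9.5, 19.5, 4), (19.5, 29.5, 10), (29.5, 39.5, 6)]
mean_g, sd_g = grouped_mean_sd(classes)
print(round(mean_g, 3), round(sd_g, 3))   # 25.5 7.0
```

In the ungrouped example, \( (\sum x)^2 = 625 \) while \( \sum x^2 = 153 \), which shows why the two must not be confused.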
3.3 The Power of Coding Data
Sometimes, data values are very large or inconveniently sized. We use a linear transformation (coding) to simplify calculations.
Let the original variable be \(X\), and the coded variable be \(Y\), defined by:
\( Y = \frac{X - a}{b} \) or \( X = a + bY \)
Where \(a\) is the assumed mean (subtraction) and \(b\) is the scaling factor (division).
Effect of Coding on Mean and Standard Deviation:
1. Effect on Mean (\(\bar{x}\)):
- The mean is affected by both subtraction (a) and division (b).
- To decode the mean: \( \bar{x} = a + b\bar{y} \)
2. Effect on Standard Deviation (\(SD\)) and Variance:
- Subtraction (a) has NO effect on spread. Moving the whole data set doesn't change how spread out the points are.
- Division (b) affects spread. If you halve the values, the spread halves.
- To decode the SD: \( SD_x = b \times SD_y \) (taking \(b > 0\))
- To decode the Variance: \( \text{Var}(X) = b^2 \times \text{Var}(Y) \)
Memory Trick:
Mean/Average: Affected by Adding, Subtracting, Multiplying, Dividing (All operations).
Spread (SD/Variance): Only affected by Multiplying and Dividing. (Think: MAD, for Multiplying And Dividing.)
Key Takeaway: Coding simplifies calculations. Always remember to "un-code" your final mean and SD/Variance back into the original context of the problem using the transformation \( X = a + bY \).
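The decoding rules can be checked numerically. The sketch below codes a made-up data set with \(a = 1000\) and \(b = 10\) (both chosen purely for illustration), works out the mean and SD of the coded values, decodes them, and compares the results with a direct calculation on the original data.

```python
from math import sqrt

def mean_sd(values):
    """Mean and standard deviation using Σx²/n - mean²."""
    n = len(values)
    mean = sum(values) / n
    variance = sum(v * v for v in values) / n - mean ** 2
    return mean, sqrt(variance)

a, b = 1000, 10                       # illustrative coding constants
x = [1010, 1030, 1040, 1060, 1090]    # made-up original data
y = [(v - a) / b for v in x]          # coded values: 1, 3, 4, 6, 9

mean_y, sd_y = mean_sd(y)
mean_x_decoded = a + b * mean_y       # the mean needs both a and b
sd_x_decoded = b * sd_y               # the SD only needs b

mean_x, sd_x = mean_sd(x)             # direct calculation, for comparison
print(round(mean_x_decoded, 3), round(mean_x, 3))   # 1046.0 1046.0
print(round(sd_x_decoded, 3), round(sd_x, 3))       # 27.276 27.276
```

The coded values give \( \sum y = 23 \) and \( \sum y^2 = 143 \), which are far easier to handle by hand than the totals for the original data.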
Chapter Review: Essential Checklist
To ace Representation of Data, ensure you can do the following:
- Sketch/interpret all four main plots: Stem-and-Leaf, Box Plots, Histograms (using FD), and Cumulative Frequency Graphs.
- Estimate quartiles and percentiles from a cumulative frequency graph.
- Calculate the Mean and Standard Deviation for both ungrouped and grouped data using the computational formulas involving \( \sum x \) and \( \sum x^2 \).
- Handle and interpret data where coding has been used, correctly converting coded results back to original units.
Keep practicing your histogram calculations, as they are the most technical part of this chapter. You've got this!