Welcome to Unit S1: Representation and Summary of Data!

Hello future statistician! This chapter is your foundation for all things Statistics. We are moving beyond just looking at numbers; we are learning how to organize, visualize, and summarize huge datasets so that we can draw meaningful conclusions. Think of yourself as a data detective!

Don't worry if some terms seem intimidating at first. We will break down every concept, from drawing tricky histograms to calculating standard deviation, using simple steps and real-world examples. Let’s get started!

1. Understanding Data: Types and Collection

1.1 Types of Variables

When we collect data, we need to classify it. Variables are the characteristics we measure.

Quantitative Data: Data that involves numbers (quantities).

  • Discrete Data: Data that can only take specific, fixed values, usually whole numbers. It typically involves counting.
    Example: The number of cars passing a school (you can’t have 2.5 cars).
  • Continuous Data: Data that can take any value within a given range. It typically involves measuring.
    Example: Height, weight, or temperature.

Qualitative Data (Categorical): Data that describes quality or category, not measured numerically.

  • Example: Hair colour, type of car, or favourite ice cream flavour. (While important, S1 focuses heavily on quantitative data.)
✅ Quick Tip: Discrete vs. Continuous

If you need to count it, it's Discrete. If you need to measure it with a device (and could potentially add more decimal places), it's Continuous.

1.2 Methods of Data Collection: Census vs. Sample

How do we get the data we need?

  • Census: A census observes or measures every single member of a population.
    Advantage: In principle it gives the true population value (a parameter rather than an estimate), provided every member is measured correctly.
    Disadvantage: It is time-consuming, expensive, and often impractical or impossible.
  • Sample: A sample observes or measures a subset of the population.
    Advantage: It is quicker, cheaper, and easier to conduct.
    Disadvantage: It may not perfectly reflect the population (results are estimates).

Key Takeaway: Understanding data type (discrete/continuous) is crucial because it dictates which diagram (like histograms) or calculation method you must use.

2. Representing Data Visually

Once collected, data needs to be displayed clearly. We will focus on three major diagrams used in S1.

2.1 Stem and Leaf Diagrams

These are great for quickly viewing the shape of a small dataset while preserving the original values.

  • Structure: The 'stem' holds the larger place values (e.g., tens or hundreds), and the 'leaf' holds the final digit.
  • Rule: Leaves must always be listed in numerical order, and a key must be included.
    Example Key: 4 | 7 means 47.
  • Back-to-Back Stem and Leaf: Used to compare two datasets side-by-side, sharing a central stem.
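
The structure and ordering rules above can be sketched in a few lines of Python. This is only an illustration: the function name `stem_and_leaf` and the marks are made up, and it assumes non-negative whole numbers with the tens digit as the stem.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: ordered leaves}, using tens as the stem (hypothetical helper)."""
    plot = defaultdict(list)
    for v in sorted(values):          # sorting first keeps every row of leaves in order
        stem, leaf = divmod(v, 10)    # 47 -> stem 4, leaf 7
        plot[stem].append(leaf)
    return dict(plot)

marks = [47, 52, 41, 38, 55, 47, 60, 33]  # made-up data
diagram = stem_and_leaf(marks)

# Print every stem in the range, even one with no leaves, so the shape is not distorted.
for stem in range(min(diagram), max(diagram) + 1):
    leaves = " ".join(str(leaf) for leaf in diagram.get(stem, []))
    print(f"{stem} | {leaves}")
print("Key: 4 | 7 means 47")
```

Repeated values (here 47) appear as repeated leaves, which is how the diagram preserves the original data.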

2.2 Box Plots (Box and Whisker Diagrams)

A box plot displays the spread of the data and helps identify outliers. It is constructed using the Five-Number Summary.

The Five-Number Summary:

  1. Minimum Value (The end of the left whisker)
  2. Lower Quartile (\(Q_1\)) (The start of the box - 25% of data is below this)
  3. Median (\(Q_2\)) (The line inside the box - 50% of data is below this)
  4. Upper Quartile (\(Q_3\)) (The end of the box - 75% of data is below this)
  5. Maximum Value (The end of the right whisker)

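Each of the four sections (two whiskers and two box segments) holds roughly 25% of the data, regardless of how wide it looks. A wide section means those values are more spread out, not that there are more of them.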
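
As a rough sketch, the five-number summary can be computed like this. Quartile conventions differ between textbooks and calculators; the rule assumed here (work out \(n \times\) fraction, round up if it is not whole, otherwise average that value and the next) is one common choice for listed data, so check which rule your course expects. The function names and data are made up.

```python
import math

def quartile(ordered, fraction):
    """Quartile of already-sorted data using one common textbook rule (an assumption)."""
    position = len(ordered) * fraction
    if position == int(position):
        k = int(position)
        # Whole-number position: average the k-th and (k+1)-th values.
        return (ordered[k - 1] + ordered[k]) / 2
    # Otherwise round the position up and take that value.
    return ordered[math.ceil(position) - 1]

def five_number_summary(data):
    ordered = sorted(data)
    return (
        ordered[0],                 # minimum
        quartile(ordered, 0.25),    # Q1
        quartile(ordered, 0.5),     # Q2 (median)
        quartile(ordered, 0.75),    # Q3
        ordered[-1],                # maximum
    )

data = [21, 3, 14, 8, 30, 12, 41, 7, 25, 13, 18]  # made-up data, n = 11
print(five_number_summary(data))  # (3, 8, 14, 25, 41)
```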

2.3 Histograms: The Area Rule

This is often the trickiest representation. Histograms are used for continuous data, especially when class intervals (class widths) are unequal.

The Critical Difference: Unlike a bar chart (where height represents frequency), in a histogram, the area of the bar represents the frequency.

This means we cannot simply plot Frequency against Class Interval. We must calculate Frequency Density for the vertical axis.

Formula Alert!

$$ \text{Frequency Density} = \frac{\text{Frequency}}{\text{Class Width}} $$

Step-by-Step Guide to Drawing a Histogram:
  1. Add a column to your frequency table for the Class Width (Upper Boundary – Lower Boundary).
  2. Add a column for Frequency Density using the formula above.
  3. Plot Frequency Density on the vertical (y) axis.
  4. Plot the Class Boundaries on the horizontal (x) axis.
  5. Draw the bars. Remember there should be no gaps between bars (as the data is continuous).

Common Mistake to Avoid: Confusing Frequency Density with Frequency. If a question asks you to find the frequency from a histogram, you must calculate:
$$ \text{Frequency} = \text{Frequency Density} \times \text{Class Width} $$
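
Here is a minimal sketch of both calculations, going from frequency to frequency density and back again. The class boundaries and frequencies are made up for illustration, and `frequency_density` is a hypothetical helper name.

```python
# Each class is (lower boundary, upper boundary, frequency); note the unequal widths.
classes = [(0, 10, 8), (10, 15, 12), (15, 20, 15), (20, 40, 10)]

def frequency_density(lower, upper, frequency):
    """Frequency density = frequency / class width."""
    return frequency / (upper - lower)

densities = [frequency_density(lo, hi, f) for lo, hi, f in classes]

# Reading a histogram backwards: bar area (density x width) gives the frequency.
recovered = [d * (hi - lo) for d, (lo, hi, _) in zip(densities, classes)]

for (lo, hi, f), d in zip(classes, densities):
    print(f"{lo}-{hi}: width {hi - lo}, frequency {f}, density {d}")
```

The 20-40 class has a frequency of 10 but the lowest density (0.5), so its bar is the shortest even though it is not the least frequent class.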

Did you know? If all class widths are equal, every frequency is divided by the same number, so the histogram has exactly the same shape as a plot of frequency against class. Frequency density only changes the picture when the widths are unequal.

Key Takeaway: For Histograms, Area = Frequency. Always use Frequency Density on the y-axis, especially with unequal class widths.

3. Measures of Central Tendency (Averages)

Central Tendency measures where the 'middle' or 'typical' value of the dataset lies.

3.1 Mode, Median, and Mean

We use three main types of averages:

  1. Mode: The value that occurs most frequently.
    Best used for: Categorical (qualitative) data or to describe the most popular item.
  2. Median (\(Q_2\)): The middle value when the data is arranged in ascending order.
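    Position of the Median: If \(n\) is the number of data points, the median is the value at the \((\frac{n+1}{2})\)th position. If \(n\) is even, this position falls halfway between two values, so take the mean of those two middle values.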
    Best used for: Data containing extreme values (outliers), as it is less affected than the mean.
  3. Mean (\(\bar{x}\)): The sum of all values divided by the number of values. It is the most commonly used average.
    Formula for raw data: $$ \bar{x} = \frac{\sum x}{n} $$
    Best used for: Symmetrical data without extreme outliers.
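
To see how the three averages behave, here is a short sketch using Python's standard `statistics` module on made-up scores that include one extreme value.

```python
from statistics import mean, median, mode

scores = [4, 7, 7, 8, 9, 10, 39]  # made-up data; 39 is an extreme value

print(mean(scores))    # 12  (84 / 7, pulled up by the 39)
print(median(scores))  # 8   (the 4th of 7 ordered values)
print(mode(scores))    # 7   (appears twice)

# Removing the extreme value moves the mean a lot and the median only a little.
trimmed = scores[:-1]
print(mean(trimmed))    # 7.5
print(median(trimmed))  # 7.5 (mean of the two middle values, 7 and 8)
```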

3.2 Estimating Measures for Grouped Data

When data is presented in frequency tables with class intervals (e.g., 10-20, 20-30), we don't know the exact values, so we must estimate the Mean and Median.

Estimating the Mean

To calculate the mean from grouped data, we assume all the values within a class interval are represented by the midpoint (\(m\)) of that interval.

$$ \bar{x} \approx \frac{\sum (m \times f)}{\sum f} $$ Where \(m\) is the midpoint of the class, and \(f\) is the frequency.
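
A minimal sketch of the midpoint method, using a made-up grouped table. The name `estimate_grouped_mean` is hypothetical.

```python
# Each class is (lower boundary, upper boundary, frequency); made-up data.
classes = [(10, 20, 5), (20, 30, 12), (30, 40, 8)]

def estimate_grouped_mean(classes):
    """Estimate the mean as sum(m * f) / sum(f), where m is each class midpoint."""
    sum_mf = sum((lower + upper) / 2 * f for lower, upper, f in classes)
    sum_f = sum(f for _, _, f in classes)
    return sum_mf / sum_f

# Midpoints are 15, 25, 35, so sum(m * f) = 75 + 300 + 280 = 655 and sum(f) = 25.
print(estimate_grouped_mean(classes))
```

The result, 655 / 25 = 26.2, is an estimate because the true values inside each class are unknown.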

Estimating the Median (Linear Interpolation)

For grouped continuous data, we use linear interpolation to estimate the median (\(Q_2\)) and other quartiles (\(Q_1, Q_3\)).

The Interpolation Concept: We assume the data is spread evenly across the class that contains the median. We find the median position (typically \(\frac{n}{2}\) for grouped continuous data, although some textbooks use \((\frac{n+1}{2})\)), identify the class containing that position, and then work out by proportion how far into that class the median lies.

Analogy: If you know 50 people fit between 10m and 20m, and the median is the 25th person, the median is located halfway between 10 and 20 (i.e., at 15m). Interpolation formalizes this process.
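
One way to write the proportional reasoning is \( L + \frac{\text{position} - F}{f} \times w \), where \(L\) is the lower boundary of the class containing the position, \(F\) is the cumulative frequency before that class, \(f\) is its frequency, and \(w\) is its width. These symbols are introduced here for illustration. The sketch below applies this with the \(\frac{n}{2}\) convention to made-up data; `interpolate_quantile` is a hypothetical helper.

```python
def interpolate_quantile(classes, fraction):
    """Estimate a quantile from (lower, upper, frequency) classes by linear interpolation."""
    total = sum(f for _, _, f in classes)
    target = fraction * total            # e.g. n/2 for the median
    cumulative = 0                       # cumulative frequency before the current class
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= target:
            # Go the same proportion of the way through the class width.
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("fraction must be between 0 and 1")

classes = [(10, 20, 5), (20, 30, 12), (30, 40, 8)]   # made-up data, n = 25

# Median position 12.5 lies in the 20-30 class: 20 + (12.5 - 5) / 12 * 10
print(interpolate_quantile(classes, 0.5))    # 26.25
print(interpolate_quantile(classes, 0.75))   # 32.1875

# The analogy above: 50 people between 10 m and 20 m, median at the 25th.
print(interpolate_quantile([(10, 20, 50)], 0.5))  # 15.0
```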

Key Takeaway: The mean uses every data point but is sensitive to outliers. The median is barely affected by outliers but requires the data to be put in order. For grouped data, the results are estimates using midpoints (for the mean) or interpolation (for the quartiles).

4. Measures of Dispersion (Spread)

Dispersion measures how spread out the data is. Two datasets can have the same mean, but vastly different spreads!

4.1 Range and Interquartile Range (IQR)

  • Range: The difference between the highest and lowest values. $$ \text{Range} = \text{Maximum value} - \text{Minimum value} $$
    Issue: Highly sensitive to outliers.
  • Interquartile Range (IQR): The difference between the Upper Quartile (\(Q_3\)) and the Lower Quartile (\(Q_1\)). $$ \text{IQR} = Q_3 - Q_1 $$
    Advantage: It describes the spread of the middle 50% of the data and is resistant to outliers.
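
The difference in sensitivity is easy to see by changing a single value. The sketch below uses `statistics.quantiles` with `method="inclusive"`, which is just one of several quartile conventions and may not match the rule in your textbook for every dataset. The data and the name `range_and_iqr` are made up.

```python
from statistics import quantiles

def range_and_iqr(data):
    """Return (range, IQR) for a list of numbers."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return max(data) - min(data), q3 - q1

typical = [3, 7, 8, 12, 13, 14, 18, 21, 25]
with_outlier = [3, 7, 8, 12, 13, 14, 18, 21, 95]  # largest value replaced by an extreme one

print(range_and_iqr(typical))       # range 22, IQR 10
print(range_and_iqr(with_outlier))  # range 92, IQR still 10
```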

4.2 Variance and Standard Deviation

These are the most powerful measures of spread because they consider the distance of every single data point from the mean.

Variance (\(\sigma^2\)): The average of the squared distances from the mean.

Standard Deviation (\(\sigma\)): The square root of the variance. It is preferred because it is measured in the same units as the original data.

A small standard deviation means the data is tightly clustered around the mean.
A large standard deviation means the data is widely spread out.

The Calculation Formulas (Crucial for Exams!)

The calculation is almost always done using the computational formula, derived from the sum of squares, \(S_{xx}\).

1. Sum of Squares (\(S_{xx}\)): $$ S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} $$ (Note: If calculating from a frequency table, \(\sum x\) becomes \(\sum f x\), \(\sum x^2\) becomes \(\sum f x^2\), and \(n\) becomes \(\sum f\)).

2. Variance (\(\sigma^2\)): $$ \sigma^2 = \frac{S_{xx}}{n} $$

3. Standard Deviation (\(\sigma\)): $$ \sigma = \sqrt{\frac{S_{xx}}{n}} $$

⚠️ Memory Aid: Variance Formulas

Remember the structure of \(S_{xx}\): It’s the sum of the squares, minus the square of the sum divided by \(n\). Only the second term is divided by \(n\).

\(S_{xx}\) is often called the "numerator" of the variance calculation. Always calculate \(S_{xx}\) first!
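
The three formulas can be followed step by step in code. This is a sketch on made-up data, with `sum_of_squares` as a hypothetical helper name; it divides by \(n\), matching the formulas in this section.

```python
from math import sqrt

def sum_of_squares(data):
    """S_xx = sum(x^2) - (sum(x))^2 / n."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return sum_x2 - sum_x ** 2 / n

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data: sum(x) = 40, sum(x^2) = 232, n = 8

s_xx = sum_of_squares(data)       # 232 - 1600 / 8 = 32
variance = s_xx / len(data)       # 32 / 8 = 4
std_dev = sqrt(variance)          # 2

print(s_xx, variance, std_dev)
```

Python's `statistics.pvariance` and `statistics.pstdev` also divide by \(n\), so they can serve as a cross-check in the same way as a calculator's statistical mode.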

Key Takeaway: Standard deviation is the gold standard for measuring spread. Use your calculator's statistical mode to quickly verify these values, but be prepared to show the \(S_{xx}\) formula in working.

5. Interpreting Data: Skewness and Outliers

5.1 Skewness

Skewness describes the symmetry (or lack thereof) of the distribution. It tells us whether the data trails off more slowly to the left or to the right.

  • Positive Skew (Right Skew): The tail is stretched out to the right.
    Relationship: Mode < Median < Mean. (The mean is pulled furthest in the direction of the tail).
    Analogy: Housing prices where most homes are inexpensive, but a few huge mansions drag the average up.
  • Negative Skew (Left Skew): The tail is stretched out to the left.
    Relationship: Mean < Median < Mode. (The mean is pulled furthest in the direction of the tail).
    Analogy: Exam scores where most people score highly, but a few students drag the average down with low scores.
  • Symmetrical Distribution: The data is balanced.
    Relationship: Mean \(\approx\) Median \(\approx\) Mode.
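
As a rough rule of thumb (not a formal test), comparing the mean with the median gives the direction of skew. The sketch below uses made-up datasets and a hypothetical `skew_direction` helper. Real data is rarely exactly symmetrical, so in practice you would judge whether the two values are close rather than equal.

```python
from statistics import mean, median

def skew_direction(data):
    """Compare mean and median: the mean is pulled towards the longer tail."""
    m, med = mean(data), median(data)
    if m > med:
        return "positive"
    if m < med:
        return "negative"
    return "symmetrical"

right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]   # one large value stretches the right tail
left_tailed = [1, 17, 18, 18, 19, 19, 20]  # one small value stretches the left tail
balanced = [1, 2, 3, 4, 5]

print(skew_direction(right_tailed))  # positive (mean 4.75 > median 3)
print(skew_direction(left_tailed))   # negative (mean 16 < median 18)
print(skew_direction(balanced))      # symmetrical (mean 3 = median 3)
```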

5.2 Identifying and Handling Outliers

An outlier is an observation that lies an abnormal distance from other values in the dataset. They can be genuine extreme values or errors in recording.

In S1, we have a formal rule, based on the IQR, to identify potential outliers:

A value \(x\) is flagged as an outlier if either of the following holds:

  1. \(x < Q_1 - 1.5 \times \text{IQR}\) (The Lower Boundary)
  2. \(x > Q_3 + 1.5 \times \text{IQR}\) (The Upper Boundary)

Effect of Outliers: Outliers have a significant effect on the Mean and the Range, but minimal effect on the Median and IQR.

When drawing a Box Plot: If an outlier is found, it is usually marked with an asterisk (\( * \)) or a cross (\( \times \)). The whiskers then extend only to the largest/smallest values that are not outliers.
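
The rule and its effect on the whiskers can be sketched as follows. The data and quartiles are made up, and `outlier_fences` and `find_outliers` are hypothetical helper names; the quartiles are passed in directly so the sketch does not depend on any particular quartile convention.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) boundaries from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data, q1, q3):
    lower, upper = outlier_fences(q1, q3)
    return [x for x in data if x < lower or x > upper]

data = [3, 7, 8, 12, 13, 14, 18, 21, 25, 30, 60]  # made-up data
q1, q3 = 8, 25                                     # assumed quartiles for this data

lower, upper = outlier_fences(q1, q3)              # IQR = 17, so 1.5 x IQR = 25.5
outliers = find_outliers(data, q1, q3)

# Whiskers stop at the most extreme values that are not outliers.
inside = [x for x in data if lower <= x <= upper]
whisker_ends = (min(inside), max(inside))

print(lower, upper)    # -17.5 50.5
print(outliers)        # [60]
print(whisker_ends)    # (3, 30)
```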

Key Takeaway: Skewness tells us the shape (use the Mean-Median-Mode relationship). Outliers are mathematically defined using the \(1.5 \times \text{IQR}\) rule and must be handled carefully when calculating bounds for box plots.