🚀 Inference Using Normal and t-Distributions: Study Notes (9231 Further Statistics)

Welcome to one of the most powerful and practical chapters in Further Statistics! This section is all about making robust decisions and predictions about large populations, even when we only have small chunks of data (samples).

In your previous studies, you used the Normal distribution (the Z-test) extensively, usually assuming you knew the population variance (\(\sigma^2\)) or that your sample was large. In the real world, we often don't know \(\sigma^2\), and our data sets might be tiny. This is where the t-distribution saves the day!

You will learn how to choose the correct test (Z or t) and use it to test hypotheses and construct confidence intervals regarding population means.


1. Quick Review: When to Use Z vs. t

The choice between using the Normal distribution (Z-test) and the t-distribution depends on two things: sample size (\(n\)) and whether you know the population variance (\(\sigma^2\)).

The Normal Distribution (Z-Test): The Gold Standard

We rely on the standard Normal distribution (Z) when we can be highly confident in our knowledge of the population parameters.

  • Case 1: Population Variance is KNOWN (regardless of sample size \(n\)).
  • Case 2: Sample Size is LARGE (\(n \ge 30\)), even if the population variance is unknown. Why? Because of the Central Limit Theorem; when \(n\) is large, the sample variance (\(s^2\)) is a very good estimate for the population variance (\(\sigma^2\)), and the sampling distribution becomes essentially Normal.

The t-Distribution: The Small Sample Specialist

The t-distribution is used when we are dealing with genuine uncertainty.

  • Case 3: Population Variance is UNKNOWN AND Sample Size is SMALL (\(n < 30\)).

💡 Analogy: Think of the Z-test as a sharp, precise knife. You use it when you know exactly what you're cutting. The t-distribution is a slightly duller knife; it gives you a wider margin of error because you're less certain about the precision of your estimate.

Key Takeaway: If \(\sigma^2\) is unknown and \(n\) is small, you must use the t-distribution. This is the main focus of 9231 inference for means.
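The decision rules above can be sketched as a small Python helper. This is purely illustrative: the function name and the \(n \ge 30\) cut-off follow these notes, not any statistics library.

```python
def choose_test(n, variance_known):
    """Pick Z or t for inference about a single population mean.

    Mirrors the rules above: use Z if sigma^2 is known or the
    sample is large (n >= 30); otherwise use the t-distribution
    with n - 1 degrees of freedom.
    """
    if variance_known or n >= 30:
        return "Z"
    return f"t with {n - 1} degrees of freedom"

print(choose_test(50, False))   # large sample: Z is fine
print(choose_test(10, False))   # small sample, sigma^2 unknown: t, nu = 9
```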

2. Understanding the t-Distribution

a) What is the t-distribution?

The t-distribution (or Student's t-distribution) is similar to the Normal distribution: it is symmetrical and bell-shaped, centred at zero. However, it is generally flatter and has heavier tails than the standard Normal (Z) distribution. This accounts for the extra uncertainty introduced when estimating \(\sigma^2\) using the sample variance \(s^2\).

b) Degrees of Freedom (\(\nu\))

The shape of the t-distribution changes based on its degrees of freedom (\(\nu\)).

  • For a single sample of size \(n\), the degrees of freedom are always \(\nu = n - 1\).
  • As \(\nu\) increases (i.e., as the sample size \(n\) gets larger), the t-distribution becomes narrower and approaches the standard Normal (Z) distribution.

Why \(n-1\)? We use up one piece of information (one degree of freedom) when we calculate the sample mean (\(\bar{x}\)) to estimate the variance. If you know \(\bar{x}\) and \(n-1\) values, the last value is fixed. Hence, only \(n-1\) values are "free to vary."
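A quick numeric illustration of this idea, with made-up values: once \(\bar{x}\) is fixed, the last observation is forced by the others, because \(\sum x = n\bar{x}\).

```python
# Three of the four values and the sample mean are known;
# the fourth value is then completely determined.
known = [4.0, 6.0, 7.0]       # hypothetical observations
n, xbar = 4, 5.5              # sample size and fixed sample mean
last = n * xbar - sum(known)  # sum(x) = n * xbar forces this value
print(last)                   # the fourth value is not free to vary
```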


3. Inference for a Single Population Mean (\(\mu\))

We use the t-test when we have a small sample (\(n < 30\)) drawn from a Normal population with unknown variance (\(\sigma^2\)).

a) Hypothesis Test (t-test)

The test statistic \(T\) measures how many standard errors the sample mean (\(\bar{x}\)) is away from the proposed population mean (\(\mu_0\)) specified in the null hypothesis \(H_0\).

Step 1: Formulate Hypotheses

Example: \(H_0: \mu = 50\) vs. \(H_1: \mu \ne 50\) (two-tailed).

Step 2: Calculate the Unbiased Estimate of Variance (\(s^2\))

The population variance \(\sigma^2\) is estimated by the unbiased sample variance \(s^2\). If you are given raw data, you must calculate this first:

$$s^2 = \frac{1}{n-1} \sum (x - \bar{x})^2$$

Step 3: Calculate the Test Statistic T

$$T = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where \(s / \sqrt{n}\) is the estimated standard error of the mean.

Step 4: Determine Degrees of Freedom and Critical Value

Degrees of Freedom: \(\nu = n - 1\). Look up the critical t-value in the statistical tables (MF19, page 41) based on \(\nu\) and the significance level (\(\alpha\)).

Step 5: Conclusion

Compare the calculated \(T\) with the critical value, or compare the p-value with \(\alpha\). If \(|T| > t_{\text{crit}}\), reject \(H_0\).
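Putting Steps 1–5 together in Python on a made-up sample, testing \(H_0: \mu = 50\) at the 5% level (the critical value 2.365 for \(\nu = 7\) is the \(p = 0.975\) entry of standard t-tables):

```python
import math
import statistics

# Hypothetical sample of n = 8 observations from a Normal population
x = [48.2, 51.1, 49.5, 47.8, 50.3, 48.9, 49.0, 51.6]
n = len(x)
xbar = statistics.mean(x)
s2 = statistics.variance(x)            # unbiased: divides by n - 1
t = (xbar - 50) / math.sqrt(s2 / n)    # test statistic, nu = n - 1 = 7

t_crit = 2.365                         # t-table, p = 0.975, nu = 7
reject = abs(t) > t_crit
print(round(t, 3), reject)
```

Here \(|T|\) falls below the critical value, so there is insufficient evidence to reject \(H_0\).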

b) Confidence Interval (CI) for \(\mu\)

A confidence interval gives a range of plausible values for the true population mean \(\mu\). Since \(\sigma^2\) is unknown and \(n\) is small, we use the t-distribution critical value.

The CI formula for \(\mu\) is:

$$\bar{x} \pm t_{\text{crit}} \times \frac{s}{\sqrt{n}}$$

  • \(\bar{x}\) is the sample mean.
  • \(t_{\text{crit}}\) is the critical t-value found using \(\nu = n - 1\) and the desired confidence level (e.g., for a 95% CI, use the 0.975 column in the MF19 t-table).
  • \(\frac{s}{\sqrt{n}}\) is the estimated standard error.

Common Mistake Alert! Always remember that when finding a 95% confidence interval, you use the t-table value corresponding to \(p = 0.975\) for a two-tailed test (since \(1 - 0.95 = 0.05\), and \(0.05/2 = 0.025\), so \(p = 1 - 0.025 = 0.975\)).
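A worked sketch in Python with invented data (\(n = 10\), so \(\nu = 9\); the value 2.262 is the \(p = 0.975\) entry of the t-table):

```python
import math
import statistics

# Hypothetical sample from a Normal population with unknown variance
x = [12.1, 11.4, 13.0, 12.6, 11.9, 12.3, 12.8, 11.7, 12.5, 12.2]
n = len(x)
xbar = statistics.mean(x)
s = statistics.stdev(x)            # unbiased estimate, n - 1 divisor

t_crit = 2.262                     # t-table, nu = 9, p = 0.975
half_width = t_crit * s / math.sqrt(n)
print(round(xbar - half_width, 3), round(xbar + half_width, 3))
```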

4. Inference for the Difference Between Two Population Means (\(\mu_1 - \mu_2\))

This section is critical and requires you to correctly identify whether the samples are independent or dependent (paired), and whether the sample sizes are large or small.

Scenario A: Large Samples (Z-test)

If both sample sizes \(n_1\) and \(n_2\) are large (\(\ge 30\)), we use the Normal (Z) distribution, substituting the sample variances \(s_1^2\) and \(s_2^2\) for \(\sigma_1^2\) and \(\sigma_2^2\).

Test Statistic:

$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Confidence Interval:

$$(\bar{x}_1 - \bar{x}_2) \pm Z_{\text{crit}} \times \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
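For example, with hypothetical summary statistics for two large samples (1.96 being the familiar two-tailed 5% critical value for Z):

```python
import math

# Made-up summary statistics for two large independent samples
xbar1, s1sq, n1 = 20.4, 4.2, 60
xbar2, s2sq, n2 = 19.1, 3.8, 50

se = math.sqrt(s1sq / n1 + s2sq / n2)  # estimated standard error
z = (xbar1 - xbar2) / se               # test statistic
lower = (xbar1 - xbar2) - 1.96 * se    # 95% CI limits
upper = (xbar1 - xbar2) + 1.96 * se
print(round(z, 3), round(lower, 3), round(upper, 3))
```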

Scenario B: Small Independent Samples (2-Sample t-test)

We use this test when we have two small, independent samples from Normal populations, and we assume the two populations have the same unknown variance (\(\sigma_1^2 = \sigma_2^2 = \sigma^2\)).

Step 1: Calculate the Pooled Estimate of Variance (\(s_p^2\))

Since we assume the variances are the same, we "pool" the data to get a single, better estimate of this common variance. The formula provided in MF19 (page 39) is:

$$s_p^2 = \frac{\sum(x_1 - \bar{x}_1)^2 + \sum(x_2 - \bar{x}_2)^2}{n_1 + n_2 - 2}$$

Note: \(\sum(x - \bar{x})^2\) is often referred to as the Sum of Squares, SS. This formula is essentially \((SS_1 + SS_2) / (n_1 + n_2 - 2)\).

Step 2: Calculate the Test Statistic T

$$T = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

Step 3: Degrees of Freedom (\(\nu\))

When pooling two independent samples, the degrees of freedom are the sum of the individual degrees of freedom:

$$\nu = (n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2$$

Step 4: Confidence Interval for \(\mu_1 - \mu_2\)

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\text{crit}} \times \sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

Don't worry if this seems tricky at first! The hard part is calculating \(s_p^2\). Once you have that, the rest is plugging values into the standard t-formula using the appropriate degrees of freedom.
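Here are Steps 1–3 in Python on two invented samples (\(\nu = 5 + 6 - 2 = 9\); compare \(T\) with the table value 2.262 at the 5% two-tailed level):

```python
import math
import statistics

# Two hypothetical small independent samples, assumed to come from
# Normal populations with a common (unknown) variance
x1 = [10.2, 9.8, 11.1, 10.5, 9.9]
x2 = [9.1, 9.6, 8.8, 9.4, 9.0, 9.3]
n1, n2 = len(x1), len(x2)
m1, m2 = statistics.mean(x1), statistics.mean(x2)

ss1 = sum((v - m1) ** 2 for v in x1)       # Sum of Squares, SS_1
ss2 = sum((v - m2) ** 2 for v in x2)       # Sum of Squares, SS_2
sp2 = (ss1 + ss2) / (n1 + n2 - 2)          # pooled variance, nu = 9

t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
print(round(sp2, 3), round(t, 2))
```

With \(|T|\) well above 2.262, this made-up data would lead to rejecting \(H_0: \mu_1 = \mu_2\).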

Scenario C: Paired Sample t-test (Dependent Samples)

This test is used when measurements are naturally linked (e.g., comparing a 'before' score and an 'after' score for the same group of people, or comparing identical twins).

The clever trick here is that we don't treat this as two separate groups; we treat the difference between the paired scores as a single sample.

Steps for Paired t-test:

  1. Calculate the difference \(d\) for every pair (\(d = x_1 - x_2\)).
  2. Calculate the mean difference \(\bar{d}\) and the standard deviation of the differences \(s_d\).
  3. The hypothesis test reduces to a single sample t-test on the differences, testing if the mean difference \(\mu_d\) is zero (i.e., \(H_0: \mu_d = 0\)).

Test Statistic:

$$T = \frac{\bar{d} - 0}{s_d / \sqrt{n}}$$

where \(n\) is the number of pairs.

Degrees of Freedom:

$$\nu = n - 1$$

Confidence Interval for \(\mu_d\):

$$\bar{d} \pm t_{\text{crit}} \times \frac{s_d}{\sqrt{n}}$$
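The whole paired procedure in Python, with made-up 'before' and 'after' scores for six subjects (\(\nu = 5\); the table value at the 5% two-tailed level is 2.571):

```python
import math
import statistics

# Hypothetical paired scores for the same six subjects
before = [72, 68, 75, 70, 66, 74]
after = [75, 70, 74, 73, 69, 78]
d = [a - b for a, b in zip(after, before)]  # one sample of differences
n = len(d)
dbar = statistics.mean(d)
sd = statistics.stdev(d)                    # n - 1 divisor

t = dbar / (sd / math.sqrt(n))              # tests H0: mu_d = 0
t_crit = 2.571                              # t-table, nu = 5, p = 0.975
print(round(t, 3), abs(t) > t_crit)
```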

Choosing the Right Test (Summary Aid):
  • Single Sample, \(\sigma^2\) Known or \(n\) Large: Z-Test (Normal)
  • Single Sample, \(\sigma^2\) Unknown, \(n\) Small: T-Test (\(\nu = n-1\))
  • Two Samples, Independent, \(n\) Large: Z-Test (Normal approximation)
  • Two Samples, Independent, \(n\) Small: 2-Sample T-Test (Pooled Variance, \(\nu = n_1 + n_2 - 2\))
  • Two Samples, Dependent/Paired: Paired T-Test (Single sample on differences, \(\nu = n-1\))

5. Working with Confidence Intervals (CI)

Confidence intervals are the natural counterpart to hypothesis testing. If the hypothesised value (\(\mu_0\) for a single mean, or \(0\) for \(\mu_1 - \mu_2\)) falls outside the calculated confidence interval, then you would reject the null hypothesis at the corresponding significance level.

a) Determining Appropriate CI Formulae

The process for finding the CI mirrors the hypothesis test selection:

  1. Identify the parameter: Are you estimating a single mean (\(\mu\)) or a difference in means (\(\mu_1 - \mu_2\))?
  2. Identify the distribution: Is it Normal (\(Z\)) or t (\(T\))? (Based on \(n\) size and knowledge of \(\sigma^2\)).
  3. Identify the critical value: Look up \(Z_{\text{crit}}\) (from the critical values section of the normal table) or \(t_{\text{crit}}\) (from the t-table using the appropriate degrees of freedom).

The Structure of ANY Confidence Interval is:

$$(\text{Point Estimate}) \pm (\text{Critical Value}) \times (\text{Standard Error})$$
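That single template covers every interval in this chapter. As a sketch (with hypothetical values):

```python
def confidence_interval(point_estimate, critical_value, standard_error):
    """Generic CI: point estimate +/- critical value * standard error."""
    half_width = critical_value * standard_error
    return (point_estimate - half_width, point_estimate + half_width)

# e.g. a sample mean of 12.25 with estimated standard error 0.157,
# using t_crit = 2.262 (nu = 9, 95% confidence):
lo, hi = confidence_interval(12.25, 2.262, 0.157)
print(round(lo, 3), round(hi, 3))
```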

b) Interpreting the Confidence Interval

A 95% confidence interval does NOT mean there is a 95% chance that the true population mean is within that specific interval. Instead, it means that if we repeated the sampling process many times, 95% of the confidence intervals constructed would contain the true population mean.

Example: If a 95% CI for the difference in means \(\mu_A - \mu_B\) between Method A and Method B is (2.5, 4.1), then, since this interval does not include zero, we can be 95% confident that the mean for Method A exceeds the mean for Method B.

c) Did you know?

The t-distribution was first developed by William Sealy Gosset in 1908. He worked for Guinness brewery and used small samples to monitor the quality of the beer. Since company rules prevented him from publishing under his own name, he published under the pseudonym "Student," leading to the term "Student's t-distribution."

Quick Review: The Essential Ingredients

To succeed in this chapter, ensure you can quickly identify three things in any problem:

  1. Is the sample small or large?
  2. Is \(\sigma^2\) known or unknown?
  3. Are the samples independent or paired?

Answering these questions immediately points you to the correct test statistic (Z or T) and the correct degrees of freedom (\(\nu\)).

Keep practising selecting the right procedure, and you'll master this topic!