Sampling and Estimation (P3 Statistics 2 - Paper 6)

Hello and welcome to Sampling and Estimation! This chapter is incredibly practical because it deals with how we use small pieces of information (*samples*) to make educated guesses about huge groups of things (*populations*). Don't worry if the formulas seem daunting initially; the underlying concepts are very intuitive. We are essentially learning how to make predictions while quantifying how uncertain those predictions are.

1. Populations and Samples: The Big Picture vs. The Snapshot

1.1 Key Definitions
  • Population: The entire group you are interested in studying. This could be all students in a country, all cars produced by a factory, or all apples in a harvest.
  • Sample: A small, selected sub-set of the population. We study the sample because studying the entire population is usually too expensive, time-consuming, or impossible.
  • Census: A study that attempts to collect data from every member of the population (rare in practice).

Analogy: Imagine a massive pot of soup (the Population). After stirring it well (random sampling), you take one spoonful (the Sample) to taste and determine if it needs more salt (estimating the Population Mean).

1.2 The Necessity of Randomness

For the results from your sample to be meaningful and mathematically valid, the sample must be chosen randomly.

  • Random Sample: Every member of the population must have an equal chance of being selected.
  • Why Randomness is Crucial: It ensures the sample is representative of the population and avoids bias (systematic favouritism towards certain outcomes).

1.3 Unsatisfactory Sampling Methods

You must be able to explain, in simple terms, why non-random methods are often unsatisfactory.

Example: If you want to estimate the average height of students at a large college, asking only the basketball team members would be an unsatisfactory method (biased) because they are likely taller than average, leading to an estimate that is too high.

Quick Review: Population vs. Sample

We use random samples to get unbiased estimates of population characteristics. If sampling isn't random, the results are highly unreliable.

2. The Sample Mean (\(\bar{X}\)) as a Random Variable

When you take a sample of size \(n\), you calculate its mean, \(\bar{x}\). If you took *another* random sample of size \(n\), you would get a slightly different mean, and so on. This means the Sample Mean itself is a random variable, denoted \(\bar{X}\).

2.1 Expectation of the Sample Mean

If we were to take infinite samples and average their means, what would we get?

\(E(\bar{X}) = \mu\)

This is a powerful result! It means that the sample mean is an unbiased estimator of the population mean \(\mu\). In plain English: on average, your sample mean will hit the target (the true population mean).

2.2 Variance of the Sample Mean

The variance tells us how spread out the sample means are.

\(Var(\bar{X}) = \frac{\sigma^2}{n}\)

  • \(\sigma^2\) is the population variance.
  • \(n\) is the sample size.

Important Insight: Notice that the variance is divided by \(n\). This means the larger the sample size (\(n\)), the smaller the variance of the sample mean. A larger sample gives you an estimate that is much more precise and closer to the true mean.


The Standard Deviation of the Sample Mean, known as the Standard Error (SE), is:

\(SE = \sqrt{Var(\bar{X})} = \frac{\sigma}{\sqrt{n}}\)
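These two results can be checked with a quick simulation, which is a sketch with illustrative values (a Normal population with \(\mu = 50\), \(\sigma = 10\), and samples of size \(n = 25\)), not part of the syllabus:

```python
import random
import statistics

# Illustrative population: Normal with mu = 50, sigma = 10; samples of size n = 25.
random.seed(1)
mu, sigma, n = 50, 10, 25

# Draw many samples of size n and record each sample mean.
sample_means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(20_000)
]

# E(X-bar) should be close to mu = 50;
# Var(X-bar) should be close to sigma^2 / n = 100 / 25 = 4.
print(round(statistics.fmean(sample_means), 1))
print(round(statistics.pvariance(sample_means), 1))
```

The average of the 20,000 sample means comes out very close to \(\mu = 50\), and their variance close to \(\sigma^2/n = 4\), matching the formulas above.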

2.3 The Distribution of \(\bar{X}\) (The Normal Case)

If the population \(X\) itself follows a Normal distribution \(X \sim N(\mu, \sigma^2)\), then the distribution of the sample mean \(\bar{X}\) is exactly Normal:

\(\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\)

3. The Central Limit Theorem (CLT)

This is perhaps the most important idea in statistics.

What if the population \(X\) is NOT Normally distributed?


The Central Limit Theorem (CLT) states that if you take a large enough sample (usually \(n > 30\) is considered 'large'), the distribution of the sample mean \(\bar{X}\) will be approximately Normal, regardless of the distribution of the original population \(X\).

If \(n\) is large, \(\bar{X} \approx N\left(\mu, \frac{\sigma^2}{n}\right)\)

Analogy: The CLT is like a statistical magic trick. No matter how weird or uneven the original population distribution looks (skewed, uniform, etc.), when you average many independent values together, the resulting distribution of averages smooths out into the predictable, bell-shaped Normal curve.
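A small simulation illustrates the CLT in action. The population here (an Exponential distribution, which is strongly skewed) and the sample size are illustrative choices, not from the syllabus:

```python
import random
import statistics

# A markedly skewed population: Exponential with mean 1 and variance 1.
random.seed(2)
n = 40  # a 'large' sample, n > 30

# Record the mean of each of 20,000 samples of size n.
means = [
    statistics.fmean(random.expovariate(1) for _ in range(n))
    for _ in range(20_000)
]

# CLT prediction: X-bar is approximately N(1, 1/40), so roughly 95% of the
# sample means should lie within 1.96 standard errors of the true mean.
se = (1 / n) ** 0.5
inside = sum(abs(m - 1) < 1.96 * se for m in means) / len(means)
print(round(inside, 2))
```

Even though the underlying population is far from Normal, the proportion of sample means falling within \(1.96\) standard errors of \(\mu\) is close to the 95% the Normal approximation predicts.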

Key Takeaway (CLT):

We rely on the CLT whenever we work with large samples, as it allows us to use the Normal distribution for the sample mean, even if we don't know the exact shape of the population.

4. Unbiased Estimates

When we collect a sample, we use the sample data to estimate the unknown population parameters (\(\mu\) and \(\sigma^2\)).

4.1 Estimating the Population Mean (\(\mu\))

The unbiased estimate of the population mean \(\mu\) is simply the sample mean \(\bar{x}\):

\(\hat{\mu} = \bar{x}\)

(The notation \(\hat{\mu}\), read "mu-hat", means "the estimate of \(\mu\)".)

4.2 Estimating the Population Variance (\(\sigma^2\))

To get an unbiased estimate of the population variance, we need a special formula called the unbiased sample variance, \(s^2\).

\(\hat{\sigma}^2 = s^2 = \frac{1}{n-1}\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\)


Did you know? The division by \(n-1\) instead of \(n\) is called Bessel's correction. We use \(n-1\) because the sample mean \(\bar{x}\) itself is used in the calculation, which slightly restricts the variability of the sample, so dividing by \(n-1\) corrects this slight underestimation.

  • If the question gives you raw data or summarised totals (\(\sum x\) and \(\sum x^2\)), you must use the formula above to calculate \(s^2\).
  • In examination questions involving large samples, you might sometimes be given a value of \(s^2\) or \(s\) (sample variance or sample standard deviation) and be asked to treat it as if it were the true population variance \(\sigma^2\) or standard deviation \(\sigma\).
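As a quick check, the summary-totals formula agrees with a library routine that also divides by \(n-1\). The data set below is made up purely for illustration:

```python
import statistics

# Illustrative sample of n = 8 observations.
data = [12, 10, 15, 13, 9, 14, 11, 20]
n = len(data)
sum_x = sum(data)                    # sum of x      = 104
sum_x2 = sum(x * x for x in data)    # sum of x^2    = 1436

# Unbiased estimate from the summary totals:
# s^2 = (1/(n-1)) * (sum x^2 - (sum x)^2 / n)
s2 = (sum_x2 - sum_x**2 / n) / (n - 1)

# statistics.variance also divides by n - 1, so the two agree.
print(s2, statistics.variance(data))  # both give 12.0
```

Here \(s^2 = \frac{1}{7}\left(1436 - \frac{104^2}{8}\right) = \frac{84}{7} = 12\), confirming the formula.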

5. Confidence Intervals for the Population Mean (\(\mu\))

Instead of giving a single point estimate (\(\bar{x}\)), a Confidence Interval (CI) gives a range of values within which the true population parameter is likely to lie.

5.1 The Concept of Confidence

A 95% confidence interval means that if we were to repeat the sampling process many times, 95% of the intervals we calculate would contain the true population mean \(\mu\).

5.2 Conditions for Using the Z-Interval

We can determine a confidence interval for \(\mu\) using the standard normal (Z) distribution if one of these two conditions is met:

  1. The population is Normally distributed AND the population variance \(\sigma^2\) is known.
  2. The sample size \(n\) is large (due to the Central Limit Theorem). If \(n\) is large, we can usually use the unbiased sample variance \(s^2\) as an estimate for \(\sigma^2\).

5.3 Calculating the Confidence Interval for \(\mu\)

The general formula for the confidence interval is:

\(\bar{x} \pm z \times \frac{\sigma}{\sqrt{n}}\)

  • \(\bar{x}\) is the calculated sample mean.
  • \(z\) is the critical z-value (found from the Normal Distribution tables, based on the required confidence level).
  • \(\frac{\sigma}{\sqrt{n}}\) is the Standard Error (SE).

Example of Critical Z-Values (Z-Scores)

To find the correct \(z\)-value, you look up the confidence level in the tables (or use \(\Phi(z)\)).

  • For 90% CI: The two tails combined are 10% (5% each side). We look up \(\Phi(z) = 0.95\). \(z \approx 1.645\).
  • For 95% CI: The two tails combined are 5% (2.5% each side). We look up \(\Phi(z) = 0.975\). \(z \approx 1.960\).
  • For 99% CI: The two tails combined are 1% (0.5% each side). We look up \(\Phi(z) = 0.995\). \(z \approx 2.576\).
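These tabulated values can be recovered from the inverse Normal CDF, available in Python's standard library as `statistics.NormalDist`:

```python
from statistics import NormalDist

# Recover each critical z-value from the inverse Normal CDF, Phi^{-1}(p).
for level, phi in [(90, 0.95), (95, 0.975), (99, 0.995)]:
    z = NormalDist().inv_cdf(phi)
    print(f"{level}% CI: z = {z:.3f}")
# prints z = 1.645, 1.960, 2.576
```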

Step-by-Step CI Calculation
  1. Identify the known values: Sample size \(n\), sample mean \(\bar{x}\), and population standard deviation \(\sigma\) (or its estimate \(s\)).
  2. Determine the critical \(z\)-value: Look up the \(z\) corresponding to your confidence level (e.g., \(z=1.96\) for 95%).
  3. Calculate the Standard Error (SE): \(SE = \frac{\sigma}{\sqrt{n}}\).
  4. Calculate the Margin of Error (ME): \(ME = z \times SE\).
  5. Construct the interval: \(\bar{x} - ME < \mu < \bar{x} + ME\).
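The steps above can be sketched in Python with some exam-style (but invented) values: \(n = 50\), \(\bar{x} = 12.5\), \(\sigma = 3.2\):

```python
from statistics import NormalDist

# Step 1: known values (illustrative, not from a real question).
n, x_bar, sigma = 50, 12.5, 3.2

# Step 2: critical z-value for 95% confidence, Phi(z) = 0.975.
z = NormalDist().inv_cdf(0.975)

se = sigma / n ** 0.5                  # Step 3: standard error
me = z * se                            # Step 4: margin of error
lower, upper = x_bar - me, x_bar + me  # Step 5: construct the interval

print(f"95% CI: ({lower:.2f}, {upper:.2f})")  # (11.61, 13.39)
```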


Example Interpretation: If a 95% CI for the average time a student spends studying is (10.5 hours, 14.5 hours), you would interpret this as: "We are 95% confident that the true average study time for all students lies between 10.5 hours and 14.5 hours."

6. Confidence Intervals for the Population Proportion (\(p\))

Sometimes we estimate the proportion of a population that has a certain characteristic (*e.g., the proportion of voters who support Candidate A*).

6.1 Conditions and Distribution

This method is only valid for large samples.

Recall that for a Binomial distribution \(B(n, p)\), if \(n\) is large, it can be approximated by a Normal distribution. Consequently, the sample proportion \(\hat{P}\) (the number of successes divided by \(n\)) is approximately distributed as:

\(\hat{P} \sim N\left(p, \frac{p(1-p)}{n}\right)\)

However, since we don't know the true population proportion \(p\), we use the sample proportion \(\hat{p}\) in the variance calculation.

6.2 Calculating the Confidence Interval for \(p\)

The approximate confidence interval for the population proportion \(p\) is:

\(\hat{p} \pm z \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)

Where:

  • \(\hat{p}\) is the sample proportion (calculated from the sample data).
  • \(n\) is the sample size.
  • \(z\) is the critical z-value corresponding to the confidence level.
  • \(\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\) is the estimated Standard Error for the proportion.
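The same recipe works for a proportion. The poll figures below are invented for illustration: 240 of \(n = 400\) sampled voters support Candidate A:

```python
from statistics import NormalDist

# Illustrative poll: 240 successes out of n = 400.
n, successes = 400, 240
p_hat = successes / n                    # sample proportion, 0.6

z = NormalDist().inv_cdf(0.975)          # 95% confidence
se = (p_hat * (1 - p_hat) / n) ** 0.5    # estimated standard error
lower, upper = p_hat - z * se, p_hat + z * se

print(f"95% CI for p: ({lower:.3f}, {upper:.3f})")  # (0.552, 0.648)
```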

A Note on Continuity Correction

When approximating the Binomial or Poisson distribution using the Normal distribution (which you have seen in other chapters), we use a continuity correction. However, when calculating confidence intervals for the population mean or proportion, we do NOT use a continuity correction.

Quick Review: Confidence Intervals

The calculation relies on the Standard Error (\(\frac{\sigma}{\sqrt{n}}\) for mean, \(\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\) for proportion) and the critical Z-score. Remember the distinction between \(\sigma\) (population SD) and \(s\) (unbiased sample SD, used when \(\sigma\) is unknown but \(n\) is large).

You've now covered the essentials of using samples to estimate population parameters. Mastering these steps is vital for success in Paper 6!