Study Notes: Estimation (FS2.3)
Hello and welcome to the Estimation chapter! This section of FS2 Statistics is incredibly important because it moves us beyond just calculating sample statistics and helps us make educated guesses about the entire population.
In the real world, we rarely have time or resources to measure every single person or item (the entire population). Instead, we take a sample. Estimation is the technique we use to take the information from that small sample (like the sample mean, \(\bar{x}\)) and use it to confidently predict the true, unknown value of the population parameter (like the population mean, \(\mu\)).
Don't worry if this sounds abstract—we'll break down the process into clear, manageable steps using familiar statistical concepts (like the Normal and \(t\)-distributions).
What is Estimation? Point vs. Interval Estimates
1. Point Estimates (Review)
In FS2.2, we learned about point estimates. A point estimate is a single value used to estimate a population parameter.
- The best point estimate for the Population Mean (\(\mu\)) is the Sample Mean (\(\bar{x}\)).
- The best point estimate for the Population Variance (\(\sigma^2\)) is the unbiased Sample Variance (\(s^2\)).
Example: If you measure the heights of 50 students in a college and the average is 170 cm, then 170 cm is your point estimate for the average height of all students in that college.
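The two point estimates above can be computed directly. This is a minimal sketch using hypothetical data (a made-up sample of 10 student heights, not the 50-student sample from the example); note that Python's `statistics.variance` already uses the unbiased \(n-1\) divisor:

```python
import statistics

# Hypothetical sample: heights (cm) of 10 students
heights = [168, 172, 170, 169, 171, 173, 167, 170, 172, 168]

x_bar = statistics.mean(heights)   # point estimate of mu
s2 = statistics.variance(heights)  # unbiased sample variance (divides by n - 1)

print(x_bar)  # 170.0
print(s2)     # 4.0
```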
2. Interval Estimates (The Focus of FS2.3)
A point estimate is almost certainly wrong! The true population mean might be 170.1 cm or 169.9 cm, but it is almost never exactly 170.0 cm.
An Interval Estimate (or Confidence Interval) gives us a range of values where the true population parameter is likely to lie.
Key Takeaway: Instead of saying "I think the mean is exactly 170 cm," we say "I am 95% confident that the true mean lies between 168 cm and 172 cm."
The Concept of Confidence Intervals (CI)
1. Defining the Confidence Interval
A Confidence Interval (CI) is an interval calculated from sample data that is likely to contain the true population parameter with a specified level of confidence.
The syllabus focuses only on CIs that are symmetrical about the mean. This means your best estimate (\(\bar{x}\)) sits exactly in the middle of the interval.
The structure is always:
CI = Point Estimate \(\pm\) Margin of Error (E)
The Margin of Error (E) captures the uncertainty due to using a sample instead of the whole population.
2. Understanding the Confidence Level
The Confidence Level (e.g., 90%, 95%, 99%) tells you how reliable the method is: if you repeated the sampling many times, that percentage of the resulting intervals would contain the true mean.
Analogy: The Fishing Net
Imagine the true population mean (\(\mu\)) is a fish in the sea. Your sample mean (\(\bar{x}\)) is where your boat is. A Confidence Interval is your fishing net.
- If you use a 90% CI (a narrow net), you might miss the fish more often (10% chance of missing).
- If you use a 99% CI (a very wide net), you are almost certain to catch the fish (only 1% chance of missing).
The wider the interval, the higher the confidence, but the less precise the information!
3. The Standard Error and Critical Value
To calculate the Margin of Error, \(E\), we need two components:
The Standard Error (\(\sigma_{\bar{X}}\))
This is the standard deviation of the sampling distribution of the mean. It measures how much the sample means are expected to vary from the population mean.
\[ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \]
where \(\sigma\) is the population standard deviation and \(n\) is the sample size.
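The standard error formula is a one-liner. A quick sketch with assumed values (population standard deviation \(\sigma = 6\) cm and sample size \(n = 36\), chosen for a round answer):

```python
import math

# Standard error of the mean: sigma / sqrt(n)
# Assumed values: sigma = 6 cm, n = 36
sigma, n = 6.0, 36
se = sigma / math.sqrt(n)
print(se)  # 1.0
```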
The Critical Value (z or t)
This value determines the width of the interval based on your chosen confidence level.
- For a 95% CI, we leave 2.5% in the upper tail and 2.5% in the lower tail.
- The critical value is the \(z\) or \(t\) score corresponding to that tail area.
For 95% confidence, the critical Z-value is 1.96. (You should be very familiar with this from FS1!)
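You can recover critical z-values yourself rather than memorising them. This sketch uses the standard Normal inverse CDF from Python's standard library: for confidence level \(C\), the tail area is \(\alpha/2 = (1 - C)/2\), so the critical value is the quantile at \(1 - \alpha/2\):

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-tailed critical z-value for a given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

print(round(z_critical(0.95), 2))  # 1.96
print(round(z_critical(0.99), 2))  # 2.58
```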
Quick Review: The Margin of Error Formula
\[ E = \text{Critical Value} \times \text{Standard Error} \]
Calculating Confidence Intervals: Three Key Scenarios
The distribution (and thus the critical value) we use depends entirely on two things: Is the population variance (\(\sigma^2\)) known? and How large is the sample size (\(n\))?
Scenario 1: Normal Distribution with KNOWN Variance (\(\sigma^2\))
If we know the population variance \(\sigma^2\) (or standard deviation \(\sigma\)), we always use the Z-distribution (Normal distribution), regardless of sample size \(n\).
Formula (Z-Interval):
\[ \bar{x} \pm z \times \frac{\sigma}{\sqrt{n}} \]
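Putting the pieces together, the Z-interval is a short computation. A sketch with assumed values (\(\bar{x} = 170\) cm, known \(\sigma = 6\) cm, \(n = 36\), 95% confidence):

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for mu when the population sd sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    e = z * sigma / math.sqrt(n)  # margin of error
    return (x_bar - e, x_bar + e)

lo, hi = z_interval(170, 6, 36)
print(round(lo, 2), round(hi, 2))  # 168.04 171.96
```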
Scenario 2: Large Samples (\(n \geq 30\))
If the sample size is large (generally \(n \geq 30\)), the Central Limit Theorem (CLT) guarantees that the sampling distribution of the mean is approximately Normal.
Even if \(\sigma\) is unknown, for large samples, we can substitute the sample standard deviation (\(s\)) for \(\sigma\). We still use the Z-distribution.
Formula (Z-Interval for Large Samples):
\[ \bar{x} \pm z \times \frac{s}{\sqrt{n}} \]
(Remember: The syllabus notes confirm we use the Normal approximation for large samples with known or unknown variance.)
Scenario 3: Small Samples (\(n < 30\)) with UNKNOWN Variance (\(\sigma^2\))
This is the trickiest case. If the sample is small and we do not know the population variance \(\sigma^2\), using the Normal distribution would underestimate the uncertainty.
Instead, we use the t-distribution (Student's \(t\)-distribution).
- The \(t\)-distribution is wider and flatter than the Z-distribution, giving a larger margin of error to account for the uncertainty of estimating \(\sigma\) with a small \(s\).
- It requires calculating the Degrees of Freedom (\(\nu\)): \(\nu = n - 1\).
- We look up the critical \(t\)-value in the \(t\)-distribution tables using \(\nu\) and the confidence level.
Formula (t-Interval):
\[ \bar{x} \pm t_{\nu} \times \frac{s}{\sqrt{n}} \]
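A worked t-interval, following the steps in the bullets above. The data are a hypothetical sample of \(n = 10\) heights; the critical value \(t_{\nu=9} = 2.262\) for 95% confidence is the standard tabulated value (Python's standard library has no t-distribution, so we look it up just as the notes describe):

```python
import math

# Hypothetical small sample (n = 10), sigma unknown -> use t
heights = [168, 172, 170, 169, 171, 173, 167, 170, 172, 168]
n = len(heights)
x_bar = sum(heights) / n
s = math.sqrt(sum((x - x_bar) ** 2 for x in heights) / (n - 1))  # unbiased s

t_crit = 2.262                 # t-value for nu = 9, 95% confidence (from tables)
e = t_crit * s / math.sqrt(n)  # margin of error
print(round(x_bar - e, 2), round(x_bar + e, 2))  # 168.57 171.43
```

Note the interval is wider than the corresponding Z-interval would be, exactly as the first bullet predicts.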
⚠ Common Mistake Alert: Z vs. t ⚠
Always check the two key facts before choosing your critical value:
- Is \(\sigma\) known? Yes -> Use Z.
- Is \(\sigma\) unknown? Check \(n\). If \(n \geq 30\) -> Use Z (CLT applies). If \(n < 30\) -> Use t (extra uncertainty applies).
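The two-question checklist above can be sketched as a small decision helper (the function name is illustrative, not from any library):

```python
def choose_distribution(sigma_known, n):
    """Decision rule from the notes: Z if sigma is known or n >= 30, else t."""
    if sigma_known:
        return "Z"
    return "Z" if n >= 30 else "t"

print(choose_distribution(True, 12))   # Z  (sigma known)
print(choose_distribution(False, 50))  # Z  (CLT applies)
print(choose_distribution(False, 12))  # t  (small n, sigma unknown)
```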
Inferences and Sample Size Estimation
1. Making Inferences from Confidence Intervals
Confidence intervals provide an easy way to conduct a hypothesis test for the population mean (\(\mu\)). This is sometimes called "testing by inspection."
Suppose we construct a 95% confidence interval for \(\mu\). We can then use this interval to test a null hypothesis \(H_0: \mu = \mu_0\). The corresponding two-tailed significance level is 100% minus the confidence level (e.g., 5% for a 95% CI).
The Rule:
- If the hypothesised mean (\(\mu_0\)) falls inside the confidence interval, there is no reason to reject \(H_0\) at the corresponding significance level.
- If the hypothesised mean (\(\mu_0\)) falls outside the confidence interval, we reject \(H_0\) at the corresponding significance level.
Example: If the 95% CI is [168, 172], and someone claims \(\mu_0 = 175\), since 175 is outside the interval, we reject their claim at the 5% significance level.
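"Testing by inspection" amounts to a single membership check. A minimal sketch using the example's interval and claim:

```python
def reject_h0(ci, mu_0):
    """Reject H0: mu = mu_0 if mu_0 lies outside the confidence interval."""
    lo, hi = ci
    return not (lo <= mu_0 <= hi)

print(reject_h0((168, 172), 175))  # True  -> reject at the 5% level
print(reject_h0((168, 172), 170))  # False -> no reason to reject
```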
2. Estimating Sample Size (\(n\))
In planning research, we often need to know how large a sample is required to achieve a specific Margin of Error (\(E\)) at a given confidence level.
Since the Margin of Error is \(E = z \times \frac{\sigma}{\sqrt{n}}\), we can rearrange this formula to solve for \(n\):
\[ \sqrt{n} = \frac{z\sigma}{E} \]
\[ n = \left(\frac{z\sigma}{E}\right)^2 \]
In this context, we must use the Z-critical value because we are planning the sample size assuming the CLT will apply once the sample is collected. We must also have an estimate for \(\sigma\) (either from previous studies or a pilot sample).
Crucial Step: Rounding Up
Since the sample size \(n\) must be a whole number, always round up the result of the calculation. If \(n = 100.1\), you need 101 observations; rounding down to 100 would leave the margin of error larger than required.
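The rearranged formula plus the round-up rule can be sketched as follows, with assumed planning values (\(\sigma = 6\) cm, target \(E = 1\) cm, 95% confidence):

```python
import math

def required_sample_size(sigma, e, z=1.96):
    """n = (z * sigma / E)^2, rounded UP to the next whole number."""
    return math.ceil((z * sigma / e) ** 2)

# Assumed values: sigma = 6 cm, target margin of error E = 1 cm
print(required_sample_size(6, 1))  # 139  (raw value 138.2976, rounded up)
```

Note that `math.ceil` implements the "always round up" rule: even \(138.01\) would require 139 observations.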
✓ Chapter Key Takeaways
- Goal: Use the sample mean (\(\bar{x}\)) to create a range (CI) for the true population mean (\(\mu\)).
- Formula Structure: \(\bar{x} \pm (\text{Critical Value} \times \frac{\text{St Dev}}{\sqrt{n}})\).
- Z-interval conditions: Used if \(\sigma\) is known OR if \(n \geq 30\).
- t-interval conditions: Used if \(\sigma\) is unknown AND \(n < 30\). Degrees of freedom \(\nu = n-1\).
- Inference: If the hypothesised mean \(\mu_0\) is outside the CI, reject \(H_0\).
- Sample Size: Calculate \(n = \left(\frac{z\sigma}{E}\right)^2\) and always round up to the next integer.