Welcome to Unit S3: Estimation, Confidence Intervals, and Tests!
Hello future statistician! This chapter is where we move from just describing data to making powerful, informed guesses about entire populations. This is the heart of statistical inference—using a small sample to draw big conclusions.
Why is this important? Whether you are working in finance, medicine, or quality control, you rarely have access to *all* the data. We use the tools in this chapter to determine the true values (like the average lifespan of a product or the proportion of people who prefer a certain brand) with a specific level of certainty. Mastering these concepts is crucial for both your exams and real-world analytical work!
1. Point Estimators and Unbiased Estimation
In statistics, a point estimator is simply a statistic (a function of the sample data) that we use to estimate the value of an unknown population parameter (like the population mean, \(\mu\), or variance, \(\sigma^2\)).
What is an Unbiased Estimator?
When we estimate a population parameter, we want our estimate to be 'fair'. An estimator is unbiased if, on average, the estimates it produces across many samples equal the true value of the parameter.
Think of it like target practice. If your estimator is unbiased, even if individual shots (samples) miss the bullseye (the true parameter), the average position of all your shots is exactly the bullseye.
The key estimators you need to know are:
- Population Mean (\(\mu\)): Estimated by the Sample Mean, \(\bar{X}\). (\(\bar{X}\) is an unbiased estimator of \(\mu\)).
- Population Proportion (\(p\)): Estimated by the Sample Proportion, \(\hat{p}\). (\(\hat{p}\) is an unbiased estimator of \(p\)).
The Critical Case: Estimating Variance
This is where students often make mistakes. We have two main ways to calculate variance from a sample, but only one is an unbiased estimator of the population variance, \(\sigma^2\).
The Biased Estimator (\(S^2\))
This is the standard formula from your earlier statistics work, dividing the sum of squared deviations by \(n\): $$S^2 = \frac{\sum (x_i - \bar{x})^2}{n}$$ If you use \(S^2\) to estimate the population variance \(\sigma^2\), your estimates will, on average, be too small: \(E(S^2) = \frac{n-1}{n}\sigma^2 < \sigma^2\). It is a biased estimator.
The Unbiased Estimator (\(\hat{\sigma}^2\))
To get an unbiased estimate of the population variance, we adjust the denominator by using the degrees of freedom, \(n-1\): $$\hat{\sigma}^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$$ The term \(\hat{\sigma}^2\) is the standard notation for the unbiased estimator of the population variance. It is also often called the sample variance in A Level notation.
Memory Aid: Why \(n-1\)?
When calculating variance, you first use the sample mean, \(\bar{x}\), to find the deviations. Because you used the data itself to calculate \(\bar{x}\), you "spent" one piece of information, leaving you with only \(n-1\) pieces of *free* information (degrees of freedom). Dividing by \(n-1\) corrects this inherent underestimation.
Key Takeaway on Estimators: Always use \(\bar{X}\) for \(\mu\), \(\hat{p}\) for \(p\), and divide by \(n-1\) for the best (unbiased) estimate of \(\sigma^2\).
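The effect of the \(n\) versus \(n-1\) denominator is easy to see by simulation. Here is a minimal sketch (illustrative numbers only; NumPy is assumed available) that draws many small samples from a Normal population with known variance 4 and compares the average value of the two estimators:

```python
import numpy as np

# Illustrative simulation: true variance sigma2 = 4, small samples of n = 5
rng = np.random.default_rng(42)
sigma2, n, trials = 4.0, 5, 200_000

samples = rng.normal(10.0, np.sqrt(sigma2), size=(trials, n))
dev2 = (samples - samples.mean(axis=1, keepdims=True)) ** 2

biased = dev2.sum(axis=1) / n          # S^2: divide by n
unbiased = dev2.sum(axis=1) / (n - 1)  # sigma-hat^2: divide by n - 1

print(biased.mean())    # close to (n-1)/n * sigma2 = 3.2 -- too small
print(unbiased.mean())  # close to sigma2 = 4.0
```

The divide-by-\(n\) average settles near \(\frac{n-1}{n}\sigma^2\), while the divide-by-\(n-1\) average settles near the true \(\sigma^2\) — exactly the bias correction described above.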
2. Confidence Intervals: What Are We Really Saying?
A confidence interval (CI) is a range of values within which we believe the true population parameter lies, calculated with a certain degree of certainty (the confidence level).
The Fishing Analogy
Imagine you are trying to catch a specific fish (the true population mean, \(\mu\)) in a massive lake. You can't see the fish, but you can throw a net (your confidence interval) based on the small amount of water you sampled.
- Your Net Size: This is determined by the required level of confidence (e.g., 95% or 99%). A 99% CI (a bigger net) gives you a better chance of catching the fish, but the interval is wider (less precise).
- The Fish: The population parameter (\(\mu\)). It is fixed, but unknown.
If you construct 100 confidence intervals (i.e., take 100 samples and build 100 nets), a 95% confidence level means that, on average, about 95 of those nets will capture the true population mean \(\mu\); the rest will miss it. Note that any individual interval either contains \(\mu\) or it does not — the 95% describes the long-run success rate of the method, not the chance for one particular interval.
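The "net" interpretation can be checked by simulation. This sketch (hypothetical parameter values; NumPy assumed available) builds a large number of 95% intervals around samples from a population with known \(\mu\) and counts how often the interval captures it:

```python
import numpy as np

# Illustrative coverage check: known mu and sigma, many 95% intervals
rng = np.random.default_rng(1)
mu, sigma, n, trials = 50.0, 8.0, 25, 100_000
z = 1.9600                                  # 95% two-tailed critical value

samples = rng.normal(mu, sigma, size=(trials, n))
xbar = samples.mean(axis=1)
half = z * sigma / np.sqrt(n)               # margin of error (same every interval)
covered = (xbar - half <= mu) & (mu <= xbar + half)

print(covered.mean())                       # close to 0.95
```

The observed coverage proportion hovers near 0.95, matching the confidence level.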
Formula Structure
All confidence intervals follow this structure:
$$ \text{Estimate} \pm (\text{Critical Value} \times \text{Standard Error}) $$
The term \((\text{Critical Value} \times \text{Standard Error})\) is called the Margin of Error.
Did you know? The most common confidence level in scientific research is 95%. This corresponds to a significance level (\(\alpha\)) of 0.05.
3. Calculating Confidence Intervals for the Mean (\(\mu\))
The method used for calculating the CI for the mean depends entirely on two factors: sample size (\(n\)) and whether the population variance (\(\sigma^2\)) is known.
A. CI for \(\mu\) when Population Variance (\(\sigma^2\)) is KNOWN
If we know the population variance, we use the Normal distribution (Z-scores), regardless of the sample size \(n\), thanks to the Central Limit Theorem (CLT).
Step-by-Step Z-Interval
- Check Assumptions: Either the population is Normally distributed, OR \(n\) is large (\(n > 30\)) for CLT to apply.
- Identify Values: Find \(\bar{x}\), \(\sigma\), \(n\), and the confidence level.
- Find the Critical Z-Value (\(z\)): Use the Normal distribution tables (or calculator) for the required confidence level.
- Calculate the Interval: $$ \bar{x} \pm z \times \frac{\sigma}{\sqrt{n}} $$
Quick Review: Common Z-Values:
(These are crucial values you will use constantly; each is a two-tailed critical value, leaving half of the remaining probability in each tail):
- 90% Confidence: \(z = 1.6449\)
- 95% Confidence: \(z = 1.9600\)
- 99% Confidence: \(z = 2.5758\)
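The tabulated values above can be reproduced, and a full Z-interval assembled, with a short sketch (the summary statistics here are hypothetical; `scipy.stats` is assumed available):

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical data summary: n = 36, known sigma = 4.5, sample mean 20.2
xbar, sigma, n, level = 20.2, 4.5, 36, 0.95

# Two-tailed critical value: the point with (1 + level)/2 probability below it
z = norm.ppf((1 + level) / 2)      # ~1.9600 for a 95% interval
half = z * sigma / sqrt(n)         # margin of error
ci = (xbar - half, xbar + half)
print(ci)                          # roughly (18.73, 21.67)
```

Changing `level` to 0.90 or 0.99 reproduces the other tabulated z-values, confirming the pattern that higher confidence means a wider interval.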
B. CI for \(\mu\) when Population Variance (\(\sigma^2\)) is UNKNOWN
When \(\sigma^2\) is unknown, we must estimate it using the sample variance, \(\hat{\sigma}^2\). When we do this, especially with small samples, the distribution of the test statistic is no longer Normal. Instead, we use the Student's t-distribution.
Important Point: The $t$-distribution is flatter than the Normal distribution, with heavier tails, reflecting the extra uncertainty that comes from estimating $\sigma^2$ from the sample data.
Step-by-Step T-Interval
- Check Assumptions: We must assume the original population is Normally distributed. (If \(n\) is very large, the $t$-distribution approximates the Normal distribution).
- Calculate \(\hat{\sigma}\): Take the square root of the unbiased variance estimate, \(\hat{\sigma} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}\). Equivalently, \(\hat{\sigma} = s\sqrt{\frac{n}{n-1}}\), where \(s^2\) is the divide-by-\(n\) sample variance.
- Determine Degrees of Freedom (\(v\)): \(v = n-1\).
- Find the Critical T-Value (\(t_v\)): Use the $t$-distribution tables with \(v = n-1\) and the required confidence level (two-tailed).
- Calculate the Interval: $$ \bar{x} \pm t_v \times \frac{\hat{\sigma}}{\sqrt{n}} $$
Don't worry if this seems tricky at first. The main difference between Z and T is simply which table you look up! Remember: Unknown variance? Use T!
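To make the Z-versus-T switch concrete, here is a sketch of a 95% T-interval on a small made-up data set (the values are purely illustrative; NumPy and `scipy.stats` are assumed available):

```python
from math import sqrt
import numpy as np
from scipy.stats import t

# Purely illustrative sample of n = 8 observations; sigma^2 unknown
data = np.array([12.1, 11.4, 13.0, 12.6, 11.9, 12.8, 12.2, 11.7])
n = len(data)
xbar = data.mean()
sigma_hat = data.std(ddof=1)        # ddof=1 gives the n-1 (unbiased) divisor

v = n - 1                           # degrees of freedom
t_crit = t.ppf(0.975, df=v)         # 95% two-tailed critical value
half = t_crit * sigma_hat / sqrt(n)
print((xbar - half, xbar + half))
```

The only structural changes from the Z-interval are the `ddof=1` estimate of \(\hat{\sigma}\) and looking up `t.ppf` with \(v = n-1\) instead of `norm.ppf`.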
4. Confidence Intervals for Proportion (\(p\))
If we are investigating a binary outcome (e.g., success/failure, yes/no), we are interested in the population proportion, \(p\). We estimate \(p\) using the sample proportion, \(\hat{p}\).
For the CI calculation to be valid, we usually require a large enough sample size such that \(n\hat{p} > 5\) and \(n(1-\hat{p}) > 5\).
Step-by-Step Proportion Interval
- Identify Values: The number of successes \(x\), sample size \(n\), and \(\hat{p} = x/n\).
- Standard Error: The standard error of the sample proportion is calculated using the estimate: $$ \text{SE} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} $$
- Find the Critical Z-Value (\(z\)): Use the Normal distribution table (just like the known variance case).
- Calculate the Interval: $$ \hat{p} \pm z \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} $$
Common Mistake to Avoid: When dealing with proportion, you always use the Z-distribution (the Normal approximation) because, for sufficiently large samples, the distribution of \(\hat{p}\) is approximately Normal.
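The steps above can be sketched in a few lines (hypothetical survey numbers; `scipy.stats` assumed available):

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical survey: x = 60 successes out of n = 150 respondents
x, n, level = 60, 150, 0.95
p_hat = x / n                            # sample proportion, 0.4

# Check the validity conditions n*p_hat > 5 and n*(1 - p_hat) > 5
assert n * p_hat > 5 and n * (1 - p_hat) > 5

se = sqrt(p_hat * (1 - p_hat) / n)       # standard error of p-hat
z = norm.ppf((1 + level) / 2)
half = z * se
print((p_hat - half, p_hat + half))      # roughly (0.32, 0.48)
```

Note the critical value comes from the Normal table regardless of \(n\), exactly as the common-mistake warning says.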
5. Confidence Interval for Population Variance (\(\sigma^2\))
Sometimes, the spread of the data is more important than the average (e.g., consistency in manufacturing). To calculate the CI for the population variance, \(\sigma^2\), we must assume the population is Normally distributed, and we use the Chi-Squared (\(\chi^2\)) distribution.
The statistic used is:
$$ \chi^2 = \frac{(n-1)\hat{\sigma}^2}{\sigma^2} $$
The \(\chi^2\) distribution is:
- Defined only for positive values.
- Asymmetric (it is skewed to the right).
- Defined by the degrees of freedom, \(v = n-1\).
Step-by-Step Chi-Squared Interval
- Check Assumptions: The population must be Normally distributed.
- Identify Values: Calculate the unbiased variance estimate, \(\hat{\sigma}^2\), and the degrees of freedom, \(v = n-1\).
- Find Critical \(\chi^2\) Values: Since the distribution is asymmetric, you need two critical values from the \(\chi^2\) table (with \(v = n-1\)). For a 95% CI:
- \(\chi^2_L\): The smaller critical value — the 2.5th percentile, with probability 0.975 above it.
- \(\chi^2_R\): The larger critical value — the 97.5th percentile, with probability 0.025 above it.
- Calculate the Interval (for \(\sigma^2\)): $$ \left( \frac{(n-1)\hat{\sigma}^2}{\chi^2_{R}}, \frac{(n-1)\hat{\sigma}^2}{\chi^2_{L}} \right) $$
Note the flip! Notice how the smaller \(\chi^2_L\) value is used for the upper bound of the variance interval, and the larger \(\chi^2_R\) value is used for the lower bound. This is because \(\sigma^2\) is in the denominator of the test statistic.
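The flip is easiest to see in code. A minimal sketch with hypothetical numbers (`scipy.stats` assumed available):

```python
from scipy.stats import chi2

# Hypothetical: n = 20 observations, unbiased variance estimate 2.5, 95% CI
n, var_hat = 20, 2.5
v = n - 1

chi_L = chi2.ppf(0.025, df=v)    # smaller value (2.5th percentile)
chi_R = chi2.ppf(0.975, df=v)    # larger value (97.5th percentile)

# The flip: the LARGER chi^2 value produces the LOWER bound for sigma^2
lower = (n - 1) * var_hat / chi_R
upper = (n - 1) * var_hat / chi_L
print((lower, upper))            # roughly (1.45, 5.33)
```

Because \(\sigma^2\) sits in the denominator of the statistic, dividing by the bigger critical value gives the smaller bound, and vice versa — also notice the interval is not symmetric about \(\hat{\sigma}^2\), unlike the Z and T intervals.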
Key Takeaway on Distributions for CI:
- Mean (\(\mu\)), $\sigma$ known: Z
- Mean (\(\mu\)), $\sigma$ unknown: T (d.o.f. \(n-1\))
- Proportion (\(p\)): Z
- Variance (\(\sigma^2\)): \(\chi^2\) (d.o.f. \(n-1\))
6. Hypothesis Tests for Single Samples
Hypothesis testing (covered in detail in earlier modules) determines if there is enough statistical evidence to reject a null hypothesis (\(H_0\)) in favour of an alternative hypothesis (\(H_1\)). In S3, we apply these standard procedures using the Z, T, and $\chi^2$ distributions.
A. Tests for the Population Mean (\(\mu\))
The procedure depends on the known/unknown variance, just like confidence intervals:
Case 1: \(\sigma^2\) is Known (Z-Test)
Test Statistic: $$ Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} $$
We compare the calculated \(Z\) value against the critical $Z$ value from the Normal distribution.
Case 2: \(\sigma^2\) is Unknown (T-Test)
We use the unbiased estimate \(\hat{\sigma}\) for the standard deviation.
Test Statistic: $$ T = \frac{\bar{X} - \mu_0}{\hat{\sigma} / \sqrt{n}} $$
We compare the calculated \(T\) value against the critical $T$ value from the $t$-distribution, using \(v = n-1\) degrees of freedom.
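The T-test procedure can be sketched as follows (hypothetical summary statistics; `scipy.stats` assumed available):

```python
from math import sqrt
from scipy.stats import t

# Hypothetical two-tailed test of H0: mu = 50 vs H1: mu != 50 at alpha = 0.05
n, xbar, sigma_hat, mu0 = 12, 52.3, 4.1, 50.0

T = (xbar - mu0) / (sigma_hat / sqrt(n))   # test statistic
t_crit = t.ppf(0.975, df=n - 1)            # two-tailed critical value
reject = abs(T) > t_crit
print(T, t_crit, reject)                   # here |T| < t_crit, so do not reject H0
```

The Z-test is identical in structure: swap in the known \(\sigma\) and take the critical value from `norm.ppf` instead.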
B. Tests for the Population Variance (\(\sigma^2\))
If we want to test whether the population variance is equal to a specified value, \(\sigma_0^2\), we use the Chi-Squared test.
Hypotheses Example:
\(H_0: \sigma^2 = \sigma_0^2\)
\(H_1: \sigma^2 \neq \sigma_0^2\) (Two-tailed) or \(H_1: \sigma^2 > \sigma_0^2\) (One-tailed)
The Test Statistic (\(\chi^2\))
Test Statistic: $$ \chi^2 = \frac{(n-1)\hat{\sigma}^2}{\sigma_0^2} $$
We compare the calculated \(\chi^2\) value against the critical \(\chi^2\) value(s) from the tables, using \(v = n-1\) degrees of freedom.
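A matching sketch for the variance test, using made-up numbers (`scipy.stats` assumed available):

```python
from scipy.stats import chi2

# Hypothetical one-tailed test: H0: sigma^2 = 0.8 vs H1: sigma^2 > 0.8, alpha = 0.05
n, var_hat, sigma0_sq = 15, 1.3, 0.8

stat = (n - 1) * var_hat / sigma0_sq       # chi-squared test statistic
crit = chi2.ppf(0.95, df=n - 1)            # upper-tail critical value
print(stat, crit, stat > crit)             # stat falls just below crit: do not reject
```

Even though the sample variance 1.3 looks well above 0.8, the statistic does not reach the critical region here — a reminder that the \(\chi^2\) distribution is heavily right-skewed for small \(v\).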
Step-by-Step Hypothesis Test Summary:
- State \(H_0\) and \(H_1\): Define the null and alternative hypotheses clearly.
- State Significance Level (\(\alpha\)): Usually 5% (0.05) or 1% (0.01).
- Calculate Test Statistic: Use the appropriate formula (Z, T, or \(\chi^2\)).
- Find Critical Value(s): Look up the table using the correct distribution and degrees of freedom (if necessary).
- Conclusion: Compare the test statistic to the critical value(s) or use the $p$-value. Reject \(H_0\) if the test statistic falls in the critical region.
- Contextual Answer: Write a concluding sentence related back to the original problem.
Engagement Tip: Connection Between CI and Testing
A 95% confidence interval can be used to perform a two-tailed hypothesis test at the 5% significance level. If the hypothesized value (\(\mu_0\) or \(\sigma_0^2\)) falls outside the confidence interval, you reject \(H_0\). If it falls inside, you do not reject \(H_0\).
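This duality can be checked numerically. Here is a sketch with invented summary statistics, reusing the T-interval and T-test machinery (`scipy.stats` assumed available):

```python
from math import sqrt
from scipy.stats import t

# Invented sample summary; 95% CI vs two-tailed test at the 5% level
n, xbar, sigma_hat = 12, 52.3, 4.1
t_crit = t.ppf(0.975, df=n - 1)
half = t_crit * sigma_hat / sqrt(n)
ci = (xbar - half, xbar + half)

for mu0 in (50.0, 55.5):
    T = (xbar - mu0) / (sigma_hat / sqrt(n))
    reject = abs(T) > t_crit                 # test decision
    outside = not (ci[0] <= mu0 <= ci[1])    # CI decision
    print(mu0, reject, outside)              # the two decisions agree in each case
```

The two decisions agree because \(|T| > t_{\text{crit}}\) rearranges algebraically to \(\mu_0\) lying outside \(\bar{x} \pm t_{\text{crit}} \cdot \hat{\sigma}/\sqrt{n}\).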
Final Key Takeaway
The entire chapter hinges on recognizing which distribution (Z, T, or \(\chi^2\)) is required based on two factors: which parameter you are estimating (\(\mu\), \(p\), or \(\sigma^2\)), and whether the population variance is known or unknown.