Welcome to Goodness of Fit and Contingency Tables!
Hi there! Welcome to a crucial chapter in Statistics 3. This section is all about becoming a data detective, helping us determine if the data we observe in the real world fits a specific pattern or if two variables are truly connected.
Don't worry if this seems tricky at first; we will break down the powerful Chi-Squared ($\chi^2$) test step-by-step. By the end of this, you’ll be able to confidently test statistical models and relationships!
Why is this chapter important?
- It allows us to validate theoretical models (like Poisson or Binomial distributions) against real-world observations.
- It provides a formal method to test relationships between categories (e.g., "Is preference for football independent of age group?").
- It is a fundamental concept in advanced statistical analysis.
Section 1: The Chi-Squared ($\chi^2$) Test Statistic
The Chi-Squared test is the engine behind both "Goodness of Fit" and "Contingency Table" analysis. It measures how much our observed frequencies ($O_i$) deviate from the expected frequencies ($E_i$).
What does the $\chi^2$ statistic measure?
Imagine you are throwing darts at a target. You expect the darts to land mostly in the bullseye region (the expected pattern). The $\chi^2$ statistic summarises how far, in total, your actual throws (the observed data) landed from where they were expected to land.
If the calculated $X^2$ value is small, the observed data fits the expectation well. If it is large (greater than the critical value), the data is unlikely to have arisen under the expected pattern, and we reject the null hypothesis.
The Formula for the Test Statistic, $X^2$
The test statistic, $X^2$, is calculated as the sum of squared differences between observed and expected frequencies, weighted by the expected frequency:
$$X^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
- $O_i$: The Observed Frequency in category $i$. This is the raw data you collected.
- $E_i$: The Expected Frequency in category $i$. This is what theory or the null hypothesis suggests should happen.
- The sum ($\sum$) is taken over all categories or cells.
Quick Review: The $X^2$ value is never negative because the differences are squared, and it equals zero only when every observed frequency matches its expected frequency exactly. A large $X^2$ means a bad fit.
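The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `chi_squared_statistic` and the die-rolling data are assumptions made up for the example.

```python
def chi_squared_statistic(observed, expected):
    """Return X^2 = sum((O_i - E_i)^2 / E_i) over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a die, tested against a fair-die model.
observed = [8, 12, 9, 11, 13, 7]
expected = [10] * 6  # E_i = 60 * (1/6) for each face

x2 = chi_squared_statistic(observed, expected)
print(round(x2, 2))  # 2.8
```

Here the squared differences are 4, 4, 1, 1, 9 and 9, each divided by 10, which gives $X^2 = 2.8$.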
Section 2: Goodness of Fit Tests (GoF)
A Goodness of Fit test checks whether a sample of data comes from a population with a specified distribution (e.g., Uniform, Normal, Poisson, Binomial, or simply fixed probabilities).
Step-by-Step Guide to a Goodness of Fit Test
Step 1: State the Hypotheses
The hypotheses for a Goodness of Fit test are always structured this way:
$$H_0: \text{The data follows the specified distribution (e.g., Poisson, Uniform, or specific probabilities).}$$
$$H_1: \text{The data does not follow the specified distribution.}$$
Note: The $\chi^2$ test is always a one-tailed test, focusing only on the upper critical value, as we are only interested in whether $X^2$ is too large (i.e., the fit is too poor).
Step 2: Calculate Expected Frequencies ($E_i$)
This is where you use the theoretical distribution specified in $H_0$ and the total sample size ($N$).
If testing against fixed probabilities ($p_i$):
$$E_i = N \times p_i$$
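As a small illustration of $E_i = N \times p_i$, the sketch below uses a hypothetical genetics-style example in which four categories are expected in the ratio 9:3:3:1 among 160 observations. The function name and the data are assumptions chosen for illustration.

```python
def expected_from_probabilities(n, probabilities):
    """Return the expected frequencies E_i = N * p_i."""
    if abs(sum(probabilities) - 1) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return [n * p for p in probabilities]


# Hypothetical model: categories expected in the ratio 9:3:3:1, with N = 160.
ratios = [9, 3, 3, 1]
probabilities = [r / sum(ratios) for r in ratios]
expected = expected_from_probabilities(160, probabilities)
print(expected)  # [90.0, 30.0, 30.0, 10.0]
```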
If testing against a specific distribution (e.g., Poisson):
1. Find the necessary parameter(s) for the distribution (e.g., $\lambda$ for Poisson or $p$ for Binomial).
2. Use the distribution's formula to find the probability $P(X=x)$ for each category.
3. Calculate $E_i = N \times P(X=x)$.
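The three steps above might look like this for a Poisson model. This is a sketch under stated assumptions: the data are hypothetical, the last category is treated as "4 or more", and $\lambda$ is estimated by the sample mean with that last category counted as exactly 4, which is a simplification.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


def poisson_expected(observed):
    """observed[x] is the frequency of the value x; the last entry is 'x or more'."""
    n = sum(observed)
    # Step 1: estimate lambda with the sample mean (assumption: last cell counted as its lower bound).
    lam = sum(x * o for x, o in enumerate(observed)) / n
    # Step 2: probabilities for each category; the final category absorbs the upper tail.
    probs = [poisson_pmf(x, lam) for x in range(len(observed) - 1)]
    probs.append(1 - sum(probs))
    # Step 3: E_i = N * P(X = x).
    return lam, [n * p for p in probs]


# Hypothetical frequencies for x = 0, 1, 2, 3 and 4 or more.
observed = [22, 35, 25, 12, 6]
lam, expected = poisson_expected(observed)
print(lam)
print([round(e, 2) for e in expected])
```

Making the final category open-ended ("4 or more") keeps the expected frequencies summing to $N$, which the degrees of freedom calculation relies on.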
Step 3: Check Conditions and Combine Cells (Pooling)
The $\chi^2$ test relies on an approximation. This approximation is only reliable if the expected frequencies are large enough.
Crucial Condition: Every expected frequency ($E_i$) must be greater than or equal to 5 ($E_i \ge 5$).
If any $E_i < 5$, you must pool (combine) that category with the adjacent category until the combined expected frequency is $\ge 5$. This applies to both the observed and expected values in those categories.
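One simple way to automate pooling is sketched below. It is only one possible strategy (a greedy left-to-right merge, with any undersized leftover folded into the last pooled category); the function name and the data are hypothetical.

```python
def pool_categories(observed, expected, minimum=5):
    """Greedily merge adjacent categories until every pooled E_i >= minimum."""
    pooled_o, pooled_e = [], []
    acc_o, acc_e, pending = 0, 0.0, 0
    for o, e in zip(observed, expected):
        acc_o += o
        acc_e += e
        pending += 1
        if acc_e >= minimum:
            # The accumulated cell is large enough: close it off.
            pooled_o.append(acc_o)
            pooled_e.append(acc_e)
            acc_o, acc_e, pending = 0, 0.0, 0
    if pending:
        # Leftover cells at the end are too small: merge them into the previous cell.
        if pooled_e:
            pooled_o[-1] += acc_o
            pooled_e[-1] += acc_e
        else:
            pooled_o.append(acc_o)
            pooled_e.append(acc_e)
    return pooled_o, pooled_e


# Hypothetical frequencies: the last two expected values are below 5.
observed = [18, 30, 26, 15, 7, 3, 1]
expected = [16.5, 29.8, 26.8, 16.1, 7.2, 2.6, 1.0]

pooled_o, pooled_e = pool_categories(observed, expected)
print(pooled_o)  # [18, 30, 26, 15, 11]
print(len(pooled_e))  # 5
```

The last three categories (expected 7.2, 2.6 and 1.0) end up as a single category with expected frequency 10.8, so $k = 5$ after pooling. In an exam you would normally pool by hand, choosing the merge that makes most sense for the data.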
Step 4: Calculate the Test Statistic ($X^2$)
Apply the formula using the final (potentially pooled) $O_i$ and $E_i$ values:
$$X^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
Step 5: Determine Degrees of Freedom ($\nu$)
The Degrees of Freedom ($\nu$) is perhaps the most challenging part of the GoF test, so pay close attention!
The general formula for GoF is:
$$\nu = (\text{Number of categories, } k) - 1 - (\text{Number of parameters estimated, } m)$$
- $k$: The number of categories after any necessary pooling/combining of cells.
- The $-1$: This is always subtracted because the total of the expected frequencies must equal the total of the observed frequencies (meaning the last category's frequency is fixed once the others are known).
- $m$: This is the number of parameters you had to estimate from the sample data in order to calculate $E_i$.
- If you are testing against fixed probabilities (e.g., $P(\text{Heads})=0.5$), then $m=0$.
- If you had to calculate $\lambda$ (for Poisson) or $p$ (for Binomial/Geometric) from the sample data, then $m=1$.
- If you had to calculate $\mu$ and $\sigma$ (for Normal distribution), then $m=2$.
Memory Aid: If you had to use the sample data to find a number that goes into the theoretical formula, you "lose" a degree of freedom for that estimated parameter.
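The rule $\nu = k - 1 - m$ can be checked with a tiny helper. The function name and the three scenarios are hypothetical, chosen only to mirror the cases listed above.

```python
def gof_degrees_of_freedom(k, m=0):
    """Return nu = k - 1 - m for a goodness of fit test.

    k: number of categories after pooling
    m: number of parameters estimated from the sample data
    """
    nu = k - 1 - m
    if nu < 1:
        raise ValueError("too few categories for the number of estimated parameters")
    return nu


print(gof_degrees_of_freedom(6))        # fair die, fixed probabilities: 5
print(gof_degrees_of_freedom(5, m=1))   # Poisson with lambda estimated, 5 categories: 3
print(gof_degrees_of_freedom(8, m=2))   # Normal with mu and sigma estimated, 8 categories: 5
```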
Step 6: Comparison and Conclusion
1. Find the critical value for the $\chi^2$ distribution using your calculated $\nu$ and the significance level ($\alpha$) provided in the question.
2. Compare:
- If $X^2 \le \text{Critical Value}$: Do not reject $H_0$. There is insufficient evidence to conclude that the data does not fit the specified distribution.
- If $X^2 > \text{Critical Value}$: Reject $H_0$. There is sufficient evidence that the data does NOT fit the specified distribution.
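The comparison step can be sketched as follows. The dictionary holds the standard 5% upper-tail critical values of the $\chi^2$ distribution for $\nu = 1$ to $6$, as found in statistical tables; the function name and the test values are assumptions for illustration. Note that a small $X^2$ leads to "do not reject $H_0$", which is weaker than proving the model is correct.

```python
# Upper-tail critical values of the chi-squared distribution at the 5% level.
CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592}


def conclude(x2, nu, table=CRITICAL_5_PERCENT):
    """Compare the test statistic with the critical value and state the decision."""
    critical = table[nu]
    if x2 > critical:
        return f"X^2 = {x2:.3f} > {critical:.3f}: reject H0"
    return f"X^2 = {x2:.3f} <= {critical:.3f}: do not reject H0"


# Hypothetical results: a fair-die test with nu = 5.
print(conclude(2.8, 5))   # X^2 = 2.800 <= 11.070: do not reject H0
print(conclude(12.0, 5))  # X^2 = 12.000 > 11.070: reject H0
```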
Section 3: Contingency Tables (Test of Independence)
Contingency tables are used when we want to investigate the relationship between two categorical variables. This is called a Test of Independence.
Example: Is there an association between a person's favourite genre of music (Pop, Rock, Classical) and their primary mode of transport (Car, Bus, Bike)?
The Goal: Testing Independence
If two variables are independent, knowing the value of one variable tells you nothing about the value of the other. The test checks if the patterns observed in the data could have occurred just by chance if the variables were, in fact, independent.
Step 1: State the Hypotheses
$$H_0: \text{The two variables are independent (i.e., there is no association).}$$
$$H_1: \text{The two variables are not independent (i.e., there is an association).}$$
Step 2: Calculate Expected Frequencies ($E_{ij}$)
In a contingency table (with $r$ rows and $c$ columns), the expected frequency for any specific cell $(i, j)$ is calculated based on the assumption of independence ($H_0$):
$$E_{ij} = \frac{(\text{Row Total}) \times (\text{Column Total})}{\text{Grand Total}}$$
Analogy: If 60% of people prefer Rock music, and 50% of people take the Bus, then, assuming independence, the proportion of people who prefer Rock AND take the Bus should be $0.60 \times 0.50 = 0.30$. We then multiply this probability by the Grand Total to find the expected count.
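The row-total times column-total rule can be sketched in Python as below. The function name and the music/transport counts are hypothetical, invented to match the example above.

```python
def expected_frequencies(table):
    """Return E_ij = (row total * column total) / grand total for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical counts: rows = Pop, Rock, Classical; columns = Car, Bus, Bike.
observed = [
    [30, 20, 10],
    [25, 25, 10],
    [15, 25, 40],
]

expected = expected_frequencies(observed)
for row in expected:
    print(row)
# [21.0, 21.0, 18.0]
# [21.0, 21.0, 18.0]
# [28.0, 28.0, 24.0]
```

For example, the Pop/Car cell has row total 60, column total 70 and grand total 200, so $E_{11} = 60 \times 70 / 200 = 21$.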
Step 3: Check Conditions
Just like the GoF test, all expected frequencies ($E_{ij}$) must be $\ge 5$. If any cell has $E_{ij} < 5$, you must combine rows or columns (pool) until this condition is met. Combining must be logical (e.g., combining two similar age groups).
Step 4: Calculate the Test Statistic ($X^2$)
The formula remains the same, but the sum is taken over all cells in the table:
$$X^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Step 5: Determine Degrees of Freedom ($\nu$)
For a contingency table with $r$ rows and $c$ columns, the calculation for $\nu$ is much simpler than GoF:
$$\nu = (r-1)(c-1)$$
Note: If you combined rows/columns due to small expected frequencies, use the number of rows/columns after combining.
Step 6: Comparison and Conclusion
Follow the same process as the GoF test: compare $X^2$ to the critical value from the $\chi^2$ table at the relevant significance level ($\alpha$) and degrees of freedom ($\nu$).
If $X^2$ is large (greater than the critical value), Reject $H_0$ and conclude there is evidence of an association between the variables.
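Putting the contingency table steps together, a complete test might be sketched like this. The $2 \times 3$ table is hypothetical, the function name is an assumption, and 5.991 is the standard 5% critical value for $\nu = 2$.

```python
def independence_test(table):
    """Return (X^2, nu) for a test of independence on a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    x2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # E_ij under H0
            x2 += (o - e) ** 2 / e

    nu = (len(table) - 1) * (len(table[0]) - 1)  # (r - 1)(c - 1)
    return x2, nu


# Hypothetical counts: 2 groups (rows) by 3 preferences (columns).
observed = [
    [20, 30, 50],
    [30, 30, 40],
]

x2, nu = independence_test(observed)
critical_value = 5.991  # 5% level, nu = 2
print(round(x2, 3), nu)  # 3.111 2
print("reject H0" if x2 > critical_value else "do not reject H0")  # do not reject H0
```

Every expected frequency here is 25, 30 or 45, so the $E_{ij} \ge 5$ condition holds and no pooling is needed. Since $3.111 < 5.991$, there is insufficient evidence of an association in this made-up data.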
- Goodness of Fit: Tests one sample against a single theoretical distribution (e.g., Is this data Poisson?). $\nu = k - 1 - m$.
- Contingency Table: Tests if two variables are associated (e.g., Is gender related to food preference?). $\nu = (r-1)(c-1)$.
Section 4: Common Mistakes and Summary
Common Mistakes to Avoid
- Using Observed Frequencies for the Condition Check: Students often check if $O_i \ge 5$. This is incorrect! You MUST check that $E_i \ge 5$.
- Degrees of Freedom Error (GoF): Forgetting to subtract $m$ when parameters (like $\lambda$ or $p$) were estimated from the sample data.
- Degrees of Freedom Error (Contingency): Using the number of cells instead of $(r-1)(c-1)$.
- Misstating Hypotheses: Swapping $H_0$ and $H_1$. $H_0$ always assumes the expected situation (the fit is good / the variables are independent).
- Forgetting to Pool: Failing to combine cells when $E_i < 5$, leading to an unreliable test result.
Final Check List
- Have you stated $H_0$ and $H_1$ clearly?
- Are all Expected Frequencies calculated correctly?
- Have you checked the $E_i \ge 5$ condition and pooled if necessary?
- Is the value for $\nu$ correct (especially considering parameter estimation $m$ in GoF)?
- Is the calculated $X^2$ statistic correct?
- Is the conclusion written in context, clearly stating whether $H_0$ is rejected or not rejected?
You've mastered the fundamentals of the Chi-Squared test! This tool is incredibly versatile and powerful for making robust statistical inferences.