Unit S2: Statistics 2 – Continuous Random Variables

Hello Statisticians! Getting Started with Continuous Distributions

Welcome to one of the most fundamental chapters in S2: Continuous Random Variables (CRVs). Don't worry if the formulas involving integrals look intimidating: this chapter is essentially about applying your calculus skills (integration and differentiation) to solve probability problems! We are moving beyond simple counts (discrete variables) and into measuring things like time, height, and temperature.

Why is this important? Real-world phenomena often don't fall into neat, countable bins. If you measure how long a battery lasts, it could be 100.5 hours, 100.51 hours, or 100.5103 hours. CRVs help us model these situations accurately.


1. Understanding Continuous Random Variables (CRVs)

What makes a variable continuous?

A Random Variable, let's call it \(X\), is continuous if it can take any value within a specified range (an interval). Unlike a discrete variable (where \(X\) might only take values such as 0, 1, 2, 3...), a CRV can take infinitely many values between any two points in its range.

Example: The time, \(T\), a customer waits in a queue. \(T\) could be 2 minutes, 2.3 minutes, 2.3001 minutes, etc.

Key Distinction: Probability at a Single Point

This is a crucial concept that often confuses students:

Because probability is spread across a whole interval of values, and a single point has zero width (so zero area above it), the probability that a continuous variable takes any exact specific value is always zero.

$$P(X = x) = 0$$

Analogy: Imagine trying to hit a target that is a single point on a long line. The chance of hitting that precise, infinitesimally small spot is zero.

What this means for calculations:

$$P(a < X < b) = P(a \leq X \leq b) = P(a < X \leq b)$$

The inclusion or exclusion of the endpoints does not matter because \(P(X=a)=0\) and \(P(X=b)=0\).
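To see the zero-probability idea numerically, here is a minimal sketch. It assumes an illustrative density \(f(x) = 2x\) on \(0 \leq x \leq 1\) (not taken from these notes), for which the area between two limits \(lo\) and \(hi\) is \(hi^2 - lo^2\). The function name `prob_within` is made up for this example.

```python
def prob_within(x0, eps):
    """P(x0 - eps < X < x0 + eps) for the illustrative density f(x) = 2x on [0, 1]."""
    # Trim the interval so it stays inside the range where f(x) is non-zero.
    lo = max(0.0, x0 - eps)
    hi = min(1.0, x0 + eps)
    # Area under 2x between lo and hi is hi^2 - lo^2.
    return hi**2 - lo**2


# Shrink the interval around x0 = 0.5: the probability shrinks with it.
for eps in (0.1, 0.01, 0.001, 0.0001):
    print(eps, prob_within(0.5, eps))

# An interval of zero width has zero probability.
print(prob_within(0.5, 0.0))
```

Each printed probability is roughly \(2 \times\) `eps`, so it heads towards zero as the interval closes in on the single point \(x = 0.5\).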

Key Takeaway: CRVs deal with intervals, and probability is found by measuring the area under a curve, not summing individual points.


2. The Probability Density Function (PDF), \(f(x)\)

Since we can't assign probability to single points, we use a function, \(f(x)\), to describe how the probability is distributed across the range of possible values. This is the Probability Density Function (PDF).

Properties of a Valid PDF

For any function \(f(x)\) to be a valid PDF for a random variable \(X\):

  1. Non-Negative: The function must never be negative for any value of \(x\) in its range. (You can't have negative probability!)
    $$f(x) \geq 0 \text{ for all } x$$
  2. Total Area is One: The total area under the entire curve must equal 1 (representing 100% of all possible outcomes).
    $$\int_{-\infty}^{\infty} f(x) dx = 1$$
    Note: Since most PDFs are non-zero only on a specific interval, say \([a, b]\), this usually simplifies to:
    $$\int_{a}^{b} f(x) dx = 1$$
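The two conditions can be checked numerically. The sketch below is one possible approach, not an exam method: it uses a simple midpoint rule for the area and only spot-checks non-negativity on a grid. The function names and the three test densities are illustrative assumptions.

```python
def integrate_midpoint(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def is_valid_pdf(f, a, b, n=10_000, tol=1e-6):
    """Check both PDF conditions on [a, b], assuming f is 0 outside [a, b]."""
    # Condition 1: f(x) >= 0 at every grid point (a spot check, not a proof).
    non_negative = all(f(a + (b - a) * i / n) >= 0 for i in range(n + 1))
    # Condition 2: the total area is 1, up to a small numerical tolerance.
    area = integrate_midpoint(f, a, b, n)
    return non_negative and abs(area - 1) < tol


print(is_valid_pdf(lambda x: 3 * x**2, 0, 1))   # non-negative, area 1
print(is_valid_pdf(lambda x: x, 0, 1))          # non-negative, but area is only 1/2
print(is_valid_pdf(lambda x: 4 * x - 1, 0, 1))  # area 1, but negative for x < 1/4
```

Only the first function passes both conditions; the other two each fail one of them.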

Calculating Probabilities using the PDF

The probability that \(X\) lies between two values, \(a\) and \(b\), is the area under the PDF curve between those limits.

$$P(a < X < b) = \int_{a}^{b} f(x) dx$$

Step-by-Step Example: Finding the Constant 'k'

Suppose the PDF is defined as \(f(x) = kx\) for \(0 \leq x \leq 2\), and 0 otherwise.

  1. Apply the Total Area Rule: The integral over the defined range must equal 1. $$\int_{0}^{2} kx \ dx = 1$$
  2. Integrate: $$ \left[ \frac{kx^2}{2} \right]_{0}^{2} = 1 $$
  3. Substitute Limits: $$ \left( \frac{k(2)^2}{2} \right) - \left( \frac{k(0)^2}{2} \right) = 1 $$ $$ 2k - 0 = 1 $$
  4. Solve for k: $$ k = \frac{1}{2} $$

So the PDF is \(f(x) = \frac{x}{2}\) for \(0 \leq x \leq 2\), and 0 otherwise.
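The same calculation can be sketched in code as a way to check your working. This sketch assumes the density is a polynomial stored as a list of coefficients, and uses exact fractions so nothing is rounded. The helper `integrate_poly` is written for this example and is not a library function.

```python
from fractions import Fraction


def integrate_poly(coeffs, a, b):
    """Exact integral over [a, b] of c0 + c1*x + c2*x^2 + ... (coeffs = [c0, c1, ...])."""
    def antiderivative(x):
        x = Fraction(x)
        # Each term c*x^i integrates to c*x^(i+1) / (i+1).
        return sum(Fraction(c) / (i + 1) * x ** (i + 1) for i, c in enumerate(coeffs))
    return antiderivative(b) - antiderivative(a)


# Shape of the density without the constant: g(x) = x on [0, 2].
shape = [0, 1]
k = 1 / integrate_poly(shape, 0, 2)   # total area rule: k * (integral of g) = 1
pdf = [k * c for c in shape]          # coefficients of f(x) = x/2

print(k)                              # 1/2
print(integrate_poly(pdf, 0, 2))      # 1
print(integrate_poly(pdf, 1, 2))      # P(1 < X < 2) = 3/4
```

The last line shows that once \(k\) is known, any probability is found by integrating the same function between different limits.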

Did you know? The concept of probability density is exactly why we use integration. Integration is the mathematical tool designed to measure the area beneath a curve!

Common Mistake: Using the wrong limits of integration. For the total area rule, integrate over the whole range where \(f(x)\) is non-zero. For \(P(a < X < b)\), integrate from \(a\) to \(b\), trimming the limits if they extend beyond that range.

Key Takeaway: The PDF, \(f(x)\), tells us the shape of the distribution. Probabilities are areas, calculated by integration.


3. The Cumulative Distribution Function (CDF), \(F(x)\)

The Cumulative Distribution Function (CDF), \(F(x)\), gives the probability that the random variable \(X\) is less than or equal to a specific value \(x\).

$$F(x) = P(X \leq x)$$

Calculating the CDF from the PDF

To find \(F(x)\), you integrate the PDF, \(f(t)\), from the lowest possible value up to the point \(x\). We use \(t\) as the integration variable to avoid confusion with the limit \(x\).

$$F(x) = \int_{\text{Lowest Limit}}^{x} f(t) dt$$

Important Requirement: Defining \(F(x)\) Piecewise

A CDF must be defined for all real numbers, so it usually needs three parts:

  1. $$F(x) = 0 \text{ for } x < \text{Lower Boundary}$$
  2. $$F(x) = \int_{\text{Lower Boundary}}^{x} f(t) dt \text{ for } \text{Lower Boundary} \leq x \leq \text{Upper Boundary}$$
  3. $$F(x) = 1 \text{ for } x > \text{Upper Boundary}$$

Using the CDF to Find Probabilities

If you have the CDF, calculating probabilities over an interval is much faster, often avoiding further integration:

$$P(a < X < b) = F(b) - F(a)$$
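As a sketch of the three-part definition and the subtraction rule, take the density \(f(x) = \frac{x}{2}\) on \(0 \leq x \leq 2\) from the worked example in Section 2. Integrating \(\frac{t}{2}\) from 0 to \(x\) gives \(F(x) = \frac{x^2}{4}\) for the middle part. The function names below are illustrative.

```python
def cdf(x):
    """CDF for f(x) = x/2 on [0, 2], defined for every real x."""
    if x < 0:
        return 0.0        # below the lower boundary
    if x <= 2:
        return x**2 / 4   # integral of t/2 from 0 to x
    return 1.0            # above the upper boundary


def prob_between(a, b):
    """P(a < X < b) = F(b) - F(a)."""
    return cdf(b) - cdf(a)


print(prob_between(1, 2))      # 0.75
print(prob_between(0.5, 1.5))  # 0.5
print(prob_between(-3, 1))     # 0.25
```

The last call shows why the outer parts matter: a limit outside \([0, 2]\) is handled automatically because \(F(x)\) is 0 below the range and 1 above it.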

Reversing the Process: PDF from CDF

Since the CDF is the integral of the PDF, the PDF must be the derivative of the CDF!

$$f(x) = \frac{d}{dx} F(x) = F'(x)$$

Trick: Remember that integration (finding CDF) and differentiation (finding PDF) are inverse operations, just like they are in pure maths.
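A quick numerical check of this inverse relationship is sketched below. It assumes the CDF \(F(x) = \frac{x^2}{4}\) on \(0 \leq x \leq 2\) (from the density \(f(x) = \frac{x}{2}\) in Section 2) and estimates the gradient with a central difference. In an exam you would differentiate algebraically; `pdf_from_cdf` is an illustrative name.

```python
def cdf(x):
    """CDF for f(x) = x/2 on [0, 2]."""
    if x < 0:
        return 0.0
    if x <= 2:
        return x**2 / 4
    return 1.0


def pdf_from_cdf(F, x, h=1e-6):
    """Central-difference estimate of the derivative F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)


# Inside the range the estimate should be close to x/2.
for x in (0.5, 1.0, 1.5):
    print(x, round(pdf_from_cdf(cdf, x), 6))

# Outside the range the CDF is flat, so the density is 0.
print(pdf_from_cdf(cdf, 3.0))
```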

Key Takeaway: The CDF is the running total of probability. It always starts at 0 and ends at 1.


4. Measures of Location (Mode, Median, Mean)

These measures tell us where the distribution is centred or peaked.

4.1 The Mode

The Mode is the value of \(x\) where the probability density function \(f(x)\) is at its maximum (the peak of the curve).

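  • If \(f(x)\) is a simple function (like a quadratic or cubic), you find the mode by setting the first derivative to zero: \(f'(x) = 0\), and confirming it's a maximum within the range. If there is no stationary point in the range (for example, when \(f(x)\) is always increasing), the mode is at one of the endpoints.
  • If \(f(x)\) is piecewise (defined by different functions in different ranges), compare the values of \(f(x)\) at any stationary points within each piece and at the boundaries of each piece. The mode is the value of \(x\) where \(f(x)\) is largest.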
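Both cases can be illustrated with a rough grid search, sketched below. It simply evaluates the density at many points, endpoints included, and keeps the largest. The density \(6x(1-x)\) on \([0, 1]\) is an assumed example (its total area is 1), and `mode_on_grid` is an illustrative name. The answer is only as accurate as the grid spacing.

```python
def mode_on_grid(f, a, b, n=10_000):
    """Return the grid point in [a, b] where f is largest (endpoints included)."""
    grid = [a + (b - a) * i / n for i in range(n + 1)]
    return max(grid, key=f)


# Peak at a stationary point: f(x) = 6x(1 - x) on [0, 1] has f'(x) = 0 at x = 1/2.
peak_inside = mode_on_grid(lambda x: 6 * x * (1 - x), 0, 1)

# Peak at a boundary: f(x) = x/2 on [0, 2] is always increasing.
peak_at_edge = mode_on_grid(lambda x: x / 2, 0, 2)

print(peak_inside, peak_at_edge)
```

The second case is the one that catches people out: \(f'(x) = \frac{1}{2}\) is never zero, so the mode has to be at an endpoint.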

4.2 The Median (\(m\))

The Median is the value \(m\) that splits the distribution exactly in half. 50% of the probability lies below \(m\), and 50% lies above \(m\).

We find the median \(m\) by solving one of the following equations:

  1. Using the CDF: $$F(m) = 0.5$$
  2. Using the PDF: $$\int_{\text{Lower Limit}}^{m} f(x) dx = 0.5$$

Tip: Using the CDF is usually faster if you have already calculated it. If the equation has more than one solution, keep only the one that lies inside the range of \(X\).
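When \(F(m) = 0.5\) cannot be solved neatly by hand, it can be solved numerically. The sketch below uses bisection and assumes the CDF \(F(x) = \frac{x^2}{4}\) on \(0 \leq x \leq 2\) (from the density \(f(x) = \frac{x}{2}\) in Section 2), where the exact answer is \(m = \sqrt{2}\). The function `median` is written for this example.

```python
def cdf(x):
    """CDF for f(x) = x/2 on [0, 2]."""
    if x < 0:
        return 0.0
    if x <= 2:
        return x**2 / 4
    return 1.0


def median(F, lo, hi, tol=1e-12):
    """Solve F(m) = 0.5 by bisection, assuming F is continuous and increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) < 0.5:
            lo = mid   # less than half the probability lies below mid: move right
        else:
            hi = mid   # at least half lies below mid: move left
    return (lo + hi) / 2


m = median(cdf, 0, 2)
print(m)        # close to sqrt(2), about 1.4142
print(cdf(m))   # close to 0.5
```

By hand: \(\frac{m^2}{4} = 0.5\) gives \(m^2 = 2\), and the root inside the range is \(m = \sqrt{2}\).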

4.3 The Mean (Expected Value, \(E[X]\))

The Mean, or Expected Value (\(E[X]\) or \(\mu\)), is the "centre of mass" of the distribution. It is the weighted average of all possible values, where the weights are the densities \(f(x)\).

The formula for the mean is:

$$E[X] = \mu = \int_{-\infty}^{\infty} x f(x) dx$$

Expected Value of a Function

If you need to find the expected value of some function of \(X\), say \(g(X)\) (like \(X^2\) or \(3X+5\)), the general formula is:

$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x) dx$$

Memory Aid (for E[X]): Remember, for discrete variables, \(E[X] = \sum x P(X=x)\). For continuous variables, the summation sign (\(\sum\)) becomes an integral sign (\(\int\)), and \(P(X=x)\) becomes \(f(x) dx\). You just insert the \(x\) inside the integral alongside \(f(x)\).
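The general formula can be sketched as one small function that integrates \(g(x) f(x)\) numerically. The example assumes the density \(f(x) = \frac{x}{2}\) on \(0 \leq x \leq 2\) from Section 2, for which the exact values are \(E[X] = \frac{4}{3}\) and \(E[3X+5] = 9\). The midpoint rule and the name `expect` are choices made for this illustration.

```python
def expect(g, f, a, b, n=10_000):
    """Midpoint-rule estimate of E[g(X)] = integral of g(x) * f(x) over [a, b]."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * h      # midpoint of the i-th strip
        total += g(x) * f(x)
    return total * h


def pdf(x):
    """Density f(x) = x/2, used here only on [0, 2]."""
    return x / 2


mean = expect(lambda x: x, pdf, 0, 2)              # E[X], exactly 4/3
e_linear = expect(lambda x: 3 * x + 5, pdf, 0, 2)  # E[3X + 5], exactly 9

print(round(mean, 6), round(e_linear, 6))
```

Notice that \(E[3X+5] = 3E[X] + 5\), which is a useful check on the integral.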

Key Takeaway: Location measures are found using derivatives (Mode), equating the CDF to 0.5 (Median), or integrating \(x f(x)\) (Mean).


5. Measures of Spread (Variance and Standard Deviation)

These measures tell us how spread out the distribution is around the mean.

5.1 The Variance (\(\text{Var}[X]\))

The Variance is the average squared distance of \(X\) from the mean, \(\text{Var}[X] = E[(X-\mu)^2]\). Calculating it requires two quantities:

  1. \(E[X]\) (the mean, \(\mu\)), found as in Section 4.3.
  2. \(E[X^2]\).

Step 1: Calculating \(E[X^2]\)

Using the general expected value formula with \(g(x) = x^2\):

$$E[X^2] = \int_{-\infty}^{\infty} x^2 f(x) dx$$

Step 2: Applying the Formula

We use the computational formula for variance (which is much easier than integrating \((x-\mu)^2 f(x)\)):

$$\text{Var}[X] = E[X^2] - (E[X])^2$$

or

$$\text{Var}[X] = \left( \int_{-\infty}^{\infty} x^2 f(x) dx \right) - \mu^2$$

5.2 The Standard Deviation (\(\sigma\))

The Standard Deviation is simply the square root of the variance. It is preferred because it is measured in the same units as \(X\) and the mean, \(\mu\).

$$\sigma = \sqrt{\text{Var}[X]}$$

Quick Review: Steps to Find Variance

  1. Calculate \(\mu = E[X]\) by integrating \(x f(x)\).
  2. Calculate \(E[X^2]\) by integrating \(x^2 f(x)\).
  3. Calculate \(\text{Var}[X] = E[X^2] - (E[X])^2\).

Important Note: Do not round intermediate values! Keep your fractions or exact decimal values for \(E[X]\) and \(E[X^2]\) until the very last step to ensure accuracy in your final variance answer.
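The three steps, and the effect of rounding too early, can be sketched with exact fractions. The example assumes the density \(f(x) = \frac{x}{2}\) on \(0 \leq x \leq 2\) from Section 2, stored as polynomial coefficients. The helper `moment` is written for this illustration.

```python
from fractions import Fraction
from math import sqrt


def moment(pdf_coeffs, power, a, b):
    """Exact E[X^power] for a polynomial density c0 + c1*x + ... on [a, b]."""
    a, b = Fraction(a), Fraction(b)
    total = Fraction(0)
    for i, c in enumerate(pdf_coeffs):
        n = i + power + 1   # x^power * c*x^i integrates to c*x^n / n
        total += Fraction(c) * (b**n - a**n) / n
    return total


pdf = [0, Fraction(1, 2)]            # f(x) = x/2 on [0, 2]
mean = moment(pdf, 1, 0, 2)          # step 1: E[X]
mean_sq = moment(pdf, 2, 0, 2)       # step 2: E[X^2]
variance = mean_sq - mean**2         # step 3: Var[X]
sd = sqrt(variance)

print(mean, mean_sq, variance)       # 4/3 2 2/9
print(round(sd, 4))                  # 0.4714

# Rounding the mean to 1.3 before squaring gives a noticeably different answer.
rough = 2 - round(float(mean), 1) ** 2
print(round(rough, 2))               # 0.31
```

The exact variance is \(\frac{2}{9} \approx 0.222\), whereas the early-rounded version gives 0.31.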

Key Takeaway: Spread is measured by variance, which requires finding \(E[X]\) and \(E[X^2]\) using the density function.


6. Summary of Key Skills (The Calculus Toolkit)

Continuous random variables rely entirely on applying calculus skills within a statistical framework. Make sure you are comfortable with these operations:

| Goal | Mathematical Operation | Calculus Link |
|---|---|---|
| Finding Probability \(P(a < X < b)\) | Area under PDF | Integration \(\int_{a}^{b} f(x) dx\) |
| Finding CDF \(F(x)\) | Cumulative Area | Integration \(\int_{\text{Lower}}^{x} f(t) dt\) |
| Finding PDF \(f(x)\) | Rate of change of CDF | Differentiation \(F'(x)\) |
| Finding Mean \(E[X]\) | Weighted Integration | Integration \(\int x f(x) dx\) |
| Finding Mode | Peak of density | Differentiation \(f'(x) = 0\) |

Keep practising your integration techniques, especially involving polynomials (which are very common for PDFs in this unit!). You’ve got this!