Why divide by n-1? Understanding the sample variance with pairwise differences

Most people find Statistics 101 to be pretty intuitive... until all of a sudden it isn't. One of the first challenges is why we divide by \(n-1\) instead of \(n\) when computing the corrected sample variance.

There is one explanation that I personally like best, because the proof can be broken up into a few intuitive pieces:

  1. A variance can be described equivalently in terms of how much people tend to differ from the average, or in terms of how much people tend to differ from each other.
    • This means that one way to estimate variance is to look at pairwise differences between people in our sample.
  2. For each individual in our sample, there are only \(n-1\) people we can compare them to.
  3. Averaging over all possible comparisons gives us the corrected sample variance (multiplied by 2).

A quick note: the \(n-1\) result is often taught in conjunction with the concept of degrees of freedom, which I won't go into here.

This is not just a heuristic. If we look at each point in more detail, we can prove that the corrected sample variance is unbiased...
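Before the proof, a quick simulation can make the claim concrete. Here's a minimal sketch in Python (assuming numpy; the distribution, sample size, and repetition count are arbitrary choices for illustration): across many repeated samples, dividing by \(n-1\) is right on average, while dividing by \(n\) is biased low.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_var = 5, 4.0  # a small n makes the bias easy to see
reps = 200_000

# Draw many samples of size n from a distribution with variance 4
samples = rng.normal(0.0, np.sqrt(true_var), size=(reps, n))

var_n1 = samples.var(axis=1, ddof=1)  # divide by n-1 (corrected)
var_n = samples.var(axis=1, ddof=0)   # divide by n (uncorrected)

print(var_n1.mean())  # ~4.0: unbiased
print(var_n.mean())   # ~3.2 = (n-1)/n * 4.0: biased downward
```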

1) Details: variance as an expected pairwise difference

First we'll quickly formalize the notion in Point 1. Suppose we want to know the variance of a random variable \(Y\) with expectation denoted by \(\mu=\mathbb{E}(Y)\). Let \(Y_1\) and \(Y_2\) denote two independent random variables with the same distribution as \(Y\), e.g., corresponding to two people in the same population. Consider how different we expect them to be from each other, in terms of squared differences.

\begin{align} &\mathbb{E}\left[(Y_1 - Y_2)^2\right] \nonumber \\ &= \mathbb{E}\left[(Y_1 - \mu + \mu - Y_2)^2\right] \nonumber \\ &= \mathbb{E}\left[(Y_1 - \mu )^2\right] - 2\mathbb{E}\left[Y_1 - \mu\right]\mathbb{E}\left[Y_2 - \mu\right] + \mathbb{E}\left[(Y_2 - \mu)^2\right] \nonumber \\ &= 2\mathbb{E}\left[(Y - \mu )^2\right] - 0 \nonumber \\ &= 2\text{Var}(Y) \label{EY} \end{align}

Above, the second equality expands the square and uses the independence of \(Y_1\) and \(Y_2\) to factor the expectation of the cross term into a product of expectations. The third equality uses the fact that \(Y_1\) and \(Y_2\) each have the same distribution as \(Y\): the squared terms both equal \(\mathbb{E}\left[(Y - \mu)^2\right]\), and \(\mathbb{E}\left[Y_i - \mu\right] = \mathbb{E}\left[Y - \mu\right] = 0\), which is the \(-\,0\) in the second-to-last line.

In words, we just showed that the expected squared difference between two observations is equal to twice the variance.
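This identity is easy to sanity-check by simulation. A minimal sketch, assuming numpy and an arbitrary choice of distribution for \(Y\):

```python
import numpy as np

rng = np.random.default_rng(0)
reps = 1_000_000

# Two independent draws from the same distribution as Y.
# Here Y ~ Normal(0, 2), so Var(Y) = 4.
y1 = rng.normal(0.0, 2.0, size=reps)
y2 = rng.normal(0.0, 2.0, size=reps)

# The average squared difference should approach 2 * Var(Y) = 8
print(np.mean((y1 - y2) ** 2))  # ~8.0
```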

2) Details: for each individual in the sample, there are only \(n-1\) comparisons available

Now we can estimate the variance of \(Y\) by looking at how much pairs of observations in our sample differ. Specifically, given a sample \(y_1,\dots,y_n,\) we can look at all pairs \((y_i,y_j)\) and their associated squared differences \((y_i-y_j)^2\). There are \(n(n-1)\) such (ordered) pairs in total, and averaging over all of them gives

\begin{equation} \frac{1}{n(n-1)} \sum_{i=1}^n\sum_{j\neq i}(y_i - y_j)^2.\label{goal} \end{equation}

This is the \(n-1\) we were looking for! We already know that each term in this double summation is unbiased for \(2\text{Var}(Y)\) (from our steps in Section 1), so we also know that the average (Eq \(\ref{goal}\) itself) is unbiased for \(2\text{Var}(Y)\).
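Here's what that estimator looks like in code. The helper name `pairwise_estimate` and the example data are made up for illustration; the double loop mirrors the double summation in Eq \(\ref{goal}\):

```python
import numpy as np

def pairwise_estimate(y):
    """Average of (y_i - y_j)^2 over all n(n-1) ordered pairs with i != j.

    This is Eq (goal) in the text; each term is unbiased for 2 * Var(Y),
    so the average is too."""
    n = len(y)
    total = sum((y[i] - y[j]) ** 2
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

y = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(pairwise_estimate(y))  # ~9.14
print(2 * y.var(ddof=1))     # same value, as Section 3 will show
```

(The \(O(n^2)\) loop is purely for clarity; Section 3 shows it collapses to the usual one-pass formula.)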

3) Details: the average over all possible comparisons (Eq \(\ref{goal}\)) is equal to twice the corrected sample variance

Let \(\bar{y}=\frac{1}{n}\sum_{i=1}^{n} y_i\) denote the sample average. All that remains is to show that Eq \(\ref{goal}\) is equal to twice the corrected sample variance: \(\frac{2}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y} \right)^2\).

To simplify the double summation in Eq \(\ref{goal}\), it will be helpful to note that \((y_i-y_j)^2=0\) when \(i=j\). This means that

\begin{equation} \sum_{i=1}^n\sum_{j\neq i}(y_i - y_j)^2 = \sum_{i=1}^n\sum_{j=1}^n(y_i - y_j)^2.\label{zero-ij} \end{equation}

Putting all of this together, we can apply the same steps as in Eq \(\ref{EY}\) to simplify the average of all pairwise squared differences (Eq \(\ref{goal}\)). All we have to do is replace population expectations with in-sample averages throughout.

\begin{align*} &\frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j=1}^{n}\left[(y_i - y_j)^2\right] \\ &= \frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j=1}^{n}\left[\left(y_i - \bar{y} + \bar{y} - y_j\right)^2\right] \\ &= \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y} \right)^2 - \frac{2}{n(n-1)}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\sum_{j=1}^{n}\left(y_j - \bar{y}\right) + \frac{1}{n-1}\sum_{j=1}^{n}\left(y_j - \bar{y} \right)^2 \\ &= \frac{2}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y} \right)^2 - 0 \\ &= 2\widehat{\text{Var}}(Y). \end{align*}

Above, \(\widehat{\text{Var}}(Y)\) denotes the corrected sample variance. The cross term vanishes because deviations from the sample mean always sum to zero: \(\sum_{i=1}^{n}\left(y_i - \bar{y}\right) = 0\).

That's it! The corrected sample variance turns out to be equal to the average over all pairwise squared differences (Eq \(\ref{goal}\))! We already showed that Eq \(\ref{goal}\) is unbiased for twice the population variance, which means \(\widehat{\text{Var}}(Y)\) (with the \(n-1\) denominator) is unbiased for \(\text{Var}(Y)\).
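The equality itself is also easy to verify numerically, since it holds exactly on every sample (not just on average). A minimal sketch, assuming numpy; note that the sum runs over all \(n^2\) pairs, using the observation from Eq \(\ref{zero-ij}\) that the \(i=j\) terms contribute nothing:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=20)
n = len(y)

# Eq (goal), summed over all n^2 pairs: the i = j terms are zero,
# which is exactly the observation in Eq (zero-ij)
pairwise = sum((yi - yj) ** 2 for yi in y for yj in y) / (n * (n - 1))

corrected = 2 * y.var(ddof=1)  # twice the corrected sample variance

print(np.isclose(pairwise, corrected))  # True: the identity is exact
```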

[Note: this post was updated on 2022-08-01 to simplify equations]