Why divide by n-1? Understanding the sample variance with pairwise differences

Most people find Statistics 101 to be pretty intuitive... until all of a sudden it isn't. One of the first challenges is understanding why we divide by \(n-1\) instead of \(n\) when computing the corrected sample variance.

There is one explanation that I personally like best, because the proof can be broken up into a few intuitive pieces:

  1. A variance can be described equivalently in terms of how much people tend to differ from the average, or in terms of how much people tend to differ from each other.
    • This means that one way to estimate variance is to look at pairwise differences between people in our sample.
  2. For each individual in our sample, there are only \(n-1\) other people we can compare them to.
  3. Averaging over all possible comparisons gives us the corrected sample variance (multiplied by 2).

A quick note: the \(n-1\) result is often taught in conjunction with the concept of degrees of freedom, which I won't go into here.

This is not just a heuristic. If we look at each point in more detail, we can prove that the corrected sample variance is an unbiased estimator of the population variance...
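Before diving in, it's easy to see the punchline empirically. Here is a minimal NumPy sketch (the normal distribution, sample size, and repetition count are arbitrary choices for illustration) comparing the divide-by-\(n\) and divide-by-\(n-1\) estimators:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 200_000  # small n makes the n vs. n-1 gap easy to see

# Draw many samples from a distribution whose true variance we know:
# Y ~ Normal(0, 2), so Var(Y) = 4
samples = rng.normal(loc=0.0, scale=2.0, size=(reps, n))

biased = samples.var(axis=1, ddof=0)     # divide by n
corrected = samples.var(axis=1, ddof=1)  # divide by n - 1

print(biased.mean())     # ≈ 3.2, i.e. (n-1)/n * 4: biased low
print(corrected.mean())  # ≈ 4.0: unbiased for Var(Y)
```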

1) Details: variance as an expected pairwise difference

First we'll quickly formalize the notion in Point 1. Suppose we want to know the variance of a random variable \(Y\). Let \(Y_1\) and \(Y_2\) denote two independent random variables with the same distribution as \(Y\), e.g., corresponding to two people in the same population. Consider how different we expect them to be from each other, in terms of squared differences.

\begin{align*} \mathbb{E}((Y_1 - Y_2)^2) &= \mathbb{E}(Y_1^2) - 2\mathbb{E}(Y_1)\mathbb{E}(Y_2) + \mathbb{E}(Y_2^2)\\ &= 2\mathbb{E}(Y^2) - 2(\mathbb{E}Y)^2\\ &= 2\text{Var}(Y)\\ \end{align*}

Above, the first line comes from linearity of expectation and the independence of \(Y_1\) and \(Y_2\). The second line comes from the fact that \(Y_1\) and \(Y_2\) have the same distribution as \(Y\). The third line comes from the standard expansion \(\text{Var}(Y) = \mathbb{E}(Y^2) - (\mathbb{E}Y)^2\).

In words, we just showed that the expected squared difference between two observations is equal to twice the variance.
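This identity is also easy to check by simulation. Below is a minimal sketch, using an exponential distribution as an arbitrary example (for scale \(\theta\), \(\text{Var}(Y)=\theta^2\)):

```python
import numpy as np

rng = np.random.default_rng(1)
reps = 500_000

# Two independent draws per repetition, same distribution as Y
theta = 3.0  # Var(Y) = theta**2 = 9 for an exponential with scale theta
y1 = rng.exponential(scale=theta, size=reps)
y2 = rng.exponential(scale=theta, size=reps)

print(np.mean((y1 - y2) ** 2))  # ≈ 18.0
print(2 * theta**2)             # 2 * Var(Y) = 18.0
```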

2) Details: for each individual in the sample, there are only \(n-1\) comparisons available

Now we can estimate the variance of \(Y\) by looking at how much pairs of observations in our sample differ. Specifically, given a sample \(y_1,\dots,y_n,\) we can look at all pairs \((y_i,y_j)\) and their associated squared differences \((y_i-y_j)^2\). There are \(n(n-1)\) such (ordered) pairs in total, and averaging over all of them gives

\begin{equation} \frac{1}{n(n-1)} \sum_{i=1}^n\sum_{j\neq i}(y_i - y_j)^2.\label{goal} \end{equation}

This is the \(n-1\) we were looking for! We already know that each term in this double summation is unbiased for \(2\text{Var}(Y)\) (from our steps in Section 1), so the average over all of them (Eq \(\ref{goal}\) itself) is also unbiased for \(2\text{Var}(Y)\).
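Here is a minimal sketch of that claim, averaging the pairwise estimator over many simulated samples (the standard normal distribution, the sample size, and the helper name `pairwise_average` are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 6, 100_000

def pairwise_average(y):
    """Average of (y_i - y_j)^2 over all ordered pairs with i != j."""
    diffs = y[:, None] - y[None, :]  # all n*n differences
    # The i == j terms are zero, so summing over everything is harmless
    return (diffs ** 2).sum() / (len(y) * (len(y) - 1))

# Y ~ Normal(0, 1), so 2 * Var(Y) = 2
estimates = [pairwise_average(rng.standard_normal(n)) for _ in range(reps)]
print(np.mean(estimates))  # ≈ 2.0
```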

3) Details: the average over all possible comparisons (Eq \(\ref{goal}\)) is equal to twice the corrected sample variance

To simplify the double summation in Eq \(\ref{goal}\), it will be helpful to note that \((y_i-y_j)^2=0\) when \(i=j\). This means that

\begin{equation} \sum_{i=1}^n\sum_{j\neq i}(y_i - y_j)^2 = \sum_{i=1}^n\sum_{j=1}^n(y_i - y_j)^2.\label{zero-ij} \end{equation}

It will also be helpful to note that \(\sum_{i=1}^ny_i=n\bar{y}\), where \(\bar{y}\) is the sample average, and that

\begin{equation} \sum_{i=1}^ny_i\bar{y}=n\bar{y}^2 =\sum_{i=1}^n\bar{y}^2. \label{ybar2} \end{equation}

Putting all of this together, we can apply similar steps as in Section 1 to reduce the average of all pairwise squared differences (Eq \(\ref{goal}\)) to twice the corrected sample variance.

\begin{align*} \frac{1}{ n(n-1)}\sum_{i=1}^n\sum_{j\neq i}(y_i - y_j)^2 &=\frac{1}{ n(n-1)} \sum_{i=1}^n\sum_{j=1}^n(y_i - y_j)^2 & \text{from Eq }\ref{zero-ij}&\\ &= \frac{1}{ n(n-1)} \left\{\sum_{i=1}^n\sum_{j=1}^ny_i^2 -2 \sum_{i=1}^ny_i \sum_{j=1}^ny_j + \sum_{i=1}^n\sum_{j=1}^n y_j^2 \right\} & &\\ &= \frac{1}{ n(n-1)} \left\{n\sum_{i=1}^n y_i^2 -2 \sum_{i=1}^ny_i (n\bar{y}) + n\sum_{j=1}^n y_j^2 \right\} & &\\ &= \frac{2}{ (n-1)} \left\{\sum_{i=1}^n y_i^2 - \sum_{i=1}^ny_i\bar{y} \right\} & &\\ &= \frac{2}{ (n-1)} \left\{\sum_{i=1}^n y_i^2 - 2\sum_{i=1}^ny_i\bar{y} + \sum_{i=1}^ny_i\bar{y} \right\} & &\\ &= \frac{2}{ (n-1)} \left\{\sum_{i=1}^n y_i^2 - 2\sum_{i=1}^ny_i\bar{y} + \sum_{i=1}^n\bar{y}^2 \right\} & \text{from Eq }\ref{ybar2}&\\ &= \frac{2}{ (n-1)} \sum_{i=1}^n (y_i - \bar{y})^2 & &\\ &= 2\widehat{Var}(Y). & & \end{align*}

Here, \(\widehat{Var}(Y) = \frac{1}{n-1}\sum_{i=1}^n(y_i - \bar{y})^2\) is the corrected sample variance.
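Note that this last identity is exact for any fixed sample, not just in expectation. A minimal numerical check (NumPy's `ddof=1` option computes exactly the corrected sample variance):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.standard_normal(10)
n = len(y)

pairwise = ((y[:, None] - y[None, :]) ** 2).sum() / (n * (n - 1))
corrected = y.var(ddof=1)  # sum((y - y.mean())**2) / (n - 1)

print(np.isclose(pairwise, 2 * corrected))  # True
```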

That's it! The corrected sample variance turns out to be exactly half the average over all pairwise squared differences (Eq \(\ref{goal}\))! We already showed that Eq \(\ref{goal}\) is unbiased for twice the population variance, which means \(\widehat{Var}(Y)\) (with its \(n-1\) denominator) is unbiased for \(\text{Var}(Y)\).