# Why divide by n-1? Understanding the sample variance with pairwise differences

Most people find Statistics 101 to be pretty intuitive... until all of a sudden it isn't. One of the first challenges is understanding why we divide by $$n-1$$ instead of $$n$$ when computing the corrected sample variance.

There is an explanation that I personally like most because the proof can be broken up into a few intuitive pieces:

1. A variance can be described equivalently in terms of how much people tend to differ from the average, or in terms of how much people tend to differ from each other.
   - This means that one way to estimate variance is to look at pairwise differences between people in our sample.
2. For each individual in our sample, there are only n-1 people we can compare them to.
3. Averaging over all possible comparisons gives us the corrected sample variance (multiplied by 2).

A quick note: the $$n-1$$ result is often taught alongside the concept of degrees of freedom, which I won't go into here.

This is not just a heuristic. If we look at each point in more detail, we can prove that the corrected sample variance is unbiased...

## 1) Details: variance as an expected pairwise difference

First we'll quickly formalize the notion in Point 1. Suppose we want to know the variance of a random variable $$Y$$ with expectation denoted by $$\mu=\mathbb{E}(Y)$$. Let $$Y_1$$ and $$Y_2$$ denote two independent random variables with the same distribution as $$Y$$, e.g., corresponding to two people in the same population. Consider how different we expect them to be from each other, in terms of squared differences.

\begin{align} &\mathbb{E}\left[(Y_1 - Y_2)^2\right] \nonumber \\ &= \mathbb{E}\left[(Y_1 - \mu + \mu - Y_2)^2\right] \nonumber \\ &= \mathbb{E}\left[(Y_1 - \mu )^2\right] - 2\mathbb{E}\left[Y_1 - \mu\right]\mathbb{E}\left[Y_2 - \mu\right] + \mathbb{E}\left[(Y_2 - \mu)^2\right] \nonumber \\ &= 2\mathbb{E}\left[(Y - \mu )^2\right] - 0 \nonumber \\ &= 2\text{Var}(Y) \label{EY} \end{align}

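Above, the second equality uses the independence of $$Y_1$$ and $$Y_2$$ to factor the expectation of the cross term. The third equality comes from the fact that $$Y_1$$ and $$Y_2$$ have the same distribution as $$Y$$, so that $$\mathbb{E}\left[Y_1 - \mu\right]=\mathbb{E}\left[Y_2 - \mu\right]=0$$.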

In words, we just showed that the expected squared difference between two observations is equal to twice the variance.
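To make this concrete, here is a minimal sketch that checks the result exactly for one small distribution. The fair six-sided die is my own choice of example (it is not part of the argument above), and the variable names are just for illustration. Using `Fraction` keeps the arithmetic exact, so no simulation noise is involved.

```python
from fractions import Fraction
from itertools import product

# Assumed example distribution: a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
prob = Fraction(1, len(values))  # each face is equally likely

# Population mean and variance, computed exactly.
mu = sum(prob * v for v in values)
var = sum(prob * (v - mu) ** 2 for v in values)

# Expected squared difference between two independent draws Y_1 and Y_2:
# sum over all (y1, y2) combinations, each with probability prob * prob.
expected_sq_diff = sum(
    prob * prob * (y1 - y2) ** 2 for y1, y2 in product(values, repeat=2)
)

print(var)               # 35/12
print(expected_sq_diff)  # 35/6
```

For this die, the expected squared difference (35/6) is exactly twice the variance (35/12), as Eq $$\ref{EY}$$ says it should be.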

## 2) Details: for each individual in the sample, there are only n-1 comparisons available

Now we can estimate the variance of $$Y$$ by looking at how much pairs of observations in our sample differ. Specifically, given a sample $$y_1,\dots,y_n$$, we can look at all pairs $$(y_i,y_j)$$ with $$i\neq j$$ and their associated squared differences $$(y_i-y_j)^2$$. Each of the $$n$$ observations can be paired with $$n-1$$ others, so there are $$n(n-1)$$ such (ordered) pairs in total, and averaging over all of them gives

$$\frac{1}{n(n-1)} \sum_{i=1}^n\sum_{j\neq i}(y_i - y_j)^2.\label{goal}$$

This is the $$n-1$$ we were looking for! We already know that each term in this double summation is unbiased for $$2Var(Y)$$ (from our steps in Section 1), so we also know that the average (Eq $$\ref{goal}$$ itself) is unbiased for $$2Var(Y)$$.
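The unbiasedness claim can also be checked by brute force on a small case. The sketch below again assumes a fair six-sided die and a sample size of 3 (both are illustrative choices of mine), enumerates every possible sample, and averages Eq $$\ref{goal}$$ over all of them. The function name `mean_sq_pairwise_diff` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def mean_sq_pairwise_diff(sample):
    """Average of (y_i - y_j)^2 over all n(n-1) ordered pairs with i != j."""
    n = len(sample)
    total = sum(
        (sample[i] - sample[j]) ** 2
        for i in range(n)
        for j in range(n)
        if i != j
    )
    return Fraction(total, n * (n - 1))


# Assumed setup: independent draws from a fair six-sided die, sample size 3.
values = [1, 2, 3, 4, 5, 6]
n = 3

# Every possible sample of size n; all 6**3 of them are equally likely.
samples = list(product(values, repeat=n))

# Exact expectation of the pairwise average over all possible samples.
expected_value = sum(mean_sq_pairwise_diff(s) for s in samples) / len(samples)

print(expected_value)  # 35/6
```

The variance of a fair die is 35/12, so the expectation of the pairwise average comes out to exactly twice the variance in this example.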

## 3) Details: the average over all possible comparisons (Eq $$\ref{goal}$$) is equal to twice the corrected sample variance

Let $$\bar{y}=\frac{1}{n}\sum_{i=1}^{n} y_i$$ denote the sample average. All that remains is to show that Eq $$\ref{goal}$$ is equal to twice the corrected sample variance: $$\frac{2}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y} \right)^2$$.

To simplify the double summation in Eq $$\ref{goal}$$, it will be helpful to note that $$(y_i-y_j)^2=0$$ when $$i=j$$. This means that

$$\sum_{i=1}^n\sum_{j\neq i}(y_i - y_j)^2 = \sum_{i=1}^n\sum_{j=1}^n(y_i - y_j)^2.\label{zero-ij}$$

Putting all of this together, we can apply steps similar to those in Eq $$\ref{EY}$$ to reduce the average of all squared pairwise differences (Eq $$\ref{goal}$$). All we have to do is replace population expectations with in-sample averages throughout.

\begin{align*} &\frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j=1}^{n}\left[(y_i - y_j)^2\right] \\ &= \frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j=1}^{n}\left[\left(y_i - \bar{y} + \bar{y} - y_j\right)^2\right] \\ &= \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y} \right)^2 - \frac{2}{n(n-1)}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\sum_{j=1}^{n}\left(y_j - \bar{y}\right) + \frac{1}{n-1}\sum_{j=1}^{n}\left(y_j - \bar{y} \right)^2 \\ &= \frac{2}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y} \right)^2 - 0 \\ &= 2\widehat{Var}(Y). \end{align*}

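Here, $$\widehat{Var}(Y)=\frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y} \right)^2$$ is the corrected sample variance. The middle term vanishes because deviations from the sample average sum to zero: $$\sum_{i=1}^{n}\left(y_i - \bar{y}\right)=0$$.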

That's it! Twice the corrected sample variance turns out to be equal to the average over all squared pairwise differences (Eq $$\ref{goal}$$)! We already said that Eq $$\ref{goal}$$ is unbiased for 2 times the population variance, which means $$\widehat{Var}(Y)$$ (with the $$n-1$$ denominator) is unbiased for $$Var(Y)$$.
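As a sanity check on the algebra in Section 3, here is a small sketch that compares the pairwise average against Python's built-in `statistics.variance`, which uses the $$n-1$$ denominator. The five sample values are arbitrary numbers I picked for illustration, stored as `Fraction` so the comparison is exact.

```python
from fractions import Fraction
from statistics import variance

# Arbitrary illustrative sample; Fractions keep all arithmetic exact.
sample = [Fraction(x) for x in (2, 3, 5, 7, 11)]
n = len(sample)

# Sum of squared differences over ordered pairs with i != j ...
sum_excluding_diagonal = sum(
    (sample[i] - sample[j]) ** 2 for i in range(n) for j in range(n) if i != j
)
# ... and over all pairs, including i == j (those terms are all zero).
sum_including_diagonal = sum((yi - yj) ** 2 for yi in sample for yj in sample)

# Average over the n(n-1) ordered pairs.
pairwise_average = sum_excluding_diagonal / (n * (n - 1))

# statistics.variance is the corrected sample variance (divides by n - 1).
corrected_variance = variance(sample)

print(sum_excluding_diagonal == sum_including_diagonal)  # True
print(pairwise_average == 2 * corrected_variance)        # True
```

Swapping in any other sample of at least two values should give the same two `True` results, since the identity holds for every sample and not just on average.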

[Note: this post was updated on 2022-08-01 to simplify equations]