# Why divide by n-1? Understanding the sample variance with pairwise differences

Most people find Statistics 101 to be pretty intuitive... until all of a sudden it isn't. One of the first challenges is why we divide by $$n-1$$ instead of $$n$$ when computing the corrected sample variance.

There is an explanation that I personally like most, because the proof can be broken up into a few intuitive pieces:

1. A variance can be described equivalently in terms of how much people tend to differ from the average, or in terms of how much people tend to differ from each other.
   - This means that one way to estimate the variance is to look at pairwise differences between people in our sample.
2. For each individual in our sample, there are only $$n-1$$ other people we can compare them to.
3. Averaging over all possible comparisons gives us the corrected sample variance (multiplied by 2).

A quick note: the $$n-1$$ result is often taught alongside the concept of degrees of freedom, which I won't go into here.

This is not just a heuristic. If we look at each point in more detail, we can prove that the corrected sample variance is unbiased...

## 1) Details: variance as an expected pairwise difference

First we'll quickly formalize the notion in Point 1. Suppose we want to know the variance of a random variable $$Y$$. Let $$Y_1$$ and $$Y_2$$ denote two independent random variables with the same distribution as $$Y$$, e.g., corresponding to two people in the same population. Consider how different we expect them to be from each other, in terms of squared differences.

\begin{align*} \mathbb{E}((Y_1 - Y_2)^2) &= \mathbb{E}(Y_1^2) - 2\mathbb{E}(Y_1)\mathbb{E}(Y_2) + \mathbb{E}(Y_2^2)\\ &= 2\mathbb{E}(Y^2) - 2(\mathbb{E}Y)^2\\ &= 2\text{Var}(Y)\\ \end{align*}

Above, the first line comes from linearity of expectations and the independence of $$Y_1$$ and $$Y_2$$. The second line comes from the fact that $$Y_1$$ and $$Y_2$$ have the same distribution as $$Y$$. The third line comes from the identity $$\text{Var}(Y) = \mathbb{E}(Y^2) - (\mathbb{E}Y)^2$$.

In words, we just showed that the expected squared difference between two observations is equal to twice the variance.
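To make this concrete, here is a small simulation (a NumPy sketch of my own, with an arbitrarily chosen exponential distribution) checking that the expected squared difference between two independent draws is about twice the variance:

```python
import numpy as np

# Monte Carlo check that E[(Y1 - Y2)^2] = 2 * Var(Y).
# Y is (arbitrarily) exponential with scale 1, so Var(Y) = 1.
rng = np.random.default_rng(0)
n_sims = 1_000_000
y1 = rng.exponential(scale=1.0, size=n_sims)  # draws of Y1
y2 = rng.exponential(scale=1.0, size=n_sims)  # independent draws of Y2

mean_sq_diff = np.mean((y1 - y2) ** 2)  # estimates E[(Y1 - Y2)^2]
print(mean_sq_diff)  # should land near 2 * Var(Y) = 2
```

Any distribution with a finite variance works here; the exponential is just a convenient choice with a known variance.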

## 2) Details: for each individual in the sample, there are only n-1 comparisons available

Now we can estimate the variance of $$Y$$ by looking at how much pairs of observations in our sample differ. Specifically, given a sample $$y_1,\dots,y_n$$, we can look at all pairs $$(y_i,y_j)$$ and their associated squared differences $$(y_i-y_j)^2$$. There are $$n(n-1)$$ such (ordered) pairs in total, and averaging over all of them gives

\begin{equation} \frac{1}{n(n-1)} \sum_{i=1}^n\sum_{j\neq i}(y_i - y_j)^2.\label{goal} \end{equation}

This is the $$n-1$$ we were looking for! We already know that each term in this double summation is unbiased for $$2\text{Var}(Y)$$ (from our steps in Section 1), so we also know that the average (Eq $$\ref{goal}$$ itself) is unbiased for $$2\text{Var}(Y)$$.
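As a quick sanity check (a NumPy sketch, with the sample size and distribution chosen arbitrarily), we can compute this pairwise estimator on many simulated samples and see that its average lands near twice the true variance:

```python
import numpy as np

# Check that (1 / (n(n-1))) * sum_{i != j} (y_i - y_j)^2
# is unbiased for 2 * Var(Y), by averaging it over many samples.
rng = np.random.default_rng(1)
n = 5                # deliberately small sample size
n_sims = 100_000     # number of simulated samples
# Y is standard normal, so Var(Y) = 1.

y = rng.standard_normal((n_sims, n))
# All n^2 squared differences within each sample, via broadcasting;
# the i == j terms are zero, so summing over everything is harmless.
diffs = (y[:, :, None] - y[:, None, :]) ** 2
estimates = diffs.sum(axis=(1, 2)) / (n * (n - 1))

print(estimates.mean())  # should land near 2 * Var(Y) = 2
```

Note that no single sample's estimate needs to be close to 2; unbiasedness only says the estimates are correct on average.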

## 3) Details: the average over all possible comparisons (Eq $$\ref{goal}$$) is equal to twice the corrected sample variance

To simplify the double summation in Eq $$\ref{goal}$$, it will be helpful to note that $$(y_i-y_j)^2=0$$ when $$i=j$$. This means that

\begin{equation} \sum_{i=1}^n\sum_{j\neq i}(y_i - y_j)^2 = \sum_{i=1}^n\sum_{j=1}^n(y_i - y_j)^2.\label{zero-ij} \end{equation}

It will also be helpful to note that $$\sum_{i=1}^ny_i=n\bar{y}$$, where $$\bar{y}$$ is the sample average, and that

\begin{equation} \sum_{i=1}^ny_i\bar{y}=n\bar{y}^2 =\sum_{i=1}^n\bar{y}^2. \label{ybar2} \end{equation}

Putting all of this together, we can apply similar steps as we did in Section 1 to reduce the average of all pairwise differences (Eq $$\ref{goal}$$) into twice the corrected sample variance.

\begin{align*} \frac{1}{ n(n-1)}\sum_{i=1}^n\sum_{j\neq i}(y_i - y_j)^2 &=\frac{1}{ n(n-1)} \sum_{i=1}^n\sum_{j=1}^n(y_i - y_j)^2 & \text{from Eq }\ref{zero-ij}&\\ &= \frac{1}{ n(n-1)} \left\{\sum_{i=1}^n\sum_{j=1}^ny_i^2 -2 \sum_{i=1}^ny_i \sum_{j=1}^ny_j + \sum_{i=1}^n\sum_{j=1}^n y_j^2 \right\} & &\\ &= \frac{1}{ n(n-1)} \left\{n\sum_{i=1}^n y_i^2 -2 \sum_{i=1}^ny_i (n\bar{y}) + n\sum_{j=1}^n y_j^2 \right\} & &\\ &= \frac{2}{ (n-1)} \left\{\sum_{i=1}^n y_i^2 - \sum_{i=1}^ny_i\bar{y} \right\} & &\\ &= \frac{2}{ (n-1)} \left\{\sum_{i=1}^n y_i^2 - 2\sum_{i=1}^ny_i\bar{y} + \sum_{i=1}^ny_i\bar{y} \right\} & &\\ &= \frac{2}{ (n-1)} \left\{\sum_{i=1}^n y_i^2 - 2\sum_{i=1}^ny_i\bar{y} + \sum_{i=1}^n\bar{y}^2 \right\} & \text{from Eq }\ref{ybar2}&\\ &= \frac{2}{ (n-1)} \sum_{i=1}^n (y_i - \bar{y})^2 & &\\ &= 2\widehat{Var}(Y). & & \end{align*}

Above, $$\widehat{Var}(Y) = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2$$ is the corrected sample variance.
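Since the derivation above is purely algebraic, the equality holds exactly for any particular sample, not just in expectation. A short NumPy sketch (with an arbitrary sample) confirms this:

```python
import numpy as np

# The pairwise average equals exactly twice the corrected sample
# variance, for any sample, up to floating-point rounding.
rng = np.random.default_rng(2)
y = rng.uniform(size=10)
n = len(y)

# Average of all n(n-1) ordered pairwise squared differences
# (the i == j terms are zero, so the full n x n sum is the same).
pairwise_avg = ((y[:, None] - y[None, :]) ** 2).sum() / (n * (n - 1))
corrected_var = np.var(y, ddof=1)  # ddof=1 gives the n-1 denominator

print(pairwise_avg, 2 * corrected_var)  # equal up to rounding
```

NumPy's `ddof=1` argument is exactly the $$n-1$$ correction discussed in this post; the default `ddof=0` divides by $$n$$ instead.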

That's it! Twice the corrected sample variance turns out to be exactly the average over all pairwise squared differences (Eq $$\ref{goal}$$)! We already said that Eq $$\ref{goal}$$ is unbiased for twice the population variance, which means $$\widehat{Var}(Y)$$ (with its $$n-1$$ denominator) is unbiased for $$\text{Var}(Y)$$.