Santiago Barreda
Assistant Professor, Department of Linguistics, UC Davis


Why is the sample variance calculated over n-1 instead of n?

The variance of a variable is the expected squared deviation of the variable about its mean value. When this parameter is estimated using the sample mean rather than the population mean, the variance tends to be underestimated. This is because the sample mean is almost always different from the population mean, and variance estimates based on the sample mean do not account for this difference.
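
As a quick numerical illustration of this underestimation (not part of the original derivation), the sketch below averages the "divide by n" variance estimate over many simulated samples. It assumes Python with numpy; the sample size n = 5 and the normal population with σ = 2 are arbitrary choices made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 0.0, 2.0, 5, 100_000

# For each simulated sample, compute the mean squared deviation about the
# sample mean (i.e., the variance estimate that divides by n).
samples = rng.normal(mu, sigma, size=(reps, n))
ss_about_xbar = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

print("average of SS/n:", (ss_about_xbar / n).mean())  # noticeably below sigma^2
print("true variance  :", sigma ** 2)
```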

Consider equation (1), where the left side of the equation is the sum of squares about the population mean μ for a variable x with n observations and sample mean x̄. This sum of squares can be decomposed into the two terms on the right side of the equation. The middle term is the sum of squared differences between each observed value and the sample mean. This value can be measured directly from a set of observations. The rightmost term is n times the squared difference between the sample mean and the population mean. This value cannot be measured directly, since the population mean is usually unknown (which is why the sample mean is being used).

$$\sum_{i=1}^{n} (x_i - \mu)^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n\,(\bar{x} - \mu)^2 \qquad \text{(1)}$$

In cases where the sample mean differs from the population mean (nearly all cases), the rightmost term in (1) will be a positive value. This means that the middle term in (1) (the sum of squares about the sample mean) must be smaller than the leftmost term in (1) (the sum of squares about the population mean) in order to satisfy the equation.
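
Because (1) is an algebraic identity, it can be checked exactly for any sample drawn from a population with a known mean. The sketch below is just such a sanity check, assuming numpy and a simulated normal sample with μ = 0; it is not part of the derivation itself.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 0.0, 2.0, 10
x = rng.normal(mu, sigma, size=n)
xbar = x.mean()

lhs = ((x - mu) ** 2).sum()                            # SS about the population mean
rhs = ((x - xbar) ** 2).sum() + n * (xbar - mu) ** 2   # decomposition in (1)
print(lhs, rhs)  # identical up to floating-point rounding
```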

The terms in equation (1) can be rearranged in order to isolate the sum of squares about the sample mean on the left hand side. This is presented in equation (2):

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (x_i - \mu)^2 - n\,(\bar{x} - \mu)^2 \qquad \text{(2)}$$

This new arrangement makes it clear that the sum of squares about the sample mean can never exceed the sum of squares about the population mean, and is strictly smaller whenever the two means differ. Furthermore, the size of the shortfall is directly related to the difference between the sample mean and the population mean.

In expectation, the middle term in (2) only needs to be divided by n in order to equal the population variance σ². Another way to look at this is that its expected value is n times the population variance: since the variance is the expected squared deviation about the population mean, the expected value of this sum of n such squared deviations is nσ². Equation (3) takes the expectation of both sides of (2) and modifies the middle term to reflect this.

$$E\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = n\,\sigma^2 - n\,E\left[(\bar{x} - \mu)^2\right] \qquad \text{(3)}$$
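
The substitution used for the middle term, that the expected value of the sum of squares about μ is nσ², can also be checked by simulation. The sketch below assumes numpy, and the particular population parameters are again arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 0.0, 2.0, 5, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
ss_about_mu = ((samples - mu) ** 2).sum(axis=1)   # SS about the population mean

print("average SS about mu:", ss_about_mu.mean())  # approximately n * sigma^2
print("n * sigma^2        :", n * sigma ** 2)
```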

The rightmost term in (3) is more easily handled if the bracketed element is considered on its own. The expected squared deviation of the sample mean about the population mean is equal to the variance divided by n (this quantity is the squared standard error of the sample mean). Equation (4) shows this change.

$$E\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = n\,\sigma^2 - n\,\frac{\sigma^2}{n} \qquad \text{(4)}$$
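
The step from (3) to (4) relies on E[(x̄ - μ)²] = σ²/n. A short simulation can illustrate this relationship; as before, numpy and the specific parameter values are assumptions made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, reps = 0.0, 2.0, 5, 100_000

# Sample mean of each simulated sample of size n.
xbars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

print("average (xbar - mu)^2:", ((xbars - mu) ** 2).mean())
print("sigma^2 / n          :", sigma ** 2 / n)   # squared standard error
```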

In the rightmost term of (4), the multiplication and division by n cancel each other out, leaving σ². At this point it is evident that the sum of squares about the sample mean involves the loss of a single degree of freedom, so that the right side of (4) equals n-1 times the variance rather than n times the variance. Both sides of the equation can then be divided by n-1:

$$\frac{E\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]}{n-1} = \frac{(n-1)\,\sigma^2}{n-1} \qquad \text{(5)}$$

The left-hand side of the equation is now the expected value of the sample variance: the sum of squares about the sample mean, divided by its n-1 degrees of freedom. The right-hand side can be simplified to show that it is equal to the population variance by eliminating the n-1 term from the numerator and the denominator.

$$E\left[\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\right] = \sigma^2 \qquad \text{(6)}$$

This shows that the sample variance, when calculated over n-1 degrees of freedom, has the same expected value as the population variance. Calculating the sample variance over n rather than n-1 instead produces an estimator whose expected value is only (n - 1) / n times the population variance, i.e., a systematic underestimate.
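
To close the loop, a final simulation (a sketch assuming numpy, with arbitrary parameter choices) compares the two divisors directly: the n-1 version averages to roughly σ², while the n version averages to roughly (n - 1)/n times σ².

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, reps = 0.0, 2.0, 5, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
# Sum of squares about each sample's own mean.
ss = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

print("mean of SS/(n-1) :", (ss / (n - 1)).mean())       # ~ sigma^2
print("mean of SS/n     :", (ss / n).mean())             # ~ (n-1)/n * sigma^2
print("(n-1)/n * sigma^2:", (n - 1) / n * sigma ** 2)
```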