Mystery of n-1 (Part 1)

Mystery of n-1 (Part 1: Introduction and Observations)


Transcript

One of the mysteries of statistics is why this popular formula for estimating population variance has an (n-1) in the denominator.

We will address this mystery from 3 perspectives:

First we will show that using an n in the denominator gives a wrong answer.

Then we will explain why it gives a wrong answer.

And finally, we will figure out the correct formula.

(0:31/4:33)

For purposes of this demonstration, our population will be all of the Smarties in this box. We weighed each Smartie and then calculated the population mean and variance. The ideas we present here apply even though the histogram of the Smartie weights is neither Gaussian nor symmetric.

Suppose that we selected three Smarties at random and calculated the mean and variance of that sample.

(0:58/4:33)

That’s a lot of work, so let’s use our Sampling Spreadsheet to take samples electronically. On the left you see a list of the weight of each Smartie, and the software randomly selects a sample containing as many Smarties as we specify. For example, if I choose three as the sample size and click on this button, it takes a sample of three Smarties from the list on the left. If we take several samples, we see that there is quite a bit of variability in them. Maybe we should take a whole bunch of 3-Smartie samples, say 100. Let’s also take samples of different sizes. It’s difficult to see the exact spread in the values associated with each sample size, but if we click this button, the spreadsheet will show the mean value for each specific sample size with a red dot.
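If you would like to try the sampling step without the spreadsheet, here is a minimal Python sketch of the same idea. The weights below are made-up numbers, not the real Smartie data, and the function name `take_samples` is our own. We also assume that each Smartie is drawn with replacement, which may or may not be what the spreadsheet does.

```python
import random
import statistics


def take_samples(population, sample_size, num_samples=100, seed=None):
    """Return a list of (mean, variance) pairs, one per random sample.

    The variance here divides by n, the sample size (statistics.pvariance).
    """
    rng = random.Random(seed)
    results = []
    for _ in range(num_samples):
        # Assumption: draw with replacement so that every draw is independent.
        sample = rng.choices(population, k=sample_size)
        results.append((statistics.fmean(sample), statistics.pvariance(sample)))
    return results


# Hypothetical weights in grams, standing in for the list on the spreadsheet.
weights = [0.89, 0.93, 0.95, 0.97, 0.98, 1.00, 1.01, 1.02, 1.04, 1.05, 1.07, 1.10]

# Take five samples of three items each and print the mean and variance of each.
for mean, var in take_samples(weights, sample_size=3, num_samples=5, seed=1):
    print(f"mean = {mean:.3f}   variance (n denominator) = {var:.5f}")
```

Changing `sample_size` and `num_samples` mimics choosing a different sample size and taking a whole bunch of samples.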

(1:53/4:33)

As you can see, the sample mean agrees fairly well with the population mean on average. The spread in the calculated means decreases as the sample size is made larger, a result consistent with the expected spread of the sample mean as given by the formula: sigma divided by root n.

In fact, if I click this button, the spreadsheet will plot that bound for me and it agrees quite nicely with our experiment. As you can see, the sample mean agrees well with the population mean shown by a red line, and we say that it provides an unbiased estimate of the population mean.
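The sigma divided by root n result can also be checked with a short simulation. This is only a sketch under our own assumptions: the population is a made-up list of weights, the helper `spread_of_sample_means` is a name we chose, and samples are drawn with replacement so that the formula applies without a finite-population adjustment.

```python
import math
import random
import statistics


def spread_of_sample_means(population, sample_size, num_samples, rng):
    """Standard deviation of the means of num_samples random samples."""
    means = [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(num_samples)
    ]
    return statistics.pstdev(means)


# Hypothetical population of weights in grams (not the real Smartie data).
population = [0.89, 0.93, 0.95, 0.97, 0.98, 1.00, 1.01, 1.02, 1.04, 1.05, 1.07, 1.10]
sigma = statistics.pstdev(population)  # population standard deviation

rng = random.Random(0)
for n in (2, 5, 10, 20):
    observed = spread_of_sample_means(population, n, 5000, rng)
    predicted = sigma / math.sqrt(n)
    print(f"n = {n:2d}   observed = {observed:.4f}   sigma/sqrt(n) = {predicted:.4f}")
```

The observed and predicted columns should track each other closely, and both shrink as the sample size grows.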

If we look at the variance graph we see that the spread in sample variances also decreases as the sample size is made larger. However, its mean also changes, and it gives values that are clearly too small when the sample size is small. If we plot standard deviations instead of variances, we get the same puzzling result.

(2:55/4:33)

So, from this simple experiment, we learn that the sample variance, given by this formula, is not a good estimate of the population variance, given by this formula.
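The two formulas appear only on screen in the video. Assuming the standard definitions, with N items in the population, n items in the sample, population mean mu and sample mean x-bar, they would read as follows. The subscript n on the sample variance is our own notation to mark the n denominator.

```latex
% Sample variance with n in the denominator
s_n^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2

% Population variance
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
```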

As the red dots show, it gives values that are on average smaller than the population variance, especially if small samples are taken. In the language of statistics, we say that the sample variance provides a “biased” estimate of the population variance. That’s not good.

(3:25/4:33)

If we push this button on the spreadsheet, it shows us what would happen if we used a denominator of n-1 instead of n, and we see that we immediately get a better answer.
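Here is a rough Python sketch of the same comparison, for anyone who wants to see the numbers. It rests on our own assumptions: a made-up population of weights, sampling with replacement, and a helper called `average_sample_variances` that we named ourselves. It uses `statistics.pvariance` for the n denominator and `statistics.variance` for the n-1 denominator.

```python
import random
import statistics


def average_sample_variances(population, sample_size, num_samples, rng):
    """Average the n and n-1 sample variances over many random samples.

    sample_size must be at least 2, because the n-1 version is undefined for n = 1.
    """
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(num_samples):
        sample = rng.choices(population, k=sample_size)  # with replacement
        total_n += statistics.pvariance(sample)         # divides by n
        total_n_minus_1 += statistics.variance(sample)  # divides by n - 1
    return total_n / num_samples, total_n_minus_1 / num_samples


# Hypothetical population of weights in grams (not the real Smartie data).
population = [0.89, 0.93, 0.95, 0.97, 0.98, 1.00, 1.01, 1.02, 1.04, 1.05, 1.07, 1.10]
sigma_squared = statistics.pvariance(population)  # population variance

rng = random.Random(0)
print(f"population variance = {sigma_squared:.6f}")
for n in (2, 3, 5, 10):
    avg_n, avg_n_minus_1 = average_sample_variances(population, n, 5000, rng)
    print(f"n = {n:2d}   average with n = {avg_n:.6f}   average with n-1 = {avg_n_minus_1:.6f}")
```

With enough samples, the average using n should sit below the population variance, most noticeably for small n, while the average using n-1 should land close to it.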

In Part 3 we will show mathematically why this adjustment, known as Bessel’s correction, is the right one to use. But first, let’s see if we can understand why the variance of a sample is usually smaller than the variance of the population from which it is drawn, and that is the subject of Part 2.

(3:55/4:33)

Now here is a fun fact for you to know. You can produce all the graphs you saw in this video without the aid of a computer! To do this, you could take a sample of 2 items from a population, say a bunch of apples or sticks or anything else you can find, and then calculate the mean and variance of that sample and plot them. You could do this one hundred times, just like the spreadsheet software did for us, and you could use any sample size. Whether you try the experiment or not, it is important that you take a moment to think through the steps and the formulas used to generate the plots that you saw in this video.

Thanks for watching EasyStats! Bye for now!