Sampling Distributions

Transcript

Suppose you took a series of samples from a population. You would not expect the means of those samples to be identical to each other, nor would you expect their variances to match perfectly. In this video we examine how the means and variances of a set of samples from the same population are distributed. How do you think the means will be distributed? Maybe like this? And how do you think the standard deviations will be distributed? Maybe like this? I think it will look like this.

(0:40/5:53)

We will need lots of samples in order to construct these distributions, and we will obtain them using a spreadsheet. Let’s assume that the population has a Normal distribution with known mean and standard deviation. Based on the population mean and standard deviation that we specify here, the spreadsheet generates n elements for each sample and calculates the sample mean and variance. Analyses that use a computer to generate random data and process it, like we are doing here, are called Monte Carlo simulations.
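For viewers who would rather follow along in code than in the spreadsheet, here is a minimal sketch of the same kind of Monte Carlo simulation. It is not the spreadsheet's actual implementation: the function name `simulate_samples`, the fixed seed, and the use of Python's standard library are all assumptions made for illustration. The population mean of 2 and standard deviation of 3 are the values mentioned in the homework at the end of this video.

```python
import random
import statistics

def simulate_samples(mu, sigma, n, num_samples, seed=0):
    """Draw num_samples samples of size n from a Normal(mu, sigma) population.

    Returns two lists: the sample means and the sample variances.
    The variance uses the n-1 divisor, so n must be at least 2.
    (Hypothetical helper, not taken from the spreadsheet.)
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means, variances = [], []
    for _ in range(num_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
        variances.append(statistics.variance(sample))
    return means, variances

# 500 samples of size 3 from a population with mean 2 and standard deviation 3.
means, variances = simulate_samples(mu=2.0, sigma=3.0, n=3, num_samples=500)
print("average of the sample means:    ", statistics.fmean(means))
print("average of the sample variances:", statistics.fmean(variances))
```

Plotting histograms of `means` and `variances` with any charting tool should give pictures comparable to the ones the spreadsheet draws.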

(1:12/5:53)

To start, let’s use a sample size of three and generate 500 separate samples. Sometimes the spreadsheet takes a while to run, and this window lets you monitor progress of the simulation. The spreadsheet generates histograms for the means and variances that it calculates. These histograms and any equations that we might find to describe them are called sampling distributions. The distribution of the means looks Normal, but the distribution of the variances is visibly skewed.

It is not difficult to work out the distribution of sample means mathematically. Recall that the mean of a sample of size n is a random variable produced by adding the random variables that describe each element in the sample and dividing by n. If the population is Normally distributed, each of these xi values will be too, and their means and standard deviations will be the same as those of the population from which they are taken. This new random variable x-bar will have a Normal distribution, and its mean will be 1/n times the sum of the means of the sample elements, which simplifies to mu. Because the elements are drawn independently, the variance of x-bar will be equal to 1/n^2 times the sum of the variances of the sample elements, which simplifies to sigma^2/n. Thus, the sample means will be distributed according to a Normal distribution with a mean of mu and a standard deviation of sigma over root n. That is to say, the probability density function for x-bar will be N with mean mu and standard deviation sigma over root n, where the function N is defined as shown.
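Since the equations appear on screen rather than in this transcript, here is the same argument written out. It assumes the sample elements are independent draws from the population, and it uses the standard Normal density for the function N.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i,
  \qquad x_i \sim N(\mu, \sigma) \text{ independently} \\
E[\bar{x}] &= \frac{1}{n}\sum_{i=1}^{n} E[x_i]
  = \frac{1}{n}\, n\mu = \mu \\
\operatorname{Var}[\bar{x}] &= \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}[x_i]
  = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n} \\
p(\bar{x}) &= N\!\left(\bar{x};\, \mu,\, \frac{\sigma}{\sqrt{n}}\right)
  = \frac{\sqrt{n}}{\sigma\sqrt{2\pi}}
    \exp\!\left(-\frac{n(\bar{x}-\mu)^2}{2\sigma^2}\right)
\end{align*}
```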

(3:00/5:53)

Activating the “superimpose normal” box overlays this Normal curve on the histogram. If you generate enough samples of size n, the histogram agrees well with the theoretical answer.
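The comparison between histogram and curve can also be made numerically. The sketch below is an illustration, not the spreadsheet's method: it sorts simulated sample means into six bins, each one standard error wide, and sets the observed fraction in each bin next to the probability that a Normal distribution with mean mu and standard deviation sigma over root n assigns to it. The function name and bin layout are assumptions chosen for this example.

```python
import math
import random
import statistics

def compare_histogram_to_normal(mu, sigma, n, num_samples, seed=0):
    """Return rows of (bin_low, bin_high, observed_fraction, normal_probability).

    Hypothetical helper: bins run from mu - 3*se to mu + 3*se in steps of
    one standard error, se = sigma / sqrt(n).
    """
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)
    theory = statistics.NormalDist(mu, se)  # theoretical sampling distribution
    means = [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(num_samples)
    ]
    edges = [mu + k * se for k in range(-3, 4)]
    rows = []
    for low, high in zip(edges, edges[1:]):
        observed = sum(low <= m < high for m in means) / num_samples
        expected = theory.cdf(high) - theory.cdf(low)
        rows.append((low, high, observed, expected))
    return rows

for low, high, observed, expected in compare_histogram_to_normal(2.0, 3.0, 3, 5000):
    print(f"[{low:6.2f}, {high:6.2f})  observed {observed:.3f}  normal {expected:.3f}")
```

With more samples the observed fractions should settle closer to the Normal probabilities, which is the same agreement the overlay shows.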

Understanding how the variance values from a series of samples are distributed takes a little more work. Let’s define our sample variance, s-squared, as the sum of the squared differences between xi and x-bar, divided by n-1, as this definition gives an unbiased estimate of the population variance. The variance is built from a sum of squared Normally distributed variables, and sums of that kind are described by a family of so-called chi-squared distributions, the subject of one of our other videos. As we show in that video, the chi-squared distribution we need is the one with k=n-1 degrees of freedom.
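To see what "unbiased" means in practice, we can average many sample variances computed with each divisor. This is a small sketch under assumed settings (population mean 2, standard deviation 3, sample size 3); the function name is made up for the example. Dividing by n-1 should average out close to sigma squared, while dividing by n should come out low by a factor of (n-1)/n.

```python
import random
import statistics

def average_variance_estimates(mu, sigma, n, num_samples, seed=0):
    """Average the n-1 and n divisor variance estimates over many samples.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    unbiased_total = 0.0
    biased_total = 0.0
    for _ in range(num_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        unbiased_total += statistics.variance(sample)   # divides by n-1
        biased_total += statistics.pvariance(sample)    # divides by n
    return unbiased_total / num_samples, biased_total / num_samples

unbiased, biased = average_variance_estimates(mu=2.0, sigma=3.0, n=3, num_samples=20000)
print("population variance:        ", 3.0 ** 2)
print("average with n-1 divisor:   ", unbiased)
print("average with n divisor:     ", biased)
```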

The sampling distribution of the variance turns out to be as follows: the probability density of s-squared is equal to n-1 divided by sigma squared, times the chi-squared distribution with n-1 degrees of freedom, whose argument is s-squared times n-1 divided by sigma squared. The chi-squared distribution is defined by this function.
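Here is one way to write that density down in code, using the standard formula for the chi-squared density with k degrees of freedom. The function names, the choice of n=5 and sigma=3, and the simple midpoint integration are assumptions for this sketch. As a sanity check, it confirms numerically that the density integrates to about one and that its mean is about sigma squared.

```python
import math

def chi2_pdf(x, k):
    """Chi-squared density with k degrees of freedom (zero for x <= 0)."""
    if x <= 0:
        return 0.0
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

def variance_pdf(s2, n, sigma):
    """Density of the sample variance s2 for samples of size n.

    p(s2) = (n-1)/sigma^2 * chi2_pdf((n-1) * s2 / sigma^2, n-1)
    """
    k = n - 1
    scale = k / sigma ** 2
    return scale * chi2_pdf(scale * s2, k)

def integrate(f, a, b, steps=20000):
    """Midpoint-rule integral of f over [a, b] (illustrative helper)."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

n, sigma = 5, 3.0
area = integrate(lambda s2: variance_pdf(s2, n, sigma), 0.0, 200.0)
mean = integrate(lambda s2: s2 * variance_pdf(s2, n, sigma), 0.0, 200.0)
print("area under the density:", area)
print("mean of s-squared:     ", mean)
```

The upper limit of 200 is simply far enough into the tail for these settings that the remaining area is negligible.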

(4:20/5:53)

Let’s get back to the spreadsheet and use it to explore what happens when we change the sample size. If we set the scaling to “fixed”, then we will see how the widths of the mean and variance distributions change with sample size. If we set n=1, we get these distributions. We are in essence sampling single elements from the population, and their distribution is of course the same as that of the population, and the variance associated with a single point is zero. Choosing a sample size of 3 gives these curves, n=5 gives these and n=30, these.

In the real world, we usually take just one sample, and we want that sample to give us the best possible estimates of the population mean and variance. As our Monte Carlo simulations show, increasing the sample size makes the distributions of the sample mean and variance narrower, and that is why, in practice, we try to take as large a sample as practical.
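The narrowing can be put into numbers. The sketch below, again an illustration with made-up function names, measures the standard deviation of the simulated sample means and sample variances for n = 3, 5 and 30. For comparison it prints sigma over root n for the means and, for the variances, sigma squared times the square root of 2/(n-1), which is the standard result for a Normal population and is not derived in this video.

```python
import math
import random
import statistics

def spread_of_estimates(mu, sigma, n, num_samples, seed=0):
    """Standard deviations of the sample means and sample variances.

    Hypothetical helper: simulates num_samples samples of size n.
    """
    rng = random.Random(seed)
    means, variances = [], []
    for _ in range(num_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
        variances.append(statistics.variance(sample))
    return statistics.stdev(means), statistics.stdev(variances)

mu, sigma = 2.0, 3.0
for n in (3, 5, 30):
    sd_mean, sd_var = spread_of_estimates(mu, sigma, n, num_samples=5000)
    theory_mean = sigma / math.sqrt(n)
    theory_var = sigma ** 2 * math.sqrt(2 / (n - 1))
    print(f"n={n:2d}  sd of means {sd_mean:.3f} (theory {theory_mean:.3f})"
          f"  sd of variances {sd_var:.3f} (theory {theory_var:.3f})")
```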

(5:21/5:53)

Here is your homework: Change the population mean from 2 to 5 and see how the mean and variance distributions change. You might want to compare them with the histograms and curves in this video or with a second instance of the spreadsheet on your computer.

Now change the mean back to 2 and change the population standard deviation from 3 to 6. Now how do the distributions compare to the reference shapes in this video? Pay attention to the relative scales of the graph axes.

Thanks for watching EasyStats. Bye for now.