Transcript

Have you ever wondered where the Student t distribution comes from or what it really means? The goal of this video is to provide simple answers to both of those questions.

Suppose you were given a sample that had a mean of x-bar = 2.4 and a variance of s^2 = 1.5. You do not know the actual properties of the mystery population from which that sample was taken. Its mean might have been 2.1 and its variance 3.0, as suggested by this orange box, or it might have been the values shown by this red box or this yellow box. In fact, there are an infinite number of possible populations from which our sample might have come. That is to say, our sample might correspond to this population, or this one, or any other specific population.

Let’s organize these populations based on their means and variances.

(0:56/7:20)

Our intuition tells us that our sample is more likely to have come from a population with properties similar to its own than from one with very different properties. Our sampling distributions video showed that the probability of a population with mean mu and variance sigma^2 giving a sample of size n with mean x-bar is ...

That video also showed that the probability of it giving a sample with variance s^2 is ...
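The on-screen formulas are not captured in this transcript. For reference, the standard sampling results for a Normal population are sketched below; treating them as the formulas the video displays is an assumption. With n = 9 (a value inferred here, not stated in the transcript) they reproduce the figures 0.604 and 0.241 quoted in the sample calculations later on. The symbol f is notation introduced only for this sketch.

```latex
% Density of the sample mean: \bar{x} \sim N(\mu, \sigma^2 / n)
f(\bar{x} \mid \mu, \sigma^2)
  = \sqrt{\frac{n}{2\pi\sigma^2}}
    \exp\!\left( -\frac{n(\bar{x}-\mu)^2}{2\sigma^2} \right)

% Density of the (n-1) sample variance: (n-1) s^2 / \sigma^2 \sim \chi^2_{n-1}
f(s^2 \mid \sigma^2)
  = \frac{n-1}{\sigma^2}\,
    f_{\chi^2_{n-1}}\!\left( \frac{(n-1)\,s^2}{\sigma^2} \right),
\qquad
f_{\chi^2_{\nu}}(u) = \frac{u^{\nu/2-1} e^{-u/2}}{2^{\nu/2}\,\Gamma(\nu/2)}
```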

We can calculate these probabilities using a spreadsheet. We can choose the sample mean and variance, and here we use the n-1 version for the latter. Based on these values, the spreadsheet generates a selection of populations from which that sample might have arisen. The values along the top show the range of means it considered and the values along the side show the range of variances.

Each box in the working area represents a particular population. If we click this button, the cells in the working area report their mean and are coloured to reflect that value. Clicking this button shows their variances.

Clicking this button causes the cells to report the probability of obtaining our sample mean from the population represented by that cell. Notice that when the variance is high, there is an increased probability of a population with a mean farther from the sample mean producing the observed sample. As we will see in a few moments, this phenomenon is the key to understanding the Student-t distribution.

Clicking this button shows the probability of obtaining the sample variance from each of the trial populations, and that probability is governed by this formula. Finally, clicking this button gives the product of these two probabilities. Because the sample mean and sample variance of a Normal population are statistically independent, the probability of their joint occurrence is the product of their individual probabilities.

(3:08/7:20)

As in our Statistical Inference video, we assume that every population is equally likely to occur. As a result, the numbers in the working area now show the probability of each represented population giving a sample with the mean and variance of our sample, namely x-bar = 2.4 and s^2 = 1.5. The working area has been made large enough that the populations along its edge (the blue band) and those outside the area shown have very low probabilities of producing our sample, and so can be ignored. Thus, the trial populations shown represent all of the ones that could reasonably give rise to our sample.

Let’s do a couple of sample calculations. For example, what is the probability of a population with a mean of 2.7 and a variance of 3.0 producing a sample with a mean of 2.4? The answer is given by this expression, which ultimately gives 0.604. As in some of our previous videos, we have bent the rules for working with continuous distributions, and so this value should be considered only relative to parallel calculations for other potential populations. It should not be considered a properly scaled value.

The probability of a population with a mean of 2.7 and a variance of 3.0 producing a sample with a variance of 1.5 is found using this formula and it is 0.241. The probability of that population giving a sample with both the mean and variance of our sample is the product of these two probabilities, that is 0.145.
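The sample calculations above can be checked with a short Python sketch. It assumes Normal populations and the standard sampling densities for the mean and the n-1 variance. The sample size n = 9 is an assumption inferred from the quoted value 0.604 (the transcript does not state it), and the function names are hypothetical.

```python
import math

def mean_density(xbar, mu, var, n):
    """Normal density of the sample mean: x-bar ~ N(mu, var / n)."""
    return math.sqrt(n / (2 * math.pi * var)) * math.exp(-n * (xbar - mu) ** 2 / (2 * var))

def variance_density(s2, var, n):
    """Density of the n-1 sample variance: (n-1) * s2 / var ~ chi-squared(n-1)."""
    nu = n - 1
    u = nu * s2 / var  # chi-squared argument; the population variance is in the denominator
    chi2_pdf = u ** (nu / 2 - 1) * math.exp(-u / 2) / (2 ** (nu / 2) * math.gamma(nu / 2))
    return chi2_pdf * nu / var  # change of variable from u back to s2

n = 9  # ASSUMPTION: inferred from the quoted 0.604, not given in the transcript
p_mean = mean_density(2.4, mu=2.7, var=3.0, n=n)
p_var = variance_density(1.5, var=3.0, n=n)
p_joint = p_mean * p_var

print(round(p_mean, 3), round(p_var, 3), round(p_joint, 3))  # 0.604 0.241 0.145
```

As the narration notes, these are density values rather than properly scaled probabilities, so they are meaningful only relative to the same calculation for other trial populations.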

(4:44/7:20)

Now, getting back to our spreadsheet, if we wanted to figure out the probability that our sample came from one of the populations that has a mean of, say, 2.7, we simply sum the probabilities associated with the populations that have that property, and they are the ones in this column. Then, in accordance with Bayes’ Theorem, we divide this result by the sum of all of the column sums. The spreadsheet does these tallies and divisions for us, and it plots the resulting curve. That curve tells us the probability that our sample came from a population whose mean has any given value along the horizontal axis of the graph. The curve is one from the family of curves that William Gosset discovered in 1908 and that bears his pen-name, “Student”. The curve depends on nu, the number of degrees of freedom, and that is related to the sample size, n. When nu is large, the Student-t distribution approaches a Normal shape.
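The column tally can be imitated outside the spreadsheet. The sketch below is not the spreadsheet's actual implementation: the grid ranges and step sizes are hypothetical, n = 9 is an assumption, and the densities are the standard ones for Normal populations.

```python
import math

def joint_density(xbar, s2, mu, var, n):
    """Relative probability that population (mu, var) yields a sample with (xbar, s2)."""
    nu = n - 1
    p_mean = math.sqrt(n / (2 * math.pi * var)) * math.exp(-n * (xbar - mu) ** 2 / (2 * var))
    u = nu * s2 / var
    p_var = (nu / var) * u ** (nu / 2 - 1) * math.exp(-u / 2) / (2 ** (nu / 2) * math.gamma(nu / 2))
    return p_mean * p_var

xbar, s2, n = 2.4, 1.5, 9                           # n = 9 is an assumption
means = [xbar + 0.1 * k for k in range(-30, 31)]    # hypothetical grid of population means
variances = [0.2 * k for k in range(1, 61)]         # hypothetical grid of variances, 0.2 to 12.0

# One column per trial mean: sum the joint values over every trial variance.
column_sums = [sum(joint_density(xbar, s2, mu, var, n) for var in variances) for mu in means]

# Equal prior weight on every population, so Bayes' Theorem reduces to normalising.
total = sum(column_sums)
posterior = [c / total for c in column_sums]

for mu, p in zip(means[::10], posterior[::10]):
    print(f"mean {mu:4.1f}: {p:.5f}")
```

The resulting values peak in the column whose mean equals the sample mean and fall away symmetrically on either side, more slowly than a single Normal curve with variance s^2/n would.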

If you plotted the values associated with any one row of the working area, you would obtain a Normal distribution. The curves associated with rows near the top would be more stretched out, while those from rows near the bottom would be more compressed. The Student-t distribution can be considered a certain kind of sum across those Normal curves. Note that this summation is different from summing multiple random Normal variables together, as the latter would produce another Normal distribution.
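Written out, the sum across rows is a weighted sum of Normal curves, one per trial variance. The notation below is introduced only for this sketch and assumes the standard sampling densities for a Normal population; it is not taken from the video.

```latex
% Column sum for a trial mean \mu, up to a normalising constant
p(\mu \mid \bar{x}, s^2) \;\propto\; \sum_{\sigma^2}
  \underbrace{\sqrt{\frac{n}{2\pi\sigma^2}}
    \exp\!\left( -\frac{n(\bar{x}-\mu)^2}{2\sigma^2} \right)}_{\text{Normal in } \mu \text{, width set by } \sigma^2/n}
  \;\times\;
  \underbrace{f(s^2 \mid \sigma^2)}_{\text{weight of that row}}
```

Each term is a Normal curve centred on x-bar, and the weight says how plausible that row's variance is given the sample variance. The wide curves from high-variance rows are what fatten the tails of the total.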

(6:15/7:20)

The results shown are approximate since a rather coarse grid has been used, but they illustrate the underlying concepts. As you can see, the Student-t distribution that the spreadsheet generates has wider tails than a Normal distribution, a result that can be confirmed by plotting the theoretical equations for the Normal and Student-t distributions.
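A minimal sketch of that comparison evaluates the two textbook density formulas in standardised form at a few points. The choice nu = 8 is hypothetical and used only for illustration.

```python
import math

def normal_pdf(z):
    """Standard Normal density."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def student_t_pdf(t, nu):
    """Student-t density with nu degrees of freedom (log-gamma avoids overflow for large nu)."""
    log_coef = math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2) - 0.5 * math.log(nu * math.pi)
    return math.exp(log_coef - (nu + 1) / 2 * math.log1p(t * t / nu))

nu = 8  # hypothetical degrees of freedom, for illustration only
for z in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(f"{z:3.1f}  normal={normal_pdf(z):.5f}  t={student_t_pdf(z, nu):.5f}")
```

Near the centre the Student-t curve sits slightly below the Normal curve, and in the tails it sits above it. Raising nu brings the two curves together.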

In case you are interested, row sums indicate the relative probability of obtaining a sample with the same variance as our sample from different potential populations, and those sums are related to the chi squared distribution. Note that it is the number in the denominator of the chi squared argument that changes as you move up the table. So, the curve you obtain is a reversed and rescaled curve compared to the standard chi squared curve. Any similarities you might observe between this curve and a standard chi squared curve are simply coincidental.
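The shape of the row-sum curve can be sketched by holding the sample variance fixed and letting the population variance vary. This assumes the mean grid is wide enough that the Normal term contributes nearly the same total to every row, so that each row sum is roughly proportional to the variance density alone. The grid and n = 9 are assumptions, and the function name is hypothetical.

```python
import math

def variance_density(s2, var, n):
    """Density of the n-1 sample variance s2 for a Normal population with variance var."""
    nu = n - 1
    u = nu * s2 / var  # the population variance sits in the denominator of the argument
    return (nu / var) * u ** (nu / 2 - 1) * math.exp(-u / 2) / (2 ** (nu / 2) * math.gamma(nu / 2))

s2, n = 1.5, 9                                   # n = 9 is an assumption
variances = [0.1 * k for k in range(1, 121)]     # hypothetical grid, 0.1 to 12.0
curve = [variance_density(s2, var, n) for var in variances]

peak_var = variances[curve.index(max(curve))]
print(f"curve peaks near a population variance of {peak_var:.1f}")
```

Because it is the population variance that changes from row to row, the curve is traced against the reciprocal of the usual chi squared argument, which is why it looks reversed and rescaled.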

We hope you now understand the origin of the Student t distribution and have a better appreciation of its meaning. Thanks for watching.