
A series of Java applets was written to help explain concepts in probability and statistics. One of them is shown here.
In setting up problems of this type, we assume that the data (xi, yi) arise from a physical process in which two variables, denoted by the upper-case letters X and Y, are linearly related. The relationship between them is assumed to take the form
(1) Y = A + BX.
This is called the population regression equation because it describes the population or process from which the data points (xi, yi) are taken. In the Java applet shown below, the parameters A and B are set in the Population Regression Equation section at the upper left of the applet. The default values are A=0 and B=1.
The actual population or process may contain some variability. The standard error of estimate, Sy|x, is defined as the standard deviation of the spread of the points about the population regression equation. It may arise from variability in the original population or process, or from measurement errors introduced during data collection. Thus the actual process might be described by the equation
(2) Y = A + BX + E,
where E is a random variable with a standard deviation equal to Sy|x. Usually, E is assumed to be normally distributed. Note also that, in this approach, the error is implicitly assumed to arise entirely in y, while x is assumed to be without error. Individual data points that arise from the above equation would be
(3) yi = A + Bxi + Ei.
In the Linear Regression Applet shown below, Sy|x can be set by the user.
The Generation of Data Points section of the applet allows the user to specify how many data points are to be generated using equation (3) and to specify the range of the data (X Minimum and X Maximum). The user can also specify whether the data points are to be distributed Uniformly or Randomly over the specified range.
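The data generation described above can be sketched in a few lines of Java. This is only an illustration, not the applet's actual source: the class and method names are hypothetical, "Uniformly" is assumed to mean evenly spaced points that include both ends of the range, "Randomly" is assumed to mean independent uniform draws over the range, and E is assumed to be normally distributed with standard deviation Sy|x.

```java
import java.util.Random;

/** Hypothetical sketch of generating points from y_i = A + B*x_i + E_i (equation (3)). */
final class RegressionDataSketch {

    /** Returns n x-values on [xMin, xMax], either evenly spaced or drawn at random. */
    static double[] generateX(int n, double xMin, double xMax, boolean evenlySpaced, Random rng) {
        double[] x = new double[n];
        for (int i = 0; i < n; i++) {
            if (evenlySpaced) {
                // Assumption: "Uniformly" means equal spacing that includes both endpoints.
                x[i] = (n == 1) ? xMin : xMin + i * (xMax - xMin) / (n - 1);
            } else {
                // Assumption: "Randomly" means independent uniform draws over the range.
                x[i] = xMin + rng.nextDouble() * (xMax - xMin);
            }
        }
        return x;
    }

    /** Returns y_i = a + b*x_i + E_i, with E_i normal with mean 0 and standard deviation sYX. */
    static double[] generateY(double[] x, double a, double b, double sYX, Random rng) {
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            y[i] = a + b * x[i] + sYX * rng.nextGaussian();
        }
        return y;
    }

    public static void main(String[] args) {
        Random rng = new Random(42L); // fixed seed so repeated runs give the same points
        double[] x = generateX(5, 0.0, 10.0, true, rng);
        double[] y = generateY(x, 0.0, 1.0, 1.0, rng); // A=0, B=1 as in the applet defaults
        for (int i = 0; i < x.length; i++) {
            System.out.printf("%.3f  %.3f%n", x[i], y[i]);
        }
    }
}
```

Setting sYX to zero places every point exactly on the population regression line, which is a convenient check on the generator.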
The regression analysis is run by clicking on the Run Regression Simulation button. A different set of data is generated and a new regression analysis is carried out each time the button is clicked. Some intermediate and final calculations are given in the sidebar to the right of the output graph. Note that to facilitate programming, all confidence intervals are calculated using a normal distribution, even though a Student-t distribution would be more appropriate when the number of data points is less than about 30.
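For readers who want to reproduce the core of the analysis by hand, the following Java sketch computes the ordinary least-squares estimates a and b of the line y=a+bx, together with the sample estimate of the standard error of estimate. The class and method names are hypothetical, and the n-2 divisor is the usual textbook choice; the applet's own intermediate calculations are not listed in this text and may be organized differently.

```java
/** Hypothetical sketch of an ordinary least-squares fit of y = a + b*x. */
final class LeastSquaresSketch {

    /**
     * Returns {a, b, s}: intercept, slope, and sample standard error of estimate.
     * Requires at least 3 points and at least two distinct x values.
     */
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double xBar = 0.0, yBar = 0.0;
        for (int i = 0; i < n; i++) {
            xBar += x[i];
            yBar += y[i];
        }
        xBar /= n;
        yBar /= n;

        // Sums of squares and cross-products about the means.
        double sxx = 0.0, sxy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - xBar;
            sxx += dx * dx;
            sxy += dx * (y[i] - yBar);
        }
        double b = sxy / sxx;
        double a = yBar - b * xBar;

        // Residual sum of squares about the fitted line.
        double sse = 0.0;
        for (int i = 0; i < n; i++) {
            double r = y[i] - (a + b * x[i]);
            sse += r * r;
        }
        double s = Math.sqrt(sse / (n - 2)); // n - 2 degrees of freedom
        return new double[] {a, b, s};
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0, 4.0, 5.0};
        double[] y = {2.1, 3.9, 6.2, 7.8, 10.1}; // made-up illustrative values
        double[] fit = fit(x, y);
        System.out.printf("a=%.4f  b=%.4f  s=%.4f%n", fit[0], fit[1], fit[2]);
    }
}
```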
The user can determine whether or not to show the calculated regression line (y=a+bx), the 90% Confidence Interval for the Conditional Mean, and the 90% Confidence Interval for an Individual y Given x (i.e., for an additional data point).
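The two kinds of interval differ only in one term, which the following Java sketch makes explicit. It uses the standard textbook half-width formulas with the normal quantile 1.645 in place of a Student-t value, in keeping with the normal approximation mentioned above. The names are hypothetical and the formulas are an assumption about what the applet computes, since its source is not reproduced here. Here s is the sample standard error of estimate, n the number of points, xBar the mean of the x values, and sxx the sum of squared deviations of x about xBar.

```java
/** Hypothetical sketch of 90% interval half-widths about the fitted line at a given x0. */
final class ConfidenceBandSketch {

    /** Two-sided 90% quantile of the standard normal distribution (approximate). */
    static final double Z_90 = 1.645;

    /** Half-width of the 90% confidence interval for the conditional mean at x0. */
    static double meanHalfWidth(double s, int n, double xBar, double sxx, double x0) {
        double dx = x0 - xBar;
        return Z_90 * s * Math.sqrt(1.0 / n + dx * dx / sxx);
    }

    /** Half-width of the 90% interval for an individual (additional) y at x0. */
    static double individualHalfWidth(double s, int n, double xBar, double sxx, double x0) {
        double dx = x0 - xBar;
        // The extra 1.0 accounts for the scatter of a single new point about the line.
        return Z_90 * s * Math.sqrt(1.0 + 1.0 / n + dx * dx / sxx);
    }

    public static void main(String[] args) {
        double s = 1.0, xBar = 5.0, sxx = 100.0; // made-up illustrative values
        int n = 10;
        for (double x0 = 0.0; x0 <= 10.0; x0 += 2.5) {
            System.out.printf("x0=%.1f  mean: +/-%.3f  individual: +/-%.3f%n",
                    x0,
                    meanHalfWidth(s, n, xBar, sxx, x0),
                    individualHalfWidth(s, n, xBar, sxx, x0));
        }
    }
}
```

Both intervals are narrowest at x0 = xBar and widen toward the ends of the range, and the interval for an individual y is always the wider of the two.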
Basic Operation:
With all boxes checked except for the 90% Confidence Interval for Individual Y Given X, run the regression analysis several times and determine the fraction of cases where Y=A+BX crosses the 90% Confidence Interval for the Conditional Mean. You would expect it to cross these curves approximately 10% of the time, since what is shown are the 90% confidence curves.
Use the simulation to investigate the effect of changes to such parameters as Sy|x and the Number of Data Points.
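The experiment under Basic Operation can also be approximated numerically. The Java sketch below repeats the simulation many times and reports the fraction of runs in which the population line Y=A+BX lies outside the 90% Confidence Interval for the Conditional Mean. It is a simplified stand-in for the applet, not its source: the names are hypothetical, the x values are assumed to be evenly spaced, the check is made at a single value x0 rather than along the whole curve, and the interval uses the normal quantile 1.645. Because the interval is computed point by point, checking the whole range instead of one x0 would give a larger fraction.

```java
import java.util.Random;

/** Hypothetical Monte Carlo sketch of how often the population line misses the 90% band at x0. */
final class CoverageSketch {

    static final double Z_90 = 1.645; // normal approximation, as described in the text

    /** Fraction of runs in which A + B*x0 lies outside the interval for the conditional mean. */
    static double missFraction(int runs, int n, double a, double b, double sYX,
                               double xMin, double xMax, double x0, Random rng) {
        int misses = 0;
        double[] x = new double[n];
        double[] y = new double[n];
        for (int r = 0; r < runs; r++) {
            // Generate one data set from y_i = A + B*x_i + E_i (requires n >= 3).
            double xBar = 0.0, yBar = 0.0;
            for (int i = 0; i < n; i++) {
                x[i] = xMin + i * (xMax - xMin) / (n - 1);
                y[i] = a + b * x[i] + sYX * rng.nextGaussian();
                xBar += x[i];
                yBar += y[i];
            }
            xBar /= n;
            yBar /= n;

            // Least-squares fit of y = aHat + bHat*x.
            double sxx = 0.0, sxy = 0.0;
            for (int i = 0; i < n; i++) {
                double dx = x[i] - xBar;
                sxx += dx * dx;
                sxy += dx * (y[i] - yBar);
            }
            double bHat = sxy / sxx;
            double aHat = yBar - bHat * xBar;

            // Sample standard error of estimate with n - 2 degrees of freedom.
            double sse = 0.0;
            for (int i = 0; i < n; i++) {
                double res = y[i] - (aHat + bHat * x[i]);
                sse += res * res;
            }
            double s = Math.sqrt(sse / (n - 2));

            // Compare the population line with the interval at x0.
            double d0 = x0 - xBar;
            double halfWidth = Z_90 * s * Math.sqrt(1.0 / n + d0 * d0 / sxx);
            if (Math.abs((aHat + bHat * x0) - (a + b * x0)) > halfWidth) {
                misses++;
            }
        }
        return (double) misses / runs;
    }

    public static void main(String[] args) {
        Random rng = new Random(2024L);
        for (int n : new int[] {5, 10, 30, 100}) {
            double f = missFraction(5000, n, 0.0, 1.0, 1.0, 0.0, 10.0, 5.0, rng);
            System.out.printf("n=%d  fraction outside: %.3f%n", n, f);
        }
    }
}
```

With many data points the fraction should settle near 10%. With only a few points it is noticeably larger, which reflects the use of the normal distribution where a Student-t distribution would be more appropriate.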
Note that the HTML source code of this page allows the user to adjust the colors and line widths in the output graphics, so that the graphics can be customized for classroom or desktop use. The colors and line widths in the original version (the version on the original web site) were optimized for use in a large classroom with a high-resolution video projector (at least 1024x768 pixels).
Comments on this applet are welcome at brodland@uwaterloo.ca