In the following discussion there will be several windows for you to paste or type code into. All of the windows do the same thing ... just use the one closest to wherever you are when you are asked to produce some computer results. There will also be example code for you to use. It will look like this:
# Here is an example of example code
x <- rnorm(100, 5, 2)   # 100 random normal numbers with
                        # mean = 5 and std. dev. = 2
# The "#" sign indicates a comment that will explain what the
# code is supposed to be doing
A lot of the code will be reused. Only the new part of the code in each code snippet will have comments explaining what it is doing. If you want, you can edit the code after you have pasted it into the windows.
The simple linear regression model says that the average response, for a given value of the predictor, is given by the equation

mean of y = beta0 + beta1 * x

where beta0 is the intercept and beta1 is the slope.
So the average response is a line. The actual data we observe have responses that are scattered around this line. The scatter is often called the error, and it's represented by the greek letter epsilon. The epsilon in the following model is what is causing the scatter of the data around the line:

y = beta0 + beta1 * x + epsilon
If there weren't an error term, all of the data would fall on the line.
The following code snippet will create the line representing the average value of the response ... cut it out, paste it into the code window, and see what the line looks like (use the back button to come back to this page).
x <- rnorm(100, 5, 2)   # 100 random normal numbers with
                        # mean = 5 and std. dev. = 2
y <- 2 + 3*x            # this is for a line with intercept = 2
                        # and slope = 3
plot(x, y, type="n")    # set up the plotting area
abline(2, 3, col=2)     # put in a line with intercept = 2
                        # and slope = 3
Not very interesting, was it? Well, try plotting some data with the following code snippet. Erase everything in the above window (use the button) and then cut and paste the following snippet.
x <- rnorm(100, 5, 2)
y <- 2 + 3*x
plot(x, y, type="n")
abline(2, 3, col=2)
points(x, y)            # add the data to the plot

That was a little better, but all of the data is on the line ... there is no error term. Let's add a little error and see what happens. In the following code snippet the line with rnorm(100, 0, 1) adds an error term that has a normal distribution with mean 0 and standard deviation 1. That's what the 0 and 1 mean in rnorm(100, 0, 1) (the 100 says to generate 100 error terms, one for each of the 100 x values).
x <- rnorm(100, 5, 2)
y <- 2 + 3*x + rnorm(100, 0, 1)   # add error term with std. dev. = 1
plot(x, y, type="n")
abline(2, 3, col=2)
points(x, y)
There, that's a little more like it. Looks like regression data, but those data points are pretty close to the line. Recall that the error term has a distribution (I know, you don't recall that particular fact, but look in your notes or the book ... it's there). The distribution of the error (epsilon in our notation) is normal with mean 0 and standard deviation denoted by sigma.
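If you want to actually see that error distribution, here is a little sketch (this one isn't part of the lesson's code windows, just something extra): it draws a big batch of epsilons and plots a histogram, which should look like the familiar bell curve centered at 0.

eps <- rnorm(10000, 0, 1)   # 10000 draws from the error distribution
                            # (mean 0, std. dev. 1)
hist(eps, breaks=50)        # should look bell-shaped, centered at 0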
In the code snippet you just ran, the normal part comes from using the rnorm function, the mean of 0 comes from the 0 in the second place, and sigma is specified as 1, the last entry. Go back to the code window and change the value for sigma (try some numbers like 2 or 7 or .5) and see what happens to the plot. You can change the value for the mean as well ... see what happens.
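If editing and resubmitting gets tedious, here is one way to see several values of sigma at once. This is just a sketch, not part of the lesson's code windows; it reuses the same x values so the only thing changing between the three plots is the error.

x <- rnorm(100, 5, 2)
par(mfrow=c(1, 3))                 # three plots side by side
for (sigma in c(0.5, 2, 7)) {
  y <- 2 + 3*x + rnorm(100, 0, sigma)
  plot(x, y, main=paste("sigma =", sigma))
  abline(2, 3, col=2)              # the true line is the same each time
}
par(mfrow=c(1, 1))                 # reset the plotting layout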
Keep in mind that the line in the plot is the true regression line. Normally all we see is the data, and then we guess at what the line looks like. Let's see what happens when we estimate the line from the data. The following code snippet uses least squares to fit a line to the data. The fitted line is then added to the graph as a dashed line with color 4. The colors you see depend on your browser. On the one I'm using the true regression line is red and the fitted line is blue ... they might be pretty close together, so look closely.

x <- rnorm(100, 5, 2)
y <- 2 + 3*x + rnorm(100, 0, 1)
plot(x, y, type="n")
abline(2, 3, col=2)
points(x, y)
fit1 <- lm(y ~ x)                  # fitting a regression line
abline(coef(fit1), lty=2, col=4)   # adding the fitted line
summary(fit1)                      # printing out the results
Go back to the last computer output page (use the forward button) and look at the residual standard error in the regression results printout. It should be close to 1 since that is the value that we put in for the standard deviation of the error term. Try changing the value for the standard deviation (just edit the code window) and see what happens to the standard error in the regression printout. Also, watch what happens to the Multiple R-squared value and the plot.
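If you'd rather not edit and resubmit over and over, here is a sketch (again, extra, not part of the lesson's windows) that loops over several values of sigma and pulls the residual standard error and R-squared directly out of the summary object:

x <- rnorm(100, 5, 2)
for (sigma in c(0.5, 1, 2, 7)) {
  y <- 2 + 3*x + rnorm(100, 0, sigma)
  s <- summary(lm(y ~ x))
  cat("sigma =", sigma,
      "  residual std. error =", round(s$sigma, 2),
      "  R-squared =", round(s$r.squared, 3), "\n")
}

The residual standard error should land near whatever sigma you put in, and R-squared should drop as sigma grows.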
Another thing to try is to just keep resubmitting the same code; since the numbers come from a random number generator, the results should represent what would happen if you took different samples from the same population. This might give you some idea about the sampling distribution of various regression estimators. Look at how the regression coefficient estimates, R-squared, and the standard error change as you resample from the same model.
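If you want to automate the resampling, here is a sketch (not part of the lesson's windows) that refits the same model 1000 times and looks at the sampling distribution of the estimated slope. The true slope is 3, so the histogram should be centered near 3.

slopes <- replicate(1000, {
  x <- rnorm(100, 5, 2)
  y <- 2 + 3*x + rnorm(100, 0, 1)
  coef(lm(y ~ x))[2]      # keep just the slope estimate
})
hist(slopes)              # should be centered near the true slope, 3
mean(slopes)              # close to 3
sd(slopes)                # an estimate of the slope's standard error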
One last thing to try for this lesson. The underlying model assumes that the true regression line is linear ... what if it isn't? The following code snippet uses a quadratic function (x squared) as the underlying model instead of a straight line. Try it and see how things work out. Change some of the parameters, such as the standard deviation of the error or the power on x, and see what changes in the output.
x <- rnorm(100, 5, 2)
x <- sort(x)                     # sort the x's (for plotting)
meanOfy <- 2 + 3*x^2             # mean of the y's is a quadratic
                                 # function of x; to change the power
                                 # on x change the 2 in x^2
y <- meanOfy + rnorm(100, 0, 5)  # add the error term to get data;
                                 # this error has a std. dev. of 5
plot(x, y, type="n")
lines(x, meanOfy, col=2)         # add the true mean curve
points(x, y)
fit1 <- lm(y ~ x)                # still fit a straight line
abline(coef(fit1), lty=2, col=4)
summary(fit1)
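One extra thing you could try (this isn't in the lesson's code, just a sketch): after running the snippet above, fit a model that actually includes the x-squared term and add its fitted curve to the plot. The I() wrapper is how you tell lm to treat x^2 as a squared predictor.

fit2 <- lm(y ~ x + I(x^2))           # fit with both x and x^2 terms
lines(x, fitted(fit2), lty=3, col=3) # add the quadratic fit to the plot
summary(fit2)

The quadratic fit should track the true mean curve much better than the straight line did, and its R-squared should be noticeably higher.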