Simple Linear Regression Computations
The following steps can be used in simple (univariate) linear regression model development and testing:
1. Plot the data y vs. x. Here y is the dependent variable and x is the independent variable. Check visually whether the points formed by all pairs (y,x) appear to fall in a linear pattern (a straight line). If the relationship between the dependent and independent variables does not appear to be linear, do not use simple linear regression; in that case, investigate other regression model types (polynomial regression, nonlinear regression).
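2. Estimate the model parameters $\beta_0$ and $\beta_1$ by $b_0$ and $b_1$ of the estimated model $\hat{y} = b_0 + b_1 x$, using the least-squares formulas
$$b_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad b_0 = \bar{y} - b_1\bar{x},$$
where $\bar{x}$ and $\bar{y}$ are the sample means of x and y, and n is the number of data points.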
3. Present the data and the model in the same graph: all data points (y,x) and the line $\hat{y} = b_0 + b_1 x$.
Note: The model line should go through the center of the data point grouping, so that about half of the points are located above and half of the points are located below the line.
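Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual least-squares estimates (slope equal to the sum of cross-deviations divided by the sum of squared x-deviations, intercept equal to the mean of y minus the slope times the mean of x). The function name `fit_simple_ols` and the data values are hypothetical and do not come from the text.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope estimate
    b0 = y_bar - b1 * x_bar    # intercept estimate
    return b0, b1


# Hypothetical illustration data (not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

b0, b1 = fit_simple_ols(x, y)
y_hat = [b0 + b1 * xi for xi in x]   # fitted values to draw as the model line
print(f"y_hat = {b0:.3f} + {b1:.3f} x")
```

The list `y_hat` holds the points of the model line; drawing it over a scatter of the (x, y) pairs with any plotting tool gives the graph described in step 3.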
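4. Calculate the sums of squares (SST, SSE, SSR) for model and parameter testing, and evaluate the test statistics $t_{calc,\beta_0}$, $t_{calc,\beta_1}$ and $F_{calc}$. Here
$$SST = \sum_{i=1}^{n}(y_i-\bar{y})^2, \qquad SSE = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2, \qquad SSR = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2, \qquad SST = SSR + SSE,$$
$$t_{calc,\beta_0} = \frac{b_0}{s_{b_0}}, \qquad t_{calc,\beta_1} = \frac{b_1}{s_{b_1}}, \qquad F_{calc} = \frac{SSR/1}{SSE/(n-2)},$$
where n is the number of data points and $s_{b_0}$ and $s_{b_1}$ are the standard errors of the estimates:
$$s_{b_1} = \sqrt{\frac{SSE/(n-2)}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}, \qquad s_{b_0} = \sqrt{\frac{SSE}{n-2}\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right)}.$$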
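The quantities in step 4 can be computed directly from their definitions. The sketch below assumes the standard textbook forms: SST, SSE and SSR as sums of squared deviations, standard errors based on the mean squared error SSE/(n-2), and F as SSR divided by SSE/(n-2). The helper name `regression_summary` and the data are hypothetical.

```python
import math


def regression_summary(x, y):
    """Sums of squares and test statistics for y_hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]

    sst = sum((yi - y_bar) ** 2 for yi in y)                # total
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # error
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # regression

    mse = sse / (n - 2)                     # mean squared error
    se_b1 = math.sqrt(mse / s_xx)           # standard error of the slope
    se_b0 = math.sqrt(mse * (1.0 / n + x_bar ** 2 / s_xx))  # of the intercept
    return {
        "SST": sst, "SSE": sse, "SSR": ssr,
        "t_b0": b0 / se_b0, "t_b1": b1 / se_b1,
        "F": ssr / mse,
    }


# Hypothetical illustration data (not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

stats = regression_summary(x, y)
for name, value in stats.items():
    print(f"{name:>5} = {value:.4f}")
```

Two checks are useful when computing by hand: SST should equal SSR + SSE, and in simple linear regression the F statistic should equal the square of the slope's t statistic.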
5. Test parameter significance using a t-test (i.e., test whether the parameters are significantly different from zero).
Note: For significant parameters (i.e., parameters significantly different from zero in either the positive or the negative direction) you are looking for test statistic values that are large in absolute value, so that the null hypotheses can be rejected. The larger a test statistic value is in absolute terms, the more significant the associated variable. A good rule of thumb is to have t-test statistic values above +4 or below -4. A test statistic value inside this interval signifies that the associated variable is either not significant or borderline significant. To obtain more accurate critical values, please refer to the t-table in Statistical Tables.
6. Test the significance of the overall regression using an F-test. This test will determine if a significant proportion of total variability in the data (measured by SST) can be attributed to the relationship between variables (measured by SSR), or if total variability is primarily due to randomness (measured by SSE).
The regression (model) is significant when a significant proportion of the total variability (SST) is attributable to the relationship between the variables (SSR). Conversely, the regression (model) is not significant when most of the total variability (SST) is attributable to randomness (SSE).
Note: For a significant regression you are looking for a large F-value. The larger the F-value, the more significant the regression. Please analyze the above F-test statistic formula. In that formula the numerator contains the regression sum of squares, SSR, and the denominator the error sum of squares, SSE. This ratio increases as the errors, and hence SSE, decrease. Errors decrease when data points move closer to the model line. Compare the earlier Cases 5A, 5B and 5C. Which one of those cases is likely to have the smallest SSE, the smallest SSR, the largest SSR?
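The effect described in the note can be seen numerically. In the sketch below, the same set of deviations from a line is scaled down step by step, so the points move closer to the line. It assumes the F statistic is SSR divided by SSE/(n-2); the function name `f_statistic`, the underlying line y = 1 + 2x and the deviation pattern are all hypothetical choices made for illustration.

```python
def f_statistic(x, y):
    """F = SSR / (SSE / (n - 2)) for the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    return ssr / (sse / (n - 2))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
# Hypothetical deviations from the line; chosen to sum to zero and to be
# uncorrelated with x, so the fitted line stays y = 1 + 2x at every scale.
deviations = [1.0, -2.0, 2.0, -2.0, 1.0]

f_values = []
for scale in (2.0, 1.0, 0.5):   # smaller scale = points closer to the line
    y = [1.0 + 2.0 * xi + scale * d for xi, d in zip(x, deviations)]
    f_values.append(f_statistic(x, y))
    print(f"scale = {scale:3.1f}  F = {f_values[-1]:.3f}")
```

Because the deviations are constructed not to change the fitted line, SSR stays fixed while SSE shrinks with the square of the scale, so F grows as the points tighten around the line.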
7. Determine the coefficient of correlation r. Recall that the coefficient of correlation r measures the strength and direction of the linear relationship between y and x.
8. Determine the coefficient of determination $r^2$. The coefficient of determination $r^2$ measures the proportion of the variation of the dependent variable that is explained by the model.
Note: Did you notice that the coefficient of determination is the correlation coefficient squared? While r varies between -1 and +1, $r^2$ varies between 0 and 1.
For example, in the above Case 5A we would say that about 2.33% ($r^2$ = 0.0233) of the variability in the data is explained by the model (97.67% of the variability is not captured). This is a very bad outcome. However, in Case 5B about 94.75% ($r^2$ = 0.9475) of the variability in the data is explained by the model (only 5.25% of the variability is not captured). This, on the other hand, is very good. Even without any calculation, by visually analyzing the graphs alone, you should be able to conclude that Case 5B is a quite strong simple linear regression case, whereas Cases 5A and 5C are weak linear regression cases.
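Steps 7 and 8 can be checked numerically. The sketch below computes r from the standard sample formula (sum of cross-deviations divided by the square root of the product of the two sums of squared deviations) and then squares it. The function name `correlation` and the data are hypothetical; they are not the data behind Cases 5A to 5C.

```python
import math


def correlation(x, y):
    """Sample correlation coefficient r between x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical illustration data (not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

r = correlation(x, y)
r2 = r ** 2   # coefficient of determination for simple linear regression
print(f"r = {r:.4f}, r^2 = {r2:.4f}")
print(f"{100 * r2:.2f}% of the variability is explained by the model")
```

For a simple linear regression fitted by least squares, this squared correlation equals SSR/SST, which is why it can be read as the share of variability explained by the model.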
9. Carry out an analysis of the residuals to verify whether the Ordinary Least Squares (OLS) assumptions hold. The OLS assumptions in regression state that the errors are independent and approximately normally distributed with mean zero and a constant variance, i.e. $\varepsilon_i \sim N(0, \sigma^2)$ independently for all i.
Note: A simple way to do this is to plot the residuals $e_i = y_i - \hat{y}_i$ against the estimated response $\hat{y}_i$. If the OLS assumptions hold, then this plot should display points scattered randomly in an approximately horizontal band of uniform width. You can easily find the residuals by subtracting the values found using the model, $\hat{y}$, from the original y data values.
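The residual calculation in step 9 can be sketched as follows. The code fits the line by least squares, subtracts the fitted values from the observed y values, and lists each residual next to its fitted value, which are the pairs one would plot. The helper name `residuals_vs_fitted` and the data are hypothetical, and no plotting library is assumed.

```python
def residuals_vs_fitted(x, y):
    """Return (y_hat, residuals) for the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    residuals = [yi - yh for yi, yh in zip(y, y_hat)]   # e_i = y_i - y_hat_i
    return y_hat, residuals


# Hypothetical illustration data (not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

y_hat, residuals = residuals_vs_fitted(x, y)
print(" fitted  residual")
for yh, e in zip(y_hat, residuals):
    print(f"{yh:7.3f}  {e:8.3f}")
print(f"sum of residuals = {sum(residuals):.2e}")
```

With an intercept in the model, the residuals always sum to zero up to rounding, so a zero mean by itself says little; what matters in the plot is the absence of curvature, trends, or a band that widens or narrows along the fitted values.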