Polynomial Regression - CASE

Household Survey

This case continues the household survey modeling case from Module 3, Multiple Linear Regression. Recall that the purpose of the survey was to determine whether household income could be predicted with reasonable accuracy from data about households and heads of households.

Here is a brief review of the situation. Data on the following variables were gathered:

One hundred complete surveys were received. It was believed that the sample represented the demographic characteristics of the population of the target area quite well.

It was expected that there would be a significant relationship between annual income and one or more of the other variables. There was no clear expectation about the relationship between income and family size. In particular, however, it was believed that, on average, there is a positive relationship between income and

The analysis showed that there was a strong relationship between the annual income and educational level, as well as the annual income and the size of the residence. Please recall that income was treated as the dependent variable, and all other variables were considered to be independent variables.

In this section we will continue where we left off. We will try to improve the models based on an analysis of the residual plots. The model improvement will consist of the following

Note: Please keep in mind that all statements made here with respect to polynomial regression are also valid in other regression modeling.

Analysis of Residual Plots

The above animation shows the two residual plots for the simple linear regression models from the previous module. Both residual plots, for years of school completed and for size of residence, show a pattern that first decreases and then increases. These curvilinear patterns suggest that adding polynomial terms to the simple linear regression models may improve the models.

This is exactly what we will do. We will first graphically fit a second-degree polynomial for each case using the Microsoft Excel Trendline function, and then use the Regression Analysis Tool to develop the polynomial models and conduct the tests.

Improving the Model - Income vs. Education

In the above animation we again used the MS Excel Trendline function, first to fit a simple linear regression line to the income vs. education data. As we saw earlier, the simple linear regression model is ŷ = 11772x - 100944 with R² = 0.6385, where x is years of school completed and ŷ is the predicted annual income. This means that the model explains about 63.85% of the variability in the data. Second, the animation shows a second-degree polynomial model fitted to the data. This model is ŷ = 1588.6x² - 32591x + 192105 with R² = 0.7797. This appears to be a remarkable improvement. However, we do not yet know whether the parameters are significant. Next, we will develop this polynomial model using the Regression Analysis Tool and conduct the tests.
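To get a feel for how the two Trendline models differ, we can evaluate both at a few values of x. The sketch below uses the coefficients quoted above; the function names and the chosen years of schooling are illustrative assumptions, and some of these x values may lie outside the range of the survey data, where neither model should be trusted.

```python
def linear_income(years):
    """Simple linear Trendline model for income vs. education."""
    return 11772 * years - 100944


def quadratic_income(years):
    """Second-degree Trendline model for income vs. education."""
    return 1588.6 * years ** 2 - 32591 * years + 192105


# The parabola a*x^2 + b*x + c with a > 0 has its minimum at x = -b / (2a).
turning_point = 32591 / (2 * 1588.6)
print(f"Quadratic model turns upward at about {turning_point:.1f} years")

# Illustrative x values only; they are not taken from the survey.
for years in (8, 12, 16, 20):
    print(f"{years:2d} years: linear = {linear_income(years):10.1f}, "
          f"quadratic = {quadratic_income(years):10.1f}")
```

The quadratic model decreases up to its turning point and increases afterwards, which is consistent with the curved pattern in the residual plot of the simple linear model.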

The polynomial model - income vs. education

The table below gives the Microsoft Excel Regression output for two models: the simple linear regression model and the complete polynomial model. Please study this output very carefully. The animated tables change slowly to give you a chance to study the numbers. Please wait, sit back and look!

The simple linear regression model has Fcalc = 173.1 > Fcritical. This value indicates that the regression model is highly significant, i.e., a significant proportion of the variability is due to the relationship between the variables. R² = 0.639 suggests that about 63.9% of the variability in the data is explained by the model.

The complete second-order polynomial model has Fcalc = 171.6 > Fcritical. This value indicates that the polynomial regression model is also highly significant. Please notice that this F-value is slightly smaller than the one from the simple linear regression model. Yes, we said earlier that a bigger F-value is better. We also said that we would use common sense. Here the F-values are very close to each other, and at the same time there is a notable improvement in the R² value, from 0.639 with the simple linear regression model to 0.780 with the polynomial model. This cannot be ignored. The polynomial model explains about 78.0% of the variability in the data, compared to 63.9% with the simple linear regression model.
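The F-values and R² values above are tied together by the standard relation F = (R²/k) / ((1 - R²)/(n - k - 1)), where k is the number of independent-variable terms in the model and n is the number of observations. The following minimal sketch assumes that all n = 100 completed surveys were used in both regressions; the function name is our own.

```python
def f_statistic(r_squared, n_obs, n_terms):
    """Overall regression F-value computed from R-squared.

    n_terms counts the independent-variable terms (intercept excluded).
    """
    df_regression = n_terms
    df_residual = n_obs - n_terms - 1
    return (r_squared / df_regression) / ((1 - r_squared) / df_residual)


n = 100  # assumption: all 100 surveys were used in each model

f_linear = f_statistic(0.6385, n, 1)      # x only
f_polynomial = f_statistic(0.7797, n, 2)  # x and x squared

print(f"simple linear model: F is about {f_linear:.1f}")
print(f"polynomial model:    F is about {f_polynomial:.1f}")
```

Both results agree with the reported values of 173.1 and 171.6 up to rounding. The relation also shows why F can drop slightly even though R² rises: the second term costs one more degree of freedom in the numerator, and here that roughly cancels the gain in explained variability.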

The model parameter t-test results and confidence intervals show that all parameters in both models are significant.

We use the residual plots to analyze visually whether or not the ordinary least squares (OLS) assumptions are supported. Here we want to see in particular whether the polynomial model resulted in any improvement in the residual plot. You can see that there is some improvement, but the residual plot is still not 'perfect'. This means that the OLS assumptions continue to be violated. One reason for this may be outliers. You might want to check whether removing significant outliers improves the model and the residual plot. You might also investigate introducing higher-order polynomial terms, or interaction terms, into the model.

Please recall that a residual plot sometimes shows the estimated errors (deviations between the data points and the model line) against the estimated model values, and sometimes the errors are plotted against significant independent variables (here 'years of school completed'). We are looking for a random pattern of points (independence of errors) forming a horizontal band of equal width (normality of errors and constancy of error variance).

We will leave further development of this model, Income vs. Education, and move on to look at the model Income vs. Size of Residence.

Improving the Model - Income vs. Size of Residence

In the above animation we again used the MS Excel Trendline function, first to fit a simple linear regression line to the income vs. size of residence data. As we saw earlier, the simple linear regression model is ŷ = 54.367x - 45345 with R² = 0.8233, where x is the size of the residence and ŷ is the predicted annual income. This means that the model explains about 82.33% of the variability in the data. Second, the animation shows the second-degree polynomial model fitted to the data. This model is ŷ = 0.0149x² - 11.244x + 16813 with R² = 0.9058. Again, there appears to be a notable improvement. However, we do not yet know whether the parameters and the overall regression are significant. Next, we will develop this polynomial model using the Regression Analysis Tool and conduct the tests.

The polynomial model - income vs. size of residence

The tables below were generated using the Microsoft Excel Regression Analysis Tool. They show the software output for three models: the simple linear regression model, the complete second-degree polynomial model, and the improved polynomial model. Please study the tables very carefully. The animated tables change slowly to allow you to study the numbers. Please wait, sit back and look!

The simple linear regression model has Fcalc = 456.5 > Fcritical. This value indicates that the regression model is highly significant, i.e., a significant proportion of the variability is due to the relationship between the variables. R² = 0.823 suggests that about 82.3% of the variability in the data is explained by the model.

The complete second-order polynomial model has Fcalc = 466.4 > Fcritical. This value indicates that the polynomial regression model is highly significant. R² = 0.906 suggests that about 90.6% of the variability in the data is explained by the model.

The improved second-order polynomial model has Fcalc = 917.9 > Fcritical. This value indicates that the polynomial regression model is highly significant. R² = 0.904 suggests that about 90.4% of the variability in the data is explained by the model. Please notice the remarkable increase in the F-value from model to model.
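Why does the F-value almost double while R² hardly changes? A short worked calculation helps. It uses the standard relation between F and R², with k the number of independent-variable terms, and it assumes that all n = 100 surveys were used in each model. Because the R² values are rounded to three decimals, the results differ slightly from the Excel output.

```latex
\begin{align*}
F &= \frac{R^2 / k}{(1 - R^2) / (n - k - 1)} \\[4pt]
\text{complete model } (k = 2,\ R^2 = 0.906):\quad
F &\approx \frac{0.906 / 2}{0.094 / 97} \approx 467 \\[4pt]
\text{improved model } (k = 1,\ R^2 = 0.904):\quad
F &\approx \frac{0.904 / 1}{0.096 / 98} \approx 923
\end{align*}
```

Dropping one term halves the divisor k in the numerator, while the explained variability stays almost the same. The improved model explains nearly as much with one term fewer, and the F-value rewards that.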

The simple linear regression model parameter t-test results and confidence intervals show that both parameters are significant.

The complete second-order polynomial model parameter t-test results and confidence intervals show that parameters b0 and b2 are significant, but b1 (the coefficient of the linear term x) is not. Based on this, we will rerun the model without b1 and see if there is any improvement.

The improved second order polynomial model parameter t-test results and confidence intervals show that both remaining parameters are significant.

We use the residual plots to analyze visually whether or not the ordinary least squares (OLS) assumptions are supported. Here we want to see in particular whether the polynomial model resulted in any improvement in the residual plot. You can see that in this second case, too, there is some improvement, but the residual plot is still not 'perfect'. This means that the OLS assumptions continue to be violated. One reason for this may be outliers. You might want to check whether removing significant outliers improves the model and the residual plot. You might also investigate introducing higher-order polynomial terms, or interaction terms, into the model.

Please recall that a residual plot sometimes shows the estimated errors (deviations between the data points and the model line) against the estimated model values, and sometimes the errors are plotted against significant independent variables (here 'size of residence'). We are looking for a random pattern of points (independence of errors) forming a horizontal band of equal width (normality of errors and constancy of error variance).