Data Analysis and Modeling Project - Household Survey
This project is based on the data and the case presented in Modules 3 and 4. By now you should already be quite familiar with the data and the survey. The goal of this project is to give you an opportunity to practice most aspects of regression modeling covered thus far in this course.
Note: Many tasks are repetitive on purpose, but with the help of software they can be completed quite easily. There will be a lot of model output. It is your responsibility to be highly selective in summarizing, analyzing and presenting your results. You may use any software package capable of statistical analysis. Recommended software includes Microsoft Excel and Statistica by StatSoft. If you use Microsoft Excel, please make sure that the statistical function add-ins and the Analysis ToolPak (which provides regression) are loaded.
You may recall from the discussion in Modules 3 and 4 that the survey was conducted to determine if the income of a head of household can be predicted from data consisting of information relating to households and heads of households. We already covered a few different scenarios, and generated a few models. In this project you are asked to go a little further.
One hundred complete surveys were received. The sample was believed to represent the demographic characteristics of the target-area population well.
Data on the following variables were gathered:
Here income shall be treated as the dependent variable, and all other variables are considered to be independent variables.
Using the software package of your choice, carry out the following tasks and answer the following questions. Summarize your answers, analysis and selected output in a written report not exceeding 10 pages. Forward your written report to me as a file attachment in Microsoft Word format by the due date (see the calendar of events).
Tasks and questions
1. Create scatter plots of all variable pairs. Visually analyze the plots with respect to relationships between variables. Briefly summarize each relationship, and show scatter plots of selected representative relationships.
2. Determine correlation coefficients for all variable pairs. Present correlation coefficients in a tabular form. Briefly analyze the correlations.
3. In Modules 3 and 4 we concluded that there appears to be a strong relationship between the income of the head of household (y) and the number of years of education (x1) and/or the size of residence (x2).
For this project we assume that this relationship can be modeled using regression. We also assume that the 'best' regression model is one of the possible simple linear, multiple linear or polynomial models, considering the following complete second-order polynomial model as the largest model:
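For reference, a complete second-order polynomial model in two predictors is conventionally written as shown below. The coefficient labels and the ordering of the terms are assumptions made for illustration; the course materials may number them differently.

```latex
\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \varepsilon
\]
```

Here y is income, x1 is years of education, x2 is size of residence, and \varepsilon is the random error term.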
a. How many models and sub-models of the complete second-order polynomial model are possible? Please list all possible models.
b. For each model in part a, carry out the following tasks:
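For part a, the listing can be generated mechanically rather than by hand. The following is a minimal sketch in Python, under two assumptions that the project text does not confirm: every model keeps the intercept, and any non-empty subset of the five non-constant terms counts as a sub-model. If only hierarchical models are intended (for example, a squared term allowed only together with its linear term), the list would have to be filtered further. The names `TERMS`, `enumerate_submodels` and `format_model` are hypothetical.

```python
from itertools import combinations

# Non-constant terms of a complete second-order model in x1 and x2.
TERMS = ("x1", "x2", "x1^2", "x2^2", "x1*x2")


def enumerate_submodels(terms=TERMS):
    """Return every non-empty subset of terms, smallest models first.

    Assumption: the intercept is always included, so it is not listed
    among the selectable terms.
    """
    models = []
    for size in range(1, len(terms) + 1):
        for subset in combinations(terms, size):
            models.append(subset)
    return models


def format_model(subset):
    """Write one sub-model as a readable equation.

    Coefficients are renumbered b1, b2, ... within each model, so a label
    does not identify the same term across different models.
    """
    rhs = " + ".join(f"b{i}*{term}" for i, term in enumerate(subset, start=1))
    return f"y = b0 + {rhs}"


if __name__ == "__main__":
    models = enumerate_submodels()
    print(f"Number of candidate models: {len(models)}")
    for model in models:
        print(format_model(model))
```

Each printed line corresponds to one regression to run in the software package of your choice.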
The data are available in a file upon request from the author in Microsoft Excel format.
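If you prefer a scripting language to Excel or Statistica, the pairwise correlation table requested in task 2 could be produced along the following lines. This is a minimal sketch in Python using only the standard library. The column names and the five data rows are made-up placeholders, not the survey data, and the helper names `pearson` and `correlation_table` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return sxy / math.sqrt(sxx * syy)


def correlation_table(columns):
    """Return a nested dict: table[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


if __name__ == "__main__":
    # Placeholder values for illustration only; replace with the survey data.
    sample = {
        "income": [32.0, 45.0, 51.0, 60.0, 75.0],
        "education": [10, 12, 14, 16, 18],
        "residence": [12.0, 15.0, 14.0, 20.0, 24.0],
    }
    table = correlation_table(sample)
    names = list(sample)
    print(" " * 12 + "".join(f"{name:>12}" for name in names))
    for a in names:
        print(f"{a:<12}" + "".join(f"{table[a][b]:>12.3f}" for b in names))
```

The table is symmetric with ones on the diagonal, so in the report it is enough to present the lower or upper triangle.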