Simple Linear Regression - CASE
Measurement Accuracy of a Scale
This example looks at an actual sample data set of
90 observations. The data consist of weight measurements
(in milligrams, mg) of chemical mixtures. Two types
of scales are used: a mechanical field scale, and an
electronic scale in a controlled laboratory environment.
The electronic scale has been tested and found to be
accurate to within 0.1 milligrams 99.9% of the time. The laboratory conducts periodic
tests of its 800 field scales to determine whether the scales
continue to perform with acceptable accuracy. To be considered acceptable, a mechanical
scale must measure the weight of a sample to within 10 milligrams 95% of the time
over a weight range from 0 to 400 milligrams.
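The acceptance rule can be sketched in a few lines of Python. This is only an illustration: the function names and the readings are made up, and the sketch assumes the laboratory reading can be treated as the true weight because the electronic scale is so much more accurate.

```python
def within_tolerance_share(field_mg, lab_mg, tolerance_mg=10.0):
    """Return the fraction of paired readings that differ by at most tolerance_mg."""
    if len(field_mg) != len(lab_mg) or not field_mg:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(1 for f, l in zip(field_mg, lab_mg) if abs(f - l) <= tolerance_mg)
    return hits / len(field_mg)


def is_acceptable(field_mg, lab_mg, tolerance_mg=10.0, required_share=0.95):
    """Apply the rule: at least 95% of the readings must be within 10 mg."""
    return within_tolerance_share(field_mg, lab_mg, tolerance_mg) >= required_share


# Hypothetical readings for illustration only (not the case data).
field = [12.0, 101.5, 205.0, 298.0, 391.0]
lab = [10.0, 100.0, 200.0, 310.0, 395.0]
share = within_tolerance_share(field, lab)
print(share, is_acceptable(field, lab))
```

With only five readings the estimated share is very rough; a real check would use many more samples spread over the whole 0 to 400 milligram range.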
Among other questions, it was decided to study whether a simple linear regression model can adequately describe the relationship between the field scale measurements and the laboratory measurements.
For the purpose of model development, 90 samples were taken. First, the weight of each sample was measured using a mechanical field scale, and then the same sample was weighed using the electronic laboratory scale. Ideally, both devices would give the same reading for the same sample. However, because the devices differ in how they measure weight, both readings contain measurement error. The mechanical scales have been found to be less accurate than the electronic scales and need frequent calibration, but they are otherwise very robust and inexpensive.
The modeling and analysis consist of three parts.
Note: Please keep in mind that all statements made here about simple linear regression also apply to the multivariate and non-linear regression cases discussed later.
Data Plot and Regression Line Fitting
You can see from the plot and animation that the data points appear to increase in a straight-line fashion. There is some randomness, but overall the band of data points is quite narrow, with only a few outliers. Based on this visual analysis, we expect that a simple linear regression model will describe the relationship between the field and laboratory weight measurements quite well.
In the above animation we used the MS Excel Trendline function to
fit a linear regression line to the data set. The fitted model is
ŷ = 4.789 + 1.006x with
r² = 0.972, so the model explains 97.2% of
the variability in the data. However, from this information alone we cannot
conclude anything about the significance of the model or of its parameters.
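For readers who want to see what a trendline fit computes, here is a minimal least-squares sketch in plain Python. The function name fit_line and the five data pairs are hypothetical; they are not the 90 case observations and will not reproduce the coefficients quoted above.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1*x; returns (b0, b1, r2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross products about the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # r2 = 1 - (residual sum of squares) / (total sum of squares).
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return b0, b1, 1.0 - sse / sst


# Hypothetical paired readings in mg (illustration only).
x = [50.0, 100.0, 150.0, 200.0, 250.0]
y = [56.0, 104.0, 158.0, 204.0, 258.0]
b0, b1, r2 = fit_line(x, y)
print(f"y-hat = {b0:.3f} + {b1:.3f}x, r2 = {r2:.4f}")
```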
Regression Model Development, Parameter- and Model Testing
The Summary Output below is obtained using the MS Excel Analysis Tools. Most statistical software generates a similar table.
The first part of the table gives the correlation coefficient (r), the coefficient of determination (r²), the standard error, and the number of observations (n).
The second part of the table, titled ANOVA (for Analysis of Variance), contains the information for testing the overall significance of the model. This test is an F-test. The rows of the table partition the variability in the data into two groups: variability due to Regression, and variability due to error, or Residual. The columns of the table give the degrees of freedom (df), the Sums of Squares (SS), the Mean Squares (MS) (which are the independent error variance estimates), the F-test statistic, and the Significance of F.
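The quantities in the ANOVA part of the table can be computed directly from the data. The sketch below is a plain-Python illustration; the function name anova_table and the data are hypothetical and do not reproduce the case results.

```python
def anova_table(x, y):
    """Partition the variability of y for the fit y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    ss_reg = sum((fi - y_bar) ** 2 for fi in fitted)           # Regression SS
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # Residual SS
    df_reg, df_res = 1, n - 2
    ms_reg = ss_reg / df_reg
    ms_res = ss_res / df_res
    return {
        "df": (df_reg, df_res),
        "SS": (ss_reg, ss_res),
        "MS": (ms_reg, ms_res),
        "F": ms_reg / ms_res,
    }


# Hypothetical paired readings in mg (illustration only).
x = [50.0, 100.0, 150.0, 200.0, 250.0]
y = [56.0, 104.0, 158.0, 204.0, 258.0]
table = anova_table(x, y)
print(table)
```

The two Sums of Squares add up to the total variability of y about its mean, which is why the table needs only the Regression and Residual rows.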
Since Fcalc = 3005.3 > Fcritical, the regression model is highly significant,
i.e. a significant proportion of the variability is due to the relationship
between the variables. The value of Fcritical for df = (1, 88) is obtained from
a statistical table (F-table). The same conclusion can be reached
using the value in the Significance F column. This value can be compared to
a chosen level of significance (α).
The regression is significant if the Significance F value
is less than the chosen α.
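As a consistency check, the F statistic and r² are tied together in simple linear regression: since F = (SSR/df_reg) / (SSE/df_res) and r² = SSR / (SSR + SSE), it follows that r² = F·df_reg / (F·df_reg + df_res). The sketch below applies this to the reported values. The critical value of about 3.95 at the 5% level is an approximate F-table value and is an assumption, not a figure from the Summary Output.

```python
f_calc = 3005.3          # F statistic reported in the ANOVA table
df_reg, df_res = 1, 88   # degrees of freedom for Regression and Residual

# r2 recovered from F; it should agree with the reported 0.972 after rounding.
r2_from_f = (f_calc * df_reg) / (f_calc * df_reg + df_res)

# Approximate F-table value for df = (1, 88) at the 5% level (assumption).
f_critical = 3.95
model_is_significant = f_calc > f_critical

print(round(r2_from_f, 3), model_is_significant)
```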
Note: When you are comparing several models, the larger the F-value the more significant the model.
The lower part of the table summarizes the parameter estimates and the parameter tests. The parameter tests are t-tests. The estimated parameter values for b0 and b1 are given in the column titled Coeff, followed by their standard errors, t-statistics, P-values, and 95% confidence intervals. You can use either the t-test statistics or the P-values to conduct the tests.
Note: The three criteria are equivalent. If the t-test statistic falls into the critical region, then the P-value is smaller than the chosen level of significance and the confidence interval at the matching confidence level does not include zero. In that case the parameter is significant (i.e. significantly different from zero).
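The agreement between the t-statistic rule and the confidence-interval rule can be illustrated with a short sketch. The function name parameter_test and the coefficient and standard-error values are hypothetical (they are not taken from the Summary Output), and the critical value 1.987 is an approximate t-table value for a two-sided 5% test with 88 degrees of freedom.

```python
T_CRIT_88 = 1.987  # approximate two-sided 5% t-table value for df = 88 (assumption)


def parameter_test(coef, std_err, t_crit=T_CRIT_88):
    """Return the t statistic, the confidence interval and both decisions."""
    t_stat = coef / std_err
    lower = coef - t_crit * std_err
    upper = coef + t_crit * std_err
    significant_by_t = abs(t_stat) > t_crit
    significant_by_ci = not (lower <= 0.0 <= upper)
    return t_stat, (lower, upper), significant_by_t, significant_by_ci


# Hypothetical estimates for illustration only.
slope_result = parameter_test(coef=1.0, std_err=0.02)
intercept_result = parameter_test(coef=4.0, std_err=3.0)
print(slope_result)
print(intercept_result)
```

Both decisions come from the same inequality, |coef| > t_crit × std_err, so they cannot disagree.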
From the results one can see that the intercept b0 is not significant, whereas the slope parameter b1 is significant. This suggests that the model should be rerun without b0.
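Rerunning the model without b0 means fitting a line through the origin, for which the least-squares slope is b1 = Σxy / Σx². A minimal sketch follows, with a hypothetical function name and made-up data rather than the case observations.

```python
def fit_through_origin(x, y):
    """Least squares slope for the no-intercept model y = b1*x."""
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    return sum_xy / sum_xx


# Hypothetical paired readings in mg (illustration only).
x = [50.0, 100.0, 150.0, 200.0, 250.0]
y = [56.0, 104.0, 158.0, 204.0, 258.0]
b1_origin = fit_through_origin(x, y)
print(f"y-hat = {b1_origin:.4f}x")
```

Note that r² for a no-intercept model is commonly computed from uncentered sums of squares, so it is not directly comparable with the r² of the model that includes b0.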
Finally, we can use the residual plot to check visually whether the Ordinary Least Squares (OLS) assumptions are satisfied. Please note that the residual plot shows the estimated errors (the deviations between the data points and the model line) against the estimated model values. Here we are looking for a random pattern of points (independence of errors) forming a horizontal band (normality of errors and constancy of error variance).
From the residual plot you can see that the OLS assumptions are quite well supported.
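The points of a residual plot are simple to compute. The sketch below returns (fitted value, residual) pairs that could be passed to any plotting tool; the function name and the data are hypothetical.

```python
def residual_points(x, y):
    """Return (fitted value, residual) pairs for the fit y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual = observed value minus fitted value.
    return [(b0 + b1 * xi, yi - (b0 + b1 * xi)) for xi, yi in zip(x, y)]


# Hypothetical paired readings in mg (illustration only).
x = [50.0, 100.0, 150.0, 200.0, 250.0]
y = [56.0, 104.0, 158.0, 204.0, 258.0]
points = residual_points(x, y)
for fitted, resid in points:
    print(f"{fitted:8.2f} {resid:6.2f}")
```

For a model with an intercept, the residuals always sum to zero and are uncorrelated with the fitted values, so what matters in the plot is the pattern (curvature, funnel shapes, runs), not the average level.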