Simple Linear Regression Analysis: Difference between revisions
Revision as of 17:22, 30 June 2011
Simple Linear Regression Analysis
Introduction
Regression analysis is a statistical technique that attempts to explore and model the relationship between two or more variables. For example, an analyst may want to know if there is a relationship between road accidents and the age of the driver. Regression analysis forms an important part of the statistical analysis of the data obtained from designed experiments and is discussed briefly in this chapter. Every experiment analyzed in DOE++ includes regression results for each of the responses. These results, along with the results from the analysis of variance (explained in our "Analysis of Experiments" discussion), provide information that is useful to identify significant factors in an experiment and explore the nature of the relationship between these factors and the response. Regression analysis forms the basis for all DOE++ calculations related to the sum of squares used in the analysis of variance. The reason for this is explained in the last section of Chapter 6, Use of Regression to Calculate Sum of Squares. Additionally, DOE++ also includes a regression tool to see if two or more variables are related, and to explore the nature of the relationship between them. This chapter discusses simple linear regression analysis while Chapter 5 focuses on multiple linear regression analysis.
Simple Linear Regression Analysis
A linear regression model attempts to explain the relationship between two or more variables using a straight line. Consider the data obtained from a chemical process where the yield of the process is thought to be related to the reaction temperature (see Table 4.1). This data can be entered in DOE++ as shown in Figure 4.1 and a scatter plot can be obtained as shown in Figure 4.2. [Note] In the scatter plot, the yield, [math]\displaystyle{ y_i }[/math], is plotted for different temperature values, [math]\displaystyle{ x_i }[/math]. It is clear that no line can be found to pass through all points of the plot. Thus no functional relation exists between the two variables [math]\displaystyle{ x }[/math] and [math]\displaystyle{ Y }[/math]. [Note] However, the scatter plot does give an indication that a straight line may exist such that all the points on the plot are scattered randomly around this line. A statistical relation is said to exist in this case. The statistical relation between [math]\displaystyle{ x }[/math] and [math]\displaystyle{ Y }[/math] may be expressed as follows: (1)
- [math]\displaystyle{ Y=\beta_0+\beta_1{x}+\epsilon }[/math]
Table 4.1: Yield data observations of a chemical process at different values of reaction temperature.
Figure 4.1: Data entry in DOE++ for the observations in Table 4.1.
Figure 4.2: Scatter plot for the data in Table 4.1.
Eqn. (1) is the linear regression model that can be used to explain the relation between [math]\displaystyle{ x }[/math] and [math]\displaystyle{ Y }[/math] that is seen on the scatter plot above. In this model, the mean value of [math]\displaystyle{ Y }[/math] (abbreviated as [math]\displaystyle{ E(Y) }[/math]) is assumed to follow the linear relation [math]\displaystyle{ \beta_0+\beta_1{x} }[/math]:
- [math]\displaystyle{ E(Y)=\beta_0+\beta_1{x} }[/math]
The actual values of [math]\displaystyle{ Y }[/math] (which are observed as yield from the chemical process from time to time and are random in nature) are assumed to be the sum of the mean value, [math]\displaystyle{ E(Y) }[/math], and a random error term, [math]\displaystyle{ \epsilon }[/math]:
- [math]\displaystyle{ Y=E(Y)+\epsilon }[/math]
- [math]\displaystyle{ =\beta_0+\beta_1{x}+\epsilon }[/math]
The regression model here is called a simple linear regression model because there is just one independent variable, [math]\displaystyle{ x }[/math], in the model. In regression models, the independent variables are also referred to as regressors or predictor variables. The dependent variable, [math]\displaystyle{ Y }[/math], is also referred to as the response. The slope, [math]\displaystyle{ \beta_1 }[/math], and the intercept, [math]\displaystyle{ \beta_0 }[/math], of the line [math]\displaystyle{ E(Y)=\beta_0+\beta_1{x} }[/math] are called regression coefficients. The slope, [math]\displaystyle{ \beta_1 }[/math], can be interpreted as the change in the mean value of [math]\displaystyle{ Y }[/math] for a unit change in [math]\displaystyle{ x }[/math].
The random error term, [math]\displaystyle{ \epsilon }[/math], is assumed to follow the normal distribution with a mean of 0 and variance of [math]\displaystyle{ \sigma^2 }[/math]. Since [math]\displaystyle{ Y }[/math] is the sum of this random term and the mean value, [math]\displaystyle{ E(Y) }[/math] (which is a constant), the variance of [math]\displaystyle{ Y }[/math] at any given value of [math]\displaystyle{ x }[/math] is also [math]\displaystyle{ \sigma^2 }[/math]. Therefore, at any given value of [math]\displaystyle{ x }[/math], say [math]\displaystyle{ x_i }[/math], the dependent variable [math]\displaystyle{ Y }[/math] follows a normal distribution with a mean of [math]\displaystyle{ \beta_0+\beta_1{x_i} }[/math] and a standard deviation of [math]\displaystyle{ \sigma }[/math]. This is illustrated in the following figure.
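Figure 4.3: The normal distribution of [math]\displaystyle{ Y }[/math] for two values of [math]\displaystyle{ x }[/math]. Also shown is the true regression line and the values of the random error term, [math]\displaystyle{ \epsilon }[/math], corresponding to the two [math]\displaystyle{ x }[/math] values. The true regression line and [math]\displaystyle{ \epsilon }[/math] are usually not known.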
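The model assumptions above can be illustrated with a small simulation. The following is a minimal sketch, not part of DOE++: the coefficient values, the value of sigma and the helper name `simulate_yield` are all hypothetical and chosen only for illustration. It draws many observations of Y at one fixed value of x and checks that their sample mean and sample variance are close to beta0 + beta1*x and sigma^2.

```python
import random


def simulate_yield(x, beta0, beta1, sigma, rng):
    """Draw one observation Y = beta0 + beta1*x + eps, with eps ~ N(0, sigma^2)."""
    return beta0 + beta1 * x + rng.gauss(0.0, sigma)


# Hypothetical "true" parameters (in practice these are unknown).
beta0, beta1, sigma = 17.0, 2.0, 4.0
x_i = 80.0

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [simulate_yield(x_i, beta0, beta1, sigma, rng) for _ in range(10000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (len(draws) - 1)

print("expected mean:", beta0 + beta1 * x_i, "sample mean:", sample_mean)
print("expected variance:", sigma ** 2, "sample variance:", sample_var)
```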
Fitted Regression Line
The true regression line corresponding to Eqn. (1) is usually not known. However, the regression line can be estimated by estimating the coefficients [math]\displaystyle{ \beta_1 }[/math] and [math]\displaystyle{ \beta_0 }[/math] for an observed data set. The estimates, [math]\displaystyle{ \widehat{\beta}_1 }[/math] and [math]\displaystyle{ \widehat{\beta}_0 }[/math], are calculated using least squares. (For details on least square estimates refer to [19].) The estimated regression line, obtained using the values of [math]\displaystyle{ \widehat{\beta}_1 }[/math] and [math]\displaystyle{ \widehat{\beta}_0 }[/math], is called the fitted line. The least square estimates, [math]\displaystyle{ \widehat{\beta}_1 }[/math] and [math]\displaystyle{ \widehat{\beta}_0 }[/math], are obtained using the following equations: (2)
- [math]\displaystyle{ \widehat{\beta}_1 }[/math]=[math]\displaystyle{ \frac{\sum_{i=1}^n y_i x_i- \frac{(\sum_{i=1}^n y_i) (\sum_{i=1}^n x_i)}{n}}{\sum_{i=1}^n (x_i-\bar{x})^2} }[/math]
- [math]\displaystyle{ \widehat{\beta}_0=\bar{y}-\widehat{\beta}_1 \bar{x} }[/math]
(3)
where [math]\displaystyle{ \bar{y} }[/math] is the mean of all the observed values and [math]\displaystyle{ \bar{x} }[/math] is the mean of all values of the predictor variable at which the observations were taken. [math]\displaystyle{ \bar{y} }[/math] is calculated using [math]\displaystyle{ \bar{y}=(1/n)\sum_{i=1}^n y_i }[/math] and [math]\displaystyle{ \bar{x} }[/math] is calculated using [math]\displaystyle{ \bar{x}=(1/n)\sum_{i=1}^n x_i }[/math].
Once [math]\displaystyle{ \widehat{\beta}_1 }[/math] and [math]\displaystyle{ \widehat{\beta}_0 }[/math] are known, the fitted regression line can be written as: (4)
- [math]\displaystyle{ \widehat{y}=\widehat{\beta}_0+\widehat{\beta}_1 x }[/math]
where [math]\displaystyle{ \widehat{y} }[/math] is the fitted or estimated value based on the fitted regression model. It is an estimate of the mean value, [math]\displaystyle{ E(Y) }[/math]. The fitted value,[math]\displaystyle{ \widehat{y}_i }[/math] , for a given value of the predictor variable, [math]\displaystyle{ x_i }[/math] , may be different from the corresponding observed value, [math]\displaystyle{ y_i }[/math]. The difference between the two values is called the residual, [math]\displaystyle{ e_i }[/math]: (5)
- [math]\displaystyle{ e_i=y_i-\widehat{y}_i }[/math]
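Eqns. (2) to (5) translate directly into a few lines of code. The sketch below is only an illustration: the function name `fit_line` and the five data points are hypothetical and are not the data of Table 4.1.

```python
def fit_line(x, y):
    """Least square estimates of the intercept and slope, following Eqns. (2) and (3)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator of Eqn. (2): sum(y_i * x_i) - sum(y_i) * sum(x_i) / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(y) * sum(x) / n
    # Denominator of Eqn. (2): sum((x_i - x_bar)^2)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar  # Eqn. (3)
    return b0, b1


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]                 # Eqn. (4)
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # Eqn. (5)

print("intercept:", b0, "slope:", b1)
print("residuals:", residuals)
```

For a least squares fit that includes an intercept, the residuals sum to zero (up to floating point error), which is a convenient sanity check on any implementation.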
Calculation of the Fitted Line Using Least Square Estimates
The least square estimates of the regression coefficients can be obtained for the data in Table 4.1 using Eqns. (2) and (3) as follows:
[math]\displaystyle{ \widehat{\beta}_1 }[/math] = [math]\displaystyle{ \frac{\sum_{i=1}^n y_i x_i- \frac{(\sum_{i=1}^n y_i) (\sum_{i=1}^n x_i)}{n}}{\sum_{i=1}^n (x_i-\bar{x})^2} }[/math]
- =[math]\displaystyle{ \frac{322516-\frac{4158 \times 1871}{25}}{5679.36} }[/math]
- =[math]\displaystyle{ 1.9952 }[/math]
- [math]\displaystyle{ \approx 2.00 }[/math]
[math]\displaystyle{ \widehat{\beta}_0 = \bar{y}-\widehat{\beta}_1 \bar{x} }[/math]
- = [math]\displaystyle{ 166.32 - 1.9952 \times 74.84 }[/math]
- = [math]\displaystyle{ 17.0016 }[/math] (carrying the unrounded slope through the calculation)
- [math]\displaystyle{ \approx 17.00 }[/math]
Knowing [math]\displaystyle{ \widehat{\beta}_0 }[/math] and [math]\displaystyle{ \widehat{\beta}_1 }[/math], the fitted regression line is:
- [math]\displaystyle{ \widehat{y}=17.0016+1.9952x \approx 17+2x }[/math]
This line is shown in Figure 4.4.
Figure 4.4: Fitted regression line for the data in Table 4.1. Also shown is the residual for the 21st observation.
Once the fitted regression line is known, the fitted value of [math]\displaystyle{ Y }[/math] corresponding to any observed data point can be calculated. For example, the fitted value corresponding to the 21st observation in Table 4.1 is:
[math]\displaystyle{ \widehat{y}_{21} = \widehat{\beta}_0 + \widehat{\beta}_1 x_{21} }[/math]
- =[math]\displaystyle{ 17.0016 + 1.9952 \times 93 }[/math]
- =[math]\displaystyle{ 202.6 }[/math]
The observed response at this point is [math]\displaystyle{ y_{21}=194 }[/math]. Therefore, the residual at this point is:
[math]\displaystyle{ e_{21} = y_{21}-\widehat{y}_{21} }[/math]
- = [math]\displaystyle{ 194-202.6 }[/math]
- = [math]\displaystyle{ -8.6 }[/math]
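The arithmetic for the 21st observation can be checked in a few lines. This is only a sketch that plugs in the coefficient estimates and the observation quoted in the text; the variable names are arbitrary.

```python
# Coefficient estimates and the 21st observation, as quoted in the text.
b0_hat = 17.0016
b1_hat = 1.9952
x_21 = 93
y_21 = 194

y_hat_21 = b0_hat + b1_hat * x_21  # fitted value
e_21 = y_21 - y_hat_21             # residual = observed - fitted

print(round(y_hat_21, 1))  # 202.6
print(round(e_21, 1))      # -8.6
```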
In DOE++, fitted values and residuals are available using the Diagnostic icon in the Control Panel. The values are shown in Figure 4.5.
Figure 4.5: Fitted values and residuals for the data in Table 4.1.
Hypothesis Tests in Simple Linear Regression
The following sections discuss hypothesis tests on the regression coefficients in simple linear regression. These tests can be carried out if it can be assumed that the random error term, [math]\displaystyle{ \epsilon }[/math] , is normally and independently distributed with a mean of zero and variance of [math]\displaystyle{ \sigma^2 }[/math].
[math]\displaystyle{ t }[/math] Tests
The [math]\displaystyle{ t }[/math] tests are used to conduct hypothesis tests on the regression coefficients obtained in simple linear regression. A statistic based on the [math]\displaystyle{ t }[/math] distribution is used to test the two-sided hypothesis that the true slope, [math]\displaystyle{ \beta_1 }[/math], equals some constant value, [math]\displaystyle{ \beta_{1,0} }[/math]. [Note] The statements for the hypothesis test are expressed as:
- [math]\displaystyle{ H_0 }[/math] : [math]\displaystyle{ \beta_1 = \beta_{1,0} }[/math]
- [math]\displaystyle{ H_1 }[/math] : [math]\displaystyle{ \beta_{1}\ne\beta_{1,0} }[/math]
The test statistic used for this test is:
- [math]\displaystyle{ T_0=\frac{\widehat{\beta}_1-\beta_{1,0}}{se(\widehat{\beta}_1)} }[/math](6)
where [math]\displaystyle{ \widehat{\beta}_1 }[/math] is the least square estimate of [math]\displaystyle{ \beta_1 }[/math], and [math]\displaystyle{ se(\widehat{\beta}_1) }[/math] is its standard error. The value of [math]\displaystyle{ se(\widehat{\beta}_1) }[/math] can be calculated as follows:
- [math]\displaystyle{ se(\widehat{\beta}_1) }[/math] = [math]\displaystyle{ \sqrt{\frac{\frac{\displaystyle \sum_{i=1}^n e_i^2}{n-2}}{\displaystyle \sum_{i=1}^n (x_i-\bar{x})^2}} }[/math]
(7)
The test statistic, [math]\displaystyle{ T_0 }[/math], follows a [math]\displaystyle{ t }[/math] distribution with [math]\displaystyle{ (n-2) }[/math] degrees of freedom, where [math]\displaystyle{ n }[/math] is the total number of observations. The null hypothesis, [math]\displaystyle{ H_0 }[/math], is not rejected if the calculated value of the test statistic is such that:
- [math]\displaystyle{ -t_{\alpha/2,n-2}\lt T_0\lt t_{\alpha/2,n-2} }[/math]
where [math]\displaystyle{ t_{\alpha/2,n-2} }[/math] and [math]\displaystyle{ -t_{\alpha/2,n-2} }[/math] are the critical values for the two-sided hypothesis. [math]\displaystyle{ H_0 }[/math] is rejected if [math]\displaystyle{ T_0 }[/math] falls outside this interval. [math]\displaystyle{ t_{\alpha/2,n-2} }[/math] is the percentile of the [math]\displaystyle{ t }[/math] distribution corresponding to a cumulative probability of [math]\displaystyle{ (1-\alpha/2) }[/math] and [math]\displaystyle{ \alpha }[/math] is the significance level.
If the value of [math]\displaystyle{ \beta_{1,0} }[/math] used in Eqn. (6) is zero, then the hypothesis tests for the significance of regression. In other words, the test indicates if the fitted regression model is of value in explaining variations in the observations or if you are trying to impose a regression model when no true relationship exists between [math]\displaystyle{ x }[/math] and [math]\displaystyle{ Y }[/math]. Failure to reject [math]\displaystyle{ H_0:\beta_1=0 }[/math] implies that the data do not support a linear relationship between [math]\displaystyle{ x }[/math] and [math]\displaystyle{ Y }[/math]. This result may be obtained when the scatter plots of [math]\displaystyle{ Y }[/math] against [math]\displaystyle{ x }[/math] are as shown in 4.6 (a) and (b) of the following figure. Figure 4.6 (a) represents the case where no model exists for the observed data. In this case you would be trying to fit a regression model to noise or random variation. Figure 4.6 (b) represents the case where the true relationship between [math]\displaystyle{ x }[/math] and [math]\displaystyle{ Y }[/math] is not linear. Figure 4.6 (c) and (d) represent the case when [math]\displaystyle{ H_0:\beta_1=0 }[/math] is rejected, implying that a model does exist between [math]\displaystyle{ x }[/math] and [math]\displaystyle{ Y }[/math]. Figure 4.6 (c) represents the case where the linear model is sufficient. Figure 4.6 (d) represents the case where a higher order model may be needed.
Figure 4.6: Possible scatter plots of [math]\displaystyle{ Y }[/math] against [math]\displaystyle{ x }[/math]. Plots (a) and (b) represent cases when [math]\displaystyle{ H_0:\beta_1=0 }[/math] is not rejected. Plots (c) and (d) represent cases when [math]\displaystyle{ H_0:\beta_1=0 }[/math] is rejected.
A similar procedure can be used to test the hypothesis on the intercept, [math]\displaystyle{ \beta_0 }[/math]. The test statistic used in this case is:
- [math]\displaystyle{ T_0=\frac{\widehat{\beta}_0-\beta_{0,0}}{se(\widehat{\beta}_0)} }[/math](8)
where [math]\displaystyle{ \widehat{\beta}_0 }[/math] is the least square estimate of [math]\displaystyle{ \beta_0 }[/math], and [math]\displaystyle{ se(\widehat{\beta}_0) }[/math] is its standard error, which is calculated using:
[math]\displaystyle{ se(\widehat{\beta}_0) }[/math] = [math]\displaystyle{ \sqrt{\frac{\displaystyle\sum_{i=1}^n e_i^2}{n-2} [ \frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^n (x_i-\bar{x})^2} ]} }[/math] (9)
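The standard errors of Eqns. (7) and (9) and the statistics of Eqns. (6) and (8) can be sketched as follows. The function name `t_statistics`, the data and the choice of significance level are hypothetical. The critical value is not computed here; it is assumed to be looked up in a standard t table (2.353 is the tabulated value for a cumulative probability of 0.95 with 3 degrees of freedom).

```python
import math


def t_statistics(x, y, beta1_0=0.0, beta0_0=0.0):
    """Return (T0 for slope, T0 for intercept, se(b1), se(b0)) per Eqns. (6) to (9)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    # Sum of squared residuals divided by (n - 2)
    mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    se_b1 = math.sqrt(mse / s_xx)                            # Eqn. (7)
    se_b0 = math.sqrt(mse * (1.0 / n + x_bar ** 2 / s_xx))   # Eqn. (9)
    return (b1 - beta1_0) / se_b1, (b0 - beta0_0) / se_b0, se_b1, se_b0


# Hypothetical data, for illustration only (n = 5, so n - 2 = 3).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]

t_slope, t_intercept, se_b1, se_b0 = t_statistics(x, y)

t_crit = 2.353  # assumed table value: t with 3 degrees of freedom, alpha = 0.1 (two-sided)
print("slope:     T0 =", t_slope, "reject H0:", abs(t_slope) > t_crit)
print("intercept: T0 =", t_intercept, "reject H0:", abs(t_intercept) > t_crit)
```

With these made-up numbers the slope is clearly significant, while the hypothesis of a zero intercept is not rejected.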
Example 4.1
The test for the significance of regression for the data in Table 4.1 is illustrated in this example. The test is carried out using the [math]\displaystyle{ t }[/math] test on the coefficient [math]\displaystyle{ \beta_1 }[/math]. The hypothesis to be tested is [math]\displaystyle{ H_0:\beta_1=0 }[/math]. To calculate the statistic to test [math]\displaystyle{ H_0 }[/math], the estimate, [math]\displaystyle{ \widehat{\beta}_1 }[/math], and the standard error, [math]\displaystyle{ se(\widehat{\beta}_1) }[/math], are needed. The value of [math]\displaystyle{ \widehat{\beta}_1 }[/math] was obtained in Chapter 4, Fitted Regression Line. The standard error can be calculated using Eqn. (7) as follows:
Then, the test statistic can be calculated using the following equation:
The [math]\displaystyle{ p }[/math] value corresponding to this statistic based on the [math]\displaystyle{ t }[/math] distribution with 23 ([math]\displaystyle{ n-2=25-2=23 }[/math]) degrees of freedom can be obtained as follows:
Assuming that the desired significance level is 0.1, since [math]\displaystyle{ p }[/math] value < 0.1, [math]\displaystyle{ H_0:\beta_1=0 }[/math] is rejected, indicating that a relation exists between temperature and yield for the data in Table 4.1. Using this result along with the scatter plot of Figure 4.2, it can be concluded that the relationship between temperature and yield is linear.
In DOE++, information related to the [math]\displaystyle{ t }[/math] test is displayed in the Regression Information table as shown in Figure 4.7. In this table the [math]\displaystyle{ t }[/math] test for [math]\displaystyle{ \beta_1 }[/math] is displayed in the row for the term Temperature because [math]\displaystyle{ \beta_1 }[/math] is the coefficient that represents the variable temperature in the regression model. The columns labeled Standard Error, T Value and P Value represent the standard error, the test statistic for the [math]\displaystyle{ t }[/math] test and the [math]\displaystyle{ p }[/math] value for the [math]\displaystyle{ t }[/math] test, respectively. These values have been calculated for [math]\displaystyle{ \beta_1 }[/math] in this example. The Coefficient column represents the estimate of regression coefficients. For [math]\displaystyle{ \beta_1 }[/math], this value was calculated using Eqn. (2). The Effect column represents values obtained by multiplying the coefficients by a factor of 2. This value is useful in the case of two factor experiments and is explained in Chapter 7, Two Level Factorial Experiments. Columns Low CI and High CI represent the limits of the confidence intervals for the regression coefficients and are explained in Chapter 4, Confidence Interval on Regression Coefficients. The Variance Inflation Factor column displays values that give a measure of multicollinearity. The concept of multicollinearity is only applicable to multiple linear regression models and is explained in Chapter 5, Multiple Linear Regression Analysis.
Figure 4.7: Regression results for the data in Table 4.1.
Analysis of Variance Approach to Test the Significance of Regression
The analysis of variance (ANOVA) is another method to test for the significance of regression. As the name implies, this approach uses the variance of the observed data to determine if a regression model can be applied to the observed data. The observed variance is partitioned into components that are then used in the test for significance of regression.
Sum of Squares
The total variance (i.e. the variance of all of the observed data) is estimated using the observed data. As mentioned in Chapter 3, Statistical Background, the variance of a population can be estimated using the sample variance, which is calculated using the following relationship:
- [math]\displaystyle{ s^2=\frac{\displaystyle\sum_{i=1}^n (y_i-\bar{y})^2}{n-1} }[/math]
The quantity in the numerator of the previous equation is called the sum of squares. It is the sum of the square of deviations of all the observations, [math]\displaystyle{ y_i }[/math], from their mean, [math]\displaystyle{ \bar{y} }[/math]. In the context of ANOVA this quantity is called the total sum of squares (abbreviated [math]\displaystyle{ SS_T }[/math]) because it relates to the total variance of the observations. Thus:
- [math]\displaystyle{ SS_T=\sum_{i=1}^n (y_i-\bar{y})^2 }[/math]
(10)
The denominator in the relationship of the sample variance is the number of degrees of freedom associated with the sample variance. Therefore, the number of degrees of freedom associated with [math]\displaystyle{ SS_T }[/math], [math]\displaystyle{ dof(SS_T) }[/math], is [math]\displaystyle{ (n-1) }[/math]. [Note] The sample variance is also referred to as a mean square because it is obtained by dividing the sum of squares by the respective degrees of freedom. Therefore, the total mean square (abbreviated [math]\displaystyle{ MS_T }[/math]) is: (11)
- [math]\displaystyle{ MS_T=\frac{SS_T}{dof(SS_T)}=\frac{SS_T}{n-1} }[/math]
When you attempt to fit a regression model to the observations, you are trying to explain some of the variation of the observations using this model. If the regression model is such that the resulting fitted regression line passes through all of the observations, then you would have a "perfect" model (see Figure 4.8 (a)). In this case the model would explain all of the variability of the observations. Therefore, the model sum of squares (also referred to as the regression sum of squares and abbreviated [math]\displaystyle{ SS_R }[/math]) equals the total sum of squares; i.e. the model explains all of the observed variance:
- [math]\displaystyle{ SS_R=SS_T }[/math]
Figure 4.8: A perfect regression model will pass through all observed data points as shown in (a). Most models are imperfect and do not fit perfectly to all data points as shown in (b).
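For the perfect model, the regression sum of squares, [math]\displaystyle{ SS_R }[/math], equals the total sum of squares, [math]\displaystyle{ SS_T }[/math], because all estimated values, [math]\displaystyle{ \widehat{y}_i }[/math], will equal the corresponding observations, [math]\displaystyle{ y_i }[/math]. [math]\displaystyle{ SS_R }[/math] can be calculated using a relationship similar to the one for obtaining [math]\displaystyle{ SS_T }[/math] by replacing [math]\displaystyle{ y_i }[/math] by [math]\displaystyle{ \widehat{y}_i }[/math] in the relationship of [math]\displaystyle{ SS_T }[/math]. Therefore: (12)
- [math]\displaystyle{ SS_R=\sum_{i=1}^n (\widehat{y}_i-\bar{y})^2 }[/math]
The number of degrees of freedom associated with [math]\displaystyle{ SS_R }[/math], [math]\displaystyle{ dof(SS_R) }[/math], is one. [Note]
Based on the preceding discussion of ANOVA, a perfect regression model exists when the fitted regression line passes through all observed points. However, this is not usually the case, as seen in Figure 4.8 (b) or Figure 4.4. In both of these plots, a number of points do not follow the fitted regression line. This indicates that a part of the total variability of the observed data still remains unexplained. This portion of the total variability, or the total sum of squares, that is not explained by the model is called the residual sum of squares or the error sum of squares (abbreviated [math]\displaystyle{ SS_E }[/math]). The deviation for this sum of squares is obtained at each observation in the form of the residuals, [math]\displaystyle{ e_i }[/math]. The error sum of squares can be obtained as the sum of squares of these deviations: (13)
- [math]\displaystyle{ SS_E=\sum_{i=1}^n e_i^2=\sum_{i=1}^n (y_i-\widehat{y}_i)^2 }[/math]
The number of degrees of freedom associated with [math]\displaystyle{ SS_E }[/math], [math]\displaystyle{ dof(SS_E) }[/math], is [math]\displaystyle{ (n-2) }[/math]. [Note]
The total variability of the observed data (i.e. total sum of squares, [math]\displaystyle{ SS_T }[/math]) can be written using the portion of the variability explained by the model, [math]\displaystyle{ SS_R }[/math], and the portion unexplained by the model, [math]\displaystyle{ SS_E }[/math], as: (14)
- [math]\displaystyle{ SS_T=SS_R+SS_E }[/math]
The above equation is also referred to as the analysis of variance identity and can be expanded as follows: (15)
- [math]\displaystyle{ \sum_{i=1}^n (y_i-\bar{y})^2=\sum_{i=1}^n (\widehat{y}_i-\bar{y})^2+\sum_{i=1}^n (y_i-\widehat{y}_i)^2 }[/math]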
The deviations for the three sum of squares are shown in Figure 4.9.
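Figure 4.9: Scatter plots showing the deviations for the sum of squares used in ANOVA. (a) shows deviations for [math]\displaystyle{ SS_T }[/math], (b) shows deviations for [math]\displaystyle{ SS_R }[/math], and (c) shows deviations for [math]\displaystyle{ SS_E }[/math].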
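The analysis of variance identity can be verified numerically on any data set fitted by least squares. The sketch below uses hypothetical data and a hypothetical helper name, `anova_sums`; it computes the three sums of squares from their definitions and confirms that the total equals the sum of the other two.

```python
def anova_sums(x, y):
    """Return (SS_T, SS_R, SS_E) for a least squares straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    ss_t = sum((yi - y_bar) ** 2 for yi in y)                 # total
    ss_r = sum((fi - y_bar) ** 2 for fi in fitted)            # explained by the model
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # unexplained (residual)
    return ss_t, ss_r, ss_e


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]

ss_t, ss_r, ss_e = anova_sums(x, y)
print("SS_T =", ss_t, "SS_R =", ss_r, "SS_E =", ss_e)
print("identity holds:", abs(ss_t - (ss_r + ss_e)) < 1e-9)
```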
Mean Squares
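As mentioned previously, mean squares are obtained by dividing the sum of squares by the respective degrees of freedom. For example, the error mean square, [math]\displaystyle{ MS_E }[/math], can be obtained as: (16)
- [math]\displaystyle{ MS_E=\frac{SS_E}{dof(SS_E)}=\frac{SS_E}{n-2} }[/math]
The error mean square is an estimate of the variance, [math]\displaystyle{ \sigma^2 }[/math], of the random error term, [math]\displaystyle{ \epsilon }[/math], and can be written as:
- [math]\displaystyle{ \widehat{\sigma}^2=MS_E }[/math]
Similarly, the regression mean square, [math]\displaystyle{ MS_R }[/math], can be obtained by dividing the regression sum of squares by the respective degrees of freedom as follows:
- [math]\displaystyle{ MS_R=\frac{SS_R}{dof(SS_R)}=\frac{SS_R}{1} }[/math]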
F Test
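To test the hypothesis [math]\displaystyle{ H_0:\beta_1=0 }[/math], the statistic used is based on the [math]\displaystyle{ F }[/math] distribution. It can be shown that if the null hypothesis [math]\displaystyle{ H_0 }[/math] is true, then the statistic: (17)
- [math]\displaystyle{ F_0=\frac{MS_R}{MS_E} }[/math]
follows the [math]\displaystyle{ F }[/math] distribution with [math]\displaystyle{ 1 }[/math] degree of freedom in the numerator and [math]\displaystyle{ (n-2) }[/math] degrees of freedom in the denominator. [math]\displaystyle{ H_0 }[/math] is rejected if the calculated statistic, [math]\displaystyle{ F_0 }[/math], is such that:
- [math]\displaystyle{ F_0\gt f_{\alpha,1,n-2} }[/math]
where [math]\displaystyle{ f_{\alpha,1,n-2} }[/math] is the percentile of the [math]\displaystyle{ F }[/math] distribution corresponding to a cumulative probability of [math]\displaystyle{ (1-\alpha) }[/math] and [math]\displaystyle{ \alpha }[/math] is the significance level.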
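A short sketch of the F test follows, again on hypothetical data with a hypothetical function name, `f_and_t`. In simple linear regression the F statistic for the significance of regression equals the square of the t statistic for a zero slope, so both are computed and compared. The critical value 5.538 is assumed to come from a standard F table (cumulative probability 0.9, with 1 and 3 degrees of freedom).

```python
import math


def f_and_t(x, y):
    """Return (F0, T0) for testing a zero slope in a straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    ms_r = sum((fi - y_bar) ** 2 for fi in fitted) / 1.0                  # SS_R / 1
    ms_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - 2)     # SS_E / (n - 2)
    f0 = ms_r / ms_e
    t0 = b1 / math.sqrt(ms_e / s_xx)
    return f0, t0


# Hypothetical data, for illustration only (n = 5).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]

f0, t0 = f_and_t(x, y)
f_crit = 5.538  # assumed table value: F with (1, 3) degrees of freedom, alpha = 0.1

print("F0 =", f0, "T0^2 =", t0 ** 2)
print("reject H0:", f0 > f_crit)
```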
Example 4.2
The analysis of variance approach to test the significance of regression can be applied to the yield data in Table 4.1. To calculate the statistic, [math]\displaystyle{ F_0 }[/math], for the test, the sum of squares have to be obtained. The sum of squares can be calculated as shown next.
The total sum of squares can be calculated as:
The regression sum of squares can be calculated as:
The error sum of squares can be calculated as:
Knowing the sum of squares, the statistic to test [math]\displaystyle{ H_0:\beta_1=0 }[/math] can be calculated as follows:
The critical value at a significance level of 0.1 is [math]\displaystyle{ f_{0.1,1,23} }[/math]. Since [math]\displaystyle{ F_0\gt f_{0.1,1,23} }[/math], [math]\displaystyle{ H_0:\beta_1=0 }[/math] is rejected and it is concluded that [math]\displaystyle{ \beta_1 }[/math] is not zero. Alternatively, the [math]\displaystyle{ p }[/math] value can also be used. The [math]\displaystyle{ p }[/math] value corresponding to the test statistic, [math]\displaystyle{ F_0 }[/math], based on the [math]\displaystyle{ F }[/math] distribution with one degree of freedom in the numerator and 23 degrees of freedom in the denominator is:
Assuming that the desired significance is 0.1, since the [math]\displaystyle{ p }[/math] value < 0.1, [math]\displaystyle{ H_0:\beta_1=0 }[/math] is rejected, implying that a relation does exist between temperature and yield for the data in Table 4.1. Using this result along with the scatter plot of Figure 4.2, it can be concluded that the relationship that exists between temperature and yield is linear. This result is displayed in the ANOVA table as shown in Figure 4.10. Note that this is the same result that was obtained from the [math]\displaystyle{ t }[/math] test in Chapter 4, [math]\displaystyle{ t }[/math] Tests. The ANOVA and Regression Information tables in DOE++ represent two different ways to test for the significance of the regression model. In the case of multiple linear regression models these tables are expanded to allow tests on individual variables used in the model. This is done using extra sum of squares. Multiple linear regression models and the application of extra sum of squares in the analysis of these models are discussed in Chapter 5, Multiple Linear Regression Analysis. The term Partial appearing in Figure 4.10 relates to the extra sum of squares and is also explained in Chapter 5.
Figure 4.10: ANOVA table for the data in Table 4.1.
Confidence Intervals in Simple Linear Regression
A confidence interval represents a closed interval where a certain percentage of the population is likely to lie. For example, a 90% confidence interval with a lower limit of [math]\displaystyle{ A }[/math] and an upper limit of [math]\displaystyle{ B }[/math] implies that 90% of the population lies between the values of [math]\displaystyle{ A }[/math] and [math]\displaystyle{ B }[/math]. Out of the remaining 10% of the population, 5% is less than [math]\displaystyle{ A }[/math] and 5% is greater than [math]\displaystyle{ B }[/math]. (For details refer to [19].) This section discusses confidence intervals used in simple linear regression analysis.
Confidence Interval on Regression Coefficients
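A 100([math]\displaystyle{ 1-\alpha }[/math]) percent confidence interval on [math]\displaystyle{ \beta_1 }[/math] is obtained as follows: (18)
- [math]\displaystyle{ \widehat{\beta}_1\pm t_{\alpha/2,n-2}\cdot se(\widehat{\beta}_1) }[/math]
Similarly, a 100([math]\displaystyle{ 1-\alpha }[/math]) percent confidence interval on [math]\displaystyle{ \beta_0 }[/math] is obtained as: (19)
- [math]\displaystyle{ \widehat{\beta}_0\pm t_{\alpha/2,n-2}\cdot se(\widehat{\beta}_0) }[/math]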
Confidence Interval on Fitted Values
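A 100([math]\displaystyle{ 1-\alpha }[/math]) percent confidence interval on any fitted value, [math]\displaystyle{ \widehat{y}_i }[/math], is obtained as follows: (20)
- [math]\displaystyle{ \widehat{y}_i\pm t_{\alpha/2,n-2}\sqrt{\widehat{\sigma}^2 [ \frac{1}{n}+\frac{(x_i-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2} ]} }[/math]
It can be seen that the width of the confidence interval depends on the value of [math]\displaystyle{ x_i }[/math]. The width will be a minimum at [math]\displaystyle{ x_i=\bar{x} }[/math] and will widen as [math]\displaystyle{ |x_i-\bar{x}| }[/math] increases.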
Confidence Interval on New Observations
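For the data in Table 4.1, assume that a new value of the yield is observed after the regression model is fit to the data. This new observation is independent of the observations used to obtain the regression model. If [math]\displaystyle{ x_p }[/math] is the level of the temperature at which the new observation was taken, then the estimate for this new value based on the fitted regression model is:
- [math]\displaystyle{ \widehat{y}_p=\widehat{\beta}_0+\widehat{\beta}_1 x_p }[/math]
If a confidence interval needs to be obtained on [math]\displaystyle{ \widehat{y}_p }[/math], then this interval should include both the error from the fitted model and the error associated with future observations. This is because [math]\displaystyle{ \widehat{y}_p }[/math] represents the estimate for a value of [math]\displaystyle{ Y }[/math] that was not used to obtain the regression model. The confidence interval on [math]\displaystyle{ \widehat{y}_p }[/math] is referred to as the prediction interval. A 100([math]\displaystyle{ 1-\alpha }[/math]) percent prediction interval on a new observation is obtained as follows:
- [math]\displaystyle{ \widehat{y}_p\pm t_{\alpha/2,n-2}\sqrt{\widehat{\sigma}^2 [ 1+\frac{1}{n}+\frac{(x_p-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2} ]} }[/math]
(21)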
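The difference between the confidence interval on a fitted value and the prediction interval on a new observation is the extra 1 inside the square root. The sketch below makes this concrete. The data, the function name `interval` and its `new_observation` flag are hypothetical, and the critical value 3.182 is assumed to come from a standard t table (cumulative probability 0.975, 3 degrees of freedom).

```python
import math


def interval(x, y, x_new, t_crit, new_observation=False):
    """Return (lower, upper) limits at x_new.

    new_observation=False gives the confidence interval on the fitted value;
    new_observation=True gives the prediction interval on a new observation.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    # Error mean square, the estimate of sigma^2
    mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    extra = 1.0 if new_observation else 0.0
    half_width = t_crit * math.sqrt(
        mse * (extra + 1.0 / n + (x_new - x_bar) ** 2 / s_xx)
    )
    centre = b0 + b1 * x_new
    return centre - half_width, centre + half_width


# Hypothetical data, for illustration only (n = 5).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
t_crit = 3.182  # assumed table value: t with 3 degrees of freedom, 95% two-sided

ci_mid = interval(x, y, 3.0, t_crit)                         # at the mean of x
ci_edge = interval(x, y, 5.0, t_crit)                        # away from the mean
pi_mid = interval(x, y, 3.0, t_crit, new_observation=True)

print("CI at x = 3:", ci_mid)
print("CI at x = 5:", ci_edge)
print("PI at x = 3:", pi_mid)
```

The confidence interval is narrowest at the mean of the x values, and the prediction interval at the same point is always wider.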
Example 4.3
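To illustrate the calculation of confidence intervals, the 95% confidence interval on the response at [math]\displaystyle{ x=93 }[/math] for the data in Table 4.1 is obtained in this example. A 95% prediction interval is also obtained assuming that a new observation for the yield was made at [math]\displaystyle{ x=91 }[/math].
The fitted value, [math]\displaystyle{ \widehat{y}_{21} }[/math], corresponding to [math]\displaystyle{ x_{21}=93 }[/math] is:
- [math]\displaystyle{ \widehat{y}_{21}=17.0016+1.9952 \times 93=202.6 }[/math]
The 95% confidence interval on the fitted value, [math]\displaystyle{ \widehat{y}_{21} }[/math], is:
The 95% limits on [math]\displaystyle{ \widehat{y}_{21} }[/math] are 199.95 and 205.2, respectively.
The estimated value based on the fitted regression model for the new observation at [math]\displaystyle{ x_p=91 }[/math] is:
- [math]\displaystyle{ \widehat{y}_p=17.0016+1.9952 \times 91=198.6 }[/math]
The 95% prediction interval on [math]\displaystyle{ \widehat{y}_p }[/math] is:
The 95% limits on [math]\displaystyle{ \widehat{y}_p }[/math] are 189.9 and 207.2, respectively. In DOE++, confidence and prediction intervals are available using the Prediction icon in the Control Panel. The prediction interval values calculated in this example are shown in Figure 4.11 as Low PI and High PI respectively. The columns labeled Mean Predicted and Standard Error represent the values of [math]\displaystyle{ \widehat{y}_p }[/math] and the standard error used in the calculations.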
Figure 4.11: Calculation of prediction intervals in DOE++.
Measures of Model Adequacy
It is important to analyze the regression model before inferences based on the model are undertaken. The following sections present some techniques that can be used to check the appropriateness of the model for the given data. These techniques help to determine if any of the model assumptions have been violated.
Coefficient of Determination (R2)
The coefficient of determination is a measure of the amount of variability in the data accounted for by the regression model. As mentioned previously, the total variability of the data is measured by the total sum of squares, [math]\displaystyle{ SS_T }[/math]. The amount of this variability explained by the regression model is the regression sum of squares, [math]\displaystyle{ SS_R }[/math]. The coefficient of determination is the ratio of the regression sum of squares to the total sum of squares. (22)
- [math]\displaystyle{ R^2=\frac{SS_R}{SS_T} }[/math]
[math]\displaystyle{ R^2 }[/math] can take on values between 0 and 1 since [math]\displaystyle{ 0\le SS_R\le SS_T }[/math]. For the yield data example, [math]\displaystyle{ R^2 }[/math] can be calculated as:
Therefore, 98% of the variability in the yield data is explained by the regression model, indicating a very good fit of the model. It may appear that larger values of [math]\displaystyle{ R^2 }[/math] indicate a better fitting regression model. However, [math]\displaystyle{ R^2 }[/math] should be used cautiously as this is not always the case. The value of [math]\displaystyle{ R^2 }[/math] increases as more terms are added to the model, even if the new term does not contribute significantly to the model. Therefore, an increase in the value of [math]\displaystyle{ R^2 }[/math] cannot be taken as a sign to conclude that the new model is superior to the older model. Adding a new term may make the regression model worse if the error mean square, [math]\displaystyle{ MS_E }[/math], for the new model is larger than the [math]\displaystyle{ MS_E }[/math] of the older model, even though the new model will show an increased value of [math]\displaystyle{ R^2 }[/math]. In the results obtained from DOE++, [math]\displaystyle{ R^2 }[/math] is displayed as R-sq under the ANOVA table (as shown in Figure 4.12, which displays the complete analysis sheet for the data in Table 4.1).
The other values displayed with [math]\displaystyle{ R^2 }[/math] are S, R-sq(adj), PRESS and R-sq(pred). These values measure different aspects of the adequacy of the regression model. For example, the value of S is the square root of the error mean square, [math]\displaystyle{ MS_E }[/math], and represents the "standard error of the model." A lower value of S indicates a better fitting model. The values of S, R-sq and R-sq(adj) indicate how well the model fits the observed data. The values of PRESS and R-sq(pred) are indicators of how well the regression model predicts new observations. R-sq(adj), PRESS and R-sq(pred) are explained in Chapter 5, Multiple Linear Regression Analysis.
Figure 4.12: Complete analysis for the data in Table 4.1.
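The coefficient of determination and the value S described above can be computed from the sums of squares in a few lines. This is a sketch on hypothetical data with a hypothetical function name, `r_squared_and_s`; it is not how DOE++ produces its output.

```python
import math


def r_squared_and_s(x, y):
    """Return (R^2, S), where R^2 = SS_R / SS_T and S = sqrt(MS_E)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    ss_t = sum((yi - y_bar) ** 2 for yi in y)
    ss_r = sum((fi - y_bar) ** 2 for fi in fitted)
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return ss_r / ss_t, math.sqrt(ss_e / (n - 2))


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]

r2, s = r_squared_and_s(x, y)
print("R-sq =", r2, "S =", s)
```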
Residual Analysis
In the simple linear regression model the true error terms, [math]\displaystyle{ \epsilon_i }[/math], are never known. The residuals, [math]\displaystyle{ e_i }[/math], may be thought of as the observed error terms that are similar to the true error terms. Since the true error terms, [math]\displaystyle{ \epsilon_i }[/math], are assumed to be normally distributed with a mean of zero and a variance of [math]\displaystyle{ \sigma^2 }[/math], in a good model the observed error terms (i.e. the residuals, [math]\displaystyle{ e_i }[/math]) should also follow these assumptions. [Note] Thus the residuals in the simple linear regression should be normally distributed with a mean of zero and a constant variance of [math]\displaystyle{ \sigma^2 }[/math]. Residuals are usually plotted against the fitted values, [math]\displaystyle{ \widehat{y}_i }[/math], against the predictor variable values, [math]\displaystyle{ x_i }[/math], and against time or run-order sequence, in addition to the normal probability plot. Plots of residuals are used to check for the following:
1. Residuals follow the normal distribution.
2. Residuals have a constant variance.
3. Regression function is linear.
4. A pattern does not exist when residuals are plotted in a time or run-order sequence.
5. There are no outliers.
Examples of residual plots are shown in Figure 4.13. The plot of Figure 4.13 (a) is a satisfactory plot with the residuals falling in a horizontal band with no systematic pattern. Such a plot indicates an appropriate regression model. The plot of Figure 4.13 (b) shows residuals falling in a funnel shape. Such a plot indicates an increase in the variance of the residuals, so the assumption of constant variance is violated here. Transformation on [math]\displaystyle{ Y }[/math] may be helpful in this case (see Chapter 4, Transformations). If the residuals follow the pattern of Figure 4.13 (c) or (d) then this is an indication that the linear regression model is not adequate. Addition of higher order terms to the regression model or transformation on [math]\displaystyle{ x }[/math] or [math]\displaystyle{ Y }[/math] may be required in such cases. A plot of residuals may also show a pattern as seen in Figure 4.13 (e) indicating that the residuals increase (or decrease) as the run order sequence or time progresses. This may be due to factors such as operator-learning or instrument-creep and should be investigated further.
Figure 4.13: Possible residual plots (against fitted values, time or run-order) that can be obtained from simple linear regression analysis.
Example 4.4
Residual plots for the data of Table 4.1 are shown in Figures 4.14 to 4.16. Figure 4.14 is the normal probability plot. It can be observed that the residuals follow the normal distribution and the assumption of normality is valid here. In Figure 4.15 the residuals are plotted against the fitted values, [math]\displaystyle{ \widehat{y}_i }[/math], and in Figure 4.16 the residuals are plotted against the run order. Both of these plots show that the 21st observation seems to be an outlier. Further investigations are needed to study the cause of this outlier.
Figure 4.14: Normal probability plot of residuals for the data in Table 4.1.
Figure 4.15: Plot of residuals against fitted values for the data in Table 4.1.
Figure 4.16: Plot of residuals against run order for the data in Table 4.1.
Lack-of-Fit Test
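As mentioned in Chapter 4, Analysis of Variance Approach to Test the Significance of Regression, a perfect regression model results in a fitted line that passes exactly through all observed data points. This perfect model will give us a zero error sum of squares ([math]\displaystyle{ SS_E=0 }[/math]). Thus, no error exists for the perfect model. However, if you record the response values for the same values of [math]\displaystyle{ x_i }[/math] for a second time, in conditions maintained as strictly identical as possible to the first time, observations from the second time will not all fall along the perfect model. The deviations in observations recorded for the second time constitute the "purely" random variation or noise. The sum of squares due to pure error (abbreviated [math]\displaystyle{ SS_{PE} }[/math]) quantifies these variations. [math]\displaystyle{ SS_{PE} }[/math] is calculated by taking repeated observations at some or all values of [math]\displaystyle{ x_i }[/math] and adding up the square of deviations at each level of [math]\displaystyle{ x }[/math] using the respective repeated observations at that [math]\displaystyle{ x_i }[/math] value. Assume that there are [math]\displaystyle{ n }[/math] levels of [math]\displaystyle{ x }[/math] and [math]\displaystyle{ m_i }[/math] repeated observations are taken at each [math]\displaystyle{ i }[/math]th level. The data is collected as shown next:
[math]\displaystyle{ \begin{array}{ll} y_{11},\ y_{12},\ \ldots,\ y_{1m_1} & \text{repeated observations at } x_1 \\ y_{21},\ y_{22},\ \ldots,\ y_{2m_2} & \text{repeated observations at } x_2 \\ \quad\vdots & \\ y_{i1},\ y_{i2},\ \ldots,\ y_{im_i} & \text{repeated observations at } x_i \\ \quad\vdots & \\ y_{n1},\ y_{n2},\ \ldots,\ y_{nm_n} & \text{repeated observations at } x_n \end{array} }[/math]
The sum of squares of the deviations from the mean of the observations at the [math]\displaystyle{ i }[/math]th level of [math]\displaystyle{ x }[/math], [math]\displaystyle{ x_i }[/math], can be calculated as:
[math]\displaystyle{ \sum_{j=1}^{m_i} (y_{ij}-\bar{y}_i)^2 }[/math]
where [math]\displaystyle{ \bar{y}_i }[/math] is the mean of the [math]\displaystyle{ m_i }[/math] repeated observations corresponding to [math]\displaystyle{ x_i }[/math] ([math]\displaystyle{ \bar{y}_i=(1/m_i)\sum_{j=1}^{m_i} y_{ij} }[/math]). The number of degrees of freedom for these deviations is ([math]\displaystyle{ m_i-1 }[/math]) as there are [math]\displaystyle{ m_i }[/math] observations at the [math]\displaystyle{ i }[/math]th level of [math]\displaystyle{ x }[/math] but one degree of freedom is lost in calculating the mean, [math]\displaystyle{ \bar{y}_i }[/math].
The total sum of square deviations (or [math]\displaystyle{ SS_{PE} }[/math]) for all levels of [math]\displaystyle{ x }[/math] can be obtained by summing the deviations for all [math]\displaystyle{ x_i }[/math] as shown next:
[math]\displaystyle{ SS_{PE}=\sum_{i=1}^n \sum_{j=1}^{m_i} (y_{ij}-\bar{y}_i)^2 }[/math]
(23)
The total number of degrees of freedom associated with [math]\displaystyle{ SS_{PE} }[/math] is:
[math]\displaystyle{ \sum_{i=1}^n (m_i-1)=\sum_{i=1}^n m_i-n }[/math]
If all [math]\displaystyle{ m_i=m }[/math] (i.e. [math]\displaystyle{ m }[/math] repeated observations are taken at all levels of [math]\displaystyle{ x }[/math]), then [math]\displaystyle{ \sum_{i=1}^n m_i=nm }[/math] and the degrees of freedom associated with [math]\displaystyle{ SS_{PE} }[/math] are: [Note]
[math]\displaystyle{ nm-n }[/math]
The corresponding mean square in this case will be:
[math]\displaystyle{ MS_{PE}=\frac{SS_{PE}}{nm-n} }[/math]
(24)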
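As a small illustration of Eqns. (23) and (24), the pure error sum of squares and its degrees of freedom can be accumulated level by level. This is a minimal Python sketch with made-up replicate data; the function name `pure_error` and the dictionary layout (each level of the predictor mapped to its repeated responses) are assumptions made for the example.

```python
from statistics import mean

def pure_error(groups):
    """Return (SS_PE, degrees of freedom) for replicated observations.

    groups maps each level x_i to the list of responses y_i1, ..., y_im_i.
    """
    ss_pe = 0.0
    dof = 0
    for ys in groups.values():
        y_bar = mean(ys)                            # mean at this level
        ss_pe += sum((y - y_bar) ** 2 for y in ys)  # squared deviations
        dof += len(ys) - 1                          # one dof lost to the mean
    return ss_pe, dof

# Hypothetical data: n = 3 levels, m = 2 repeated observations per level.
groups = {10: [20.0, 22.0], 20: [41.0, 39.0], 30: [58.0, 62.0]}

ss_pe, dof_pe = pure_error(groups)
ms_pe = ss_pe / dof_pe
print(ss_pe, dof_pe, ms_pe)
```

With these numbers the three levels contribute 2, 2 and 8 to the sum of squares, and the degrees of freedom equal nm - n = 6 - 3 = 3.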
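When repeated observations are used for a perfect regression model, the sum of squares due to pure error, [math]\displaystyle{ SS_{PE} }[/math], is also considered as the error sum of squares, [math]\displaystyle{ SS_E }[/math]. For the case when repeated observations are used with imperfect regression models, there are two components of the error sum of squares, [math]\displaystyle{ SS_E }[/math]. One portion is the pure error due to the repeated observations. The other portion is the error that represents variation not captured because of the imperfect model. The second portion is termed the sum of squares due to lack-of-fit (abbreviated [math]\displaystyle{ SS_{LOF} }[/math]) to point to the deficiency in fit due to departure from the perfect-fit model. Thus, for an imperfect regression model:
[math]\displaystyle{ SS_E=SS_{PE}+SS_{LOF} }[/math]
(25)
Knowing [math]\displaystyle{ SS_E }[/math] and [math]\displaystyle{ SS_{PE} }[/math], the previous equation can be used to obtain [math]\displaystyle{ SS_{LOF} }[/math]:
[math]\displaystyle{ SS_{LOF}=SS_E-SS_{PE} }[/math]
The degrees of freedom associated with [math]\displaystyle{ SS_{LOF} }[/math] can be obtained in a similar manner using subtraction. For the case when [math]\displaystyle{ m }[/math] repeated observations are taken at all levels of [math]\displaystyle{ x }[/math], the number of degrees of freedom associated with [math]\displaystyle{ SS_{PE} }[/math] is:
[math]\displaystyle{ dof(SS_{PE})=nm-n }[/math]
Since there are [math]\displaystyle{ nm }[/math] total observations, the number of degrees of freedom associated with [math]\displaystyle{ SS_E }[/math] is:
[math]\displaystyle{ dof(SS_E)=nm-2 }[/math]
Therefore, the number of degrees of freedom associated with [math]\displaystyle{ SS_{LOF} }[/math] is:
[math]\displaystyle{ dof(SS_{LOF})=(nm-2)-(nm-n)=n-2 }[/math]
The corresponding mean square, [math]\displaystyle{ MS_{LOF} }[/math], can now be obtained as:
[math]\displaystyle{ MS_{LOF}=\frac{SS_{LOF}}{n-2} }[/math]
(26)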
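The magnitude of [math]\displaystyle{ SS_{LOF} }[/math] or [math]\displaystyle{ MS_{LOF} }[/math] will provide an indication of how far the regression model is from the perfect model. An [math]\displaystyle{ F }[/math] test exists to examine the lack-of-fit at a particular significance level. [Note] The quantity [math]\displaystyle{ MS_{LOF}/MS_{PE} }[/math] follows an [math]\displaystyle{ F }[/math] distribution with [math]\displaystyle{ (n-2) }[/math] degrees of freedom in the numerator and [math]\displaystyle{ (nm-n) }[/math] degrees of freedom in the denominator when all [math]\displaystyle{ m_i }[/math] equal [math]\displaystyle{ m }[/math]. The test statistic for the lack-of-fit test is:
[math]\displaystyle{ F_0=\frac{MS_{LOF}}{MS_{PE}} }[/math]
If the critical value [math]\displaystyle{ f_{\alpha,n-2,nm-n} }[/math] is such that:
[math]\displaystyle{ F_0 > f_{\alpha,n-2,nm-n} }[/math]
it will lead to the rejection of the hypothesis that the model adequately fits the data.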
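The complete lack-of-fit calculation can be sketched in a few lines. The following Python sketch is illustrative only: the data are made up (they are not the yield data of this chapter), the function name `lack_of_fit` is hypothetical, and the critical value is assumed to be read from a standard table of the F distribution rather than computed.

```python
from statistics import mean

def lack_of_fit(groups):
    """Lack-of-fit quantities for replicated data.

    groups maps each level x_i to its list of repeated responses.
    """
    xs = [x for x, ys in groups.items() for _ in ys]
    ys_all = [y for ys in groups.values() for y in ys]

    # Least squares fit of y = b0 + b1*x to all observations.
    x_bar, y_bar = mean(xs), mean(ys_all)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys_all))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Error sum of squares and its pure error component, Eqn. (23).
    ss_e = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys_all))
    ss_pe = 0.0
    for ys in groups.values():
        level_mean = mean(ys)
        ss_pe += sum((y - level_mean) ** 2 for y in ys)

    ss_lof = ss_e - ss_pe               # from Eqn. (25)
    dof_lof = len(groups) - 2           # n - 2
    dof_pe = len(ys_all) - len(groups)  # total observations minus n
    f0 = (ss_lof / dof_lof) / (ss_pe / dof_pe)
    return {"SS_E": ss_e, "SS_PE": ss_pe, "SS_LOF": ss_lof,
            "dof_LOF": dof_lof, "dof_PE": dof_pe, "F0": f0}

# Hypothetical data: n = 4 levels, m = 2 observations per level, with the
# level means following a curve rather than a straight line.
groups = {1: [0.0, 2.0], 2: [3.0, 5.0], 3: [8.0, 10.0], 4: [15.0, 17.0]}
result = lack_of_fit(groups)

f_crit = 6.94  # f_{0.05, 2, 4}, read from a standard F table
reject = result["F0"] > f_crit
print(result, "reject adequacy:", reject)
```

For this data set the sketch gives an error sum of squares of 16 that splits evenly into 8 for pure error and 8 for lack-of-fit, so the test statistic is (8/2)/(8/4) = 2, which is below the tabulated critical value and the hypothesis of an adequate fit is not rejected.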
Example 4.5
Assume that a second set of observations is taken for the yield data of Table 4.1. The resulting observations are recorded in Table 4.2. To conduct a lack-of-fit test on this data, the statistic [math]\displaystyle{ F_0=MS_{LOF}/MS_{PE} }[/math] can be calculated as shown next.
Table 4.2: Yield data from the first and second observation sets for the chemical process example in Chapter 4.1.
Calculation of Least Square Estimates
The parameters of the fitted regression model can be obtained using Eqns. (3) and (2) as:
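Knowing [math]\displaystyle{ \widehat{\beta}_1 }[/math] and [math]\displaystyle{ \widehat{\beta}_0 }[/math], the fitted values, [math]\displaystyle{ \hat{y}_i }[/math], can be calculated.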
Calculation of the Sum of Squares
Using the fitted values, the sum of squares can be obtained as follows:
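Calculation of [math]\displaystyle{ MS_{LOF} }[/math]
The error sum of squares, [math]\displaystyle{ SS_E }[/math], can now be split into the sum of squares due to pure error, [math]\displaystyle{ SS_{PE} }[/math], and the sum of squares due to lack-of-fit, [math]\displaystyle{ SS_{LOF} }[/math]. [math]\displaystyle{ SS_{PE} }[/math] can be calculated as follows, considering that in this example [math]\displaystyle{ n }[/math] is the number of temperature levels in Table 4.2 and [math]\displaystyle{ m=2 }[/math], since two sets of observations were recorded:
The number of degrees of freedom associated with [math]\displaystyle{ SS_{PE} }[/math] is:
The corresponding mean square, [math]\displaystyle{ MS_{PE} }[/math], can now be obtained as:
[math]\displaystyle{ SS_{LOF} }[/math] can be obtained by subtraction from [math]\displaystyle{ SS_E }[/math] as:
Similarly, the number of degrees of freedom associated with [math]\displaystyle{ SS_{LOF} }[/math] is:
The lack-of-fit mean square is: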
Calculation of the Test Statistic
The test statistic for the lack-of-fit test can now be calculated as:
The critical value for this test is:
Since [math]\displaystyle{ F_0 }[/math] is less than the critical value, we fail to reject the hypothesis that the model adequately fits the data. The [math]\displaystyle{ p }[/math] value for this case is:
Therefore, at a significance level of 0.05 we conclude that the fitted simple linear regression model is adequate for the observed data. Table 4.3 presents a summary of the ANOVA calculations for the lack-of-fit test.
Table 4.3: ANOVA table for the lack-of-fit test of the yield data example.
Transformations
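The linear regression model may not be directly applicable to certain data. Non-linearity may be detected from scatter plots or may be known through the underlying theory of the product or process or from past experience. Transformations on either the predictor variable, [math]\displaystyle{ x }[/math], or the response variable, [math]\displaystyle{ Y }[/math], may often be sufficient to make the linear regression model appropriate for the transformed data.
If it is known that the data follows the logarithmic distribution, then a logarithmic transformation on [math]\displaystyle{ Y }[/math] (i.e. [math]\displaystyle{ Y^*=\log (Y) }[/math]) might be useful. For data following the Poisson distribution, a square root transformation ([math]\displaystyle{ Y^*=\sqrt{Y} }[/math]) is generally applicable.
Transformations on [math]\displaystyle{ Y }[/math] may also be applied based on the type of scatter plot obtained from the data. Figure 4.17 shows a few such examples. For the scatter plot of Figure 4.17 (a), a square root transformation ([math]\displaystyle{ Y^*=\sqrt{Y} }[/math]) is applicable, while for Figure 4.17 (b), a logarithmic transformation (i.e. [math]\displaystyle{ Y^*=\log (Y) }[/math]) may be applied. For Figure 4.17 (c), the reciprocal transformation ([math]\displaystyle{ Y^*=1/Y }[/math]) is applicable. At times it may be helpful to introduce a constant into the transformation of [math]\displaystyle{ Y }[/math]. For example, if [math]\displaystyle{ Y }[/math] is negative and the logarithmic transformation on [math]\displaystyle{ Y }[/math] seems applicable, a suitable constant, [math]\displaystyle{ k }[/math], may be chosen to make all observed [math]\displaystyle{ Y }[/math] positive. Thus the transformation in this case would be [math]\displaystyle{ Y^*=\log (k+Y) }[/math].
Figure 4.17: Transformations on [math]\displaystyle{ Y }[/math] for a few possible scatter plots. Plot (a) may require [math]\displaystyle{ Y^*=\sqrt{Y} }[/math], (b) may require [math]\displaystyle{ Y^*=\log (Y) }[/math] and (c) may require [math]\displaystyle{ Y^*=1/Y }[/math].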
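The transformations discussed above, including the use of a constant to keep the argument of the logarithm positive, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names are hypothetical, the natural logarithm is assumed (the text does not fix the base), and the rule used to pick the constant (shift so that the smallest observation maps to 1) is just one convenient choice.

```python
import math

def shift_for_log(y, margin=1.0):
    """Pick a constant k so that every k + y is at least `margin` (> 0)."""
    smallest = min(y)
    return margin - smallest if smallest <= 0 else 0.0

def transform_response(y, kind, k=0.0):
    """Apply one of the response transformations to a list of values."""
    if kind == "sqrt":
        return [math.sqrt(v) for v in y]
    if kind == "log":
        return [math.log(k + v) for v in y]  # natural log assumed
    if kind == "reciprocal":
        return [1.0 / v for v in y]
    raise ValueError(f"unknown transformation: {kind}")

# Hypothetical responses that include non-positive values.
y = [-3.0, 0.0, 5.0]
k = shift_for_log(y)                      # shifts the data to [1, 4, 9]
y_star = transform_response(y, "log", k)  # log(1), log(4), log(9)
print(k, y_star)
```

The transformed values in `y_star` would then be used in place of the original responses when fitting the simple linear regression model.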
The Box-Cox method may also be used to automatically identify a suitable power transformation for the data based on the relation:
[math]\displaystyle{ Y^*=Y^{\lambda} }[/math]
Here the parameter [math]\displaystyle{ \lambda }[/math] is determined using the given data such that [math]\displaystyle{ SS_E }[/math] is minimized (details on this method are presented in Chapter 6).
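A simple way to picture the search for the power parameter is a grid search over candidate values. The Python sketch below is an illustration and not the procedure used by DOE++: it assumes the commonly used normalized form of the Box-Cox transformation (scaled by the geometric mean of the responses so that the error sums of squares for different powers are comparable, with the logarithm used when the power is zero), and the data, grid and function names are hypothetical.

```python
import math
from statistics import mean

def sse_of_line(x, y):
    """Error sum of squares of the least squares line fitted to (x, y)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

def box_cox_transform(y, lam):
    """Normalized power transformation (assumed form); requires y > 0."""
    gm = math.exp(mean([math.log(v) for v in y]))  # geometric mean
    if lam == 0:
        return [gm * math.log(v) for v in y]
    scale = lam * gm ** (lam - 1)
    return [(v ** lam - 1) / scale for v in y]

def best_lambda(x, y, lambdas):
    """Return the candidate power giving the smallest SS_E."""
    return min(lambdas, key=lambda lam: sse_of_line(x, box_cox_transform(y, lam)))

# Hypothetical data in which the square root of the response is exactly
# linear in x, so the search is expected to pick a power of 0.5.
x = [float(i) for i in range(1, 11)]
y = [(2.0 + 0.5 * xi) ** 2 for xi in x]

lambdas = [i / 4 for i in range(-8, 9)]  # -2.0, -1.75, ..., 2.0
lam_hat = best_lambda(x, y, lambdas)
print("selected power:", lam_hat)
```

In practice the selected power is usually rounded to a convenient value (such as 0.5 for a square root or 0 for a logarithm) before the transformed model is fitted.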