Simple Linear Regression Analysis
Latest revision as of 18:47, 15 September 2023
Regression analysis is a statistical technique that attempts to explore and model the relationship between two or more variables. For example, an analyst may want to know if there is a relationship between road accidents and the age of the driver. Regression analysis forms an important part of the statistical analysis of the data obtained from designed experiments and is discussed briefly in this chapter. Every experiment analyzed in a Weibull++ DOE folio includes regression results for each of the responses. These results, along with the results from the analysis of variance (explained in the One Factor Designs and General Full Factorial Designs chapters), provide information that is useful to identify significant factors in an experiment and explore the nature of the relationship between these factors and the response. Regression analysis forms the basis for all Weibull++ DOE folio calculations related to the sum of squares used in the analysis of variance. The reason for this is explained in Appendix B. Additionally, DOE folios also include a regression tool to see if two or more variables are related, and to explore the nature of the relationship between them.
This chapter discusses simple linear regression analysis while a subsequent chapter focuses on multiple linear regression analysis.
Simple Linear Regression Analysis
A linear regression model attempts to explain the relationship between two or more variables using a straight line. Consider the data obtained from a chemical process where the yield of the process is thought to be related to the reaction temperature (see the table below).
[Image: observed yield data for the chemical process.]
This data can be entered in the DOE folio as shown in the following figure:
[[Image:doe4_1.png|center|530px|Data entry in the DOE folio for the observations.|link=]]
A scatter plot of the observed data can be obtained as shown in the following figure. In the scatter plot, the yield, <math>y\,\!</math>, is plotted against the reaction temperature, <math>x\,\!</math>.

[Image: scatter plot of yield against reaction temperature.]

It is clear that no line can be found to pass through all points of the plot. Thus no functional relation exists between the two variables, <math>x\,\!</math> and <math>y\,\!</math>. However, the scatter plot does suggest that, on average, <math>y\,\!</math> tends to increase linearly with <math>x\,\!</math>. Such a statistical relation can be represented as:

::<math>Y={{\beta }_{0}}+{{\beta }_{1}}x+\epsilon \,\!</math>

The above equation is the linear regression model that can be used to explain the relation between <math>x\,\!</math> and <math>Y\,\!</math> seen on the scatter plot. In this model, the mean value of <math>Y\,\!</math> at a given value of <math>x\,\!</math> (abbreviated <math>E(Y)\,\!</math>) is assumed to follow the straight line:

::<math>E(Y)={{\beta }_{0}}+{{\beta }_{1}}x\,\!</math>

The actual values of <math>Y\,\!</math> are assumed to be the sum of the mean value, <math>E(Y)\,\!</math>, and a random error term, <math>\epsilon \,\!</math>:

::<math>Y=E(Y)+\epsilon ={{\beta }_{0}}+{{\beta }_{1}}x+\epsilon \,\!</math>

The regression model here is called a simple linear regression model because there is just one independent variable, <math>x\,\!</math>, in the model. In regression models, the independent variables are also referred to as regressors or predictor variables, and the dependent variable, <math>Y\,\!</math>, is referred to as the response. The coefficients <math>{{\beta }_{0}}\,\!</math> and <math>{{\beta }_{1}}\,\!</math> are called the regression coefficients: <math>{{\beta }_{0}}\,\!</math> is the intercept and <math>{{\beta }_{1}}\,\!</math> is the slope of the regression line.

The random error term, <math>\epsilon \,\!</math>, is assumed to be normally distributed with a mean of zero and a variance of <math>{{\sigma }^{2}}\,\!</math>. Consequently, at any given value of <math>x\,\!</math>, the response <math>Y\,\!</math> is normally distributed with mean <math>{{\beta }_{0}}+{{\beta }_{1}}x\,\!</math> and variance <math>{{\sigma }^{2}}\,\!</math>, as shown in the following figure.
[[Image:doe4.3.png|center|The normal distribution of <math>Y\,\!</math> for two values of <math>x\,\!</math>. Also shown is the true regression line and the values of the random error term, <math>\epsilon \,\!</math>, corresponding to the two <math>x\,\!</math> values. The true regression line and <math>\epsilon \,\!</math> are usually not known.|link=]]
Fitted Regression Line
The true regression line is usually not known. However, the regression line can be estimated by estimating the coefficients <math>{{\beta }_{0}}\,\!</math> and <math>{{\beta }_{1}}\,\!</math> from an observed data set. The estimates, <math>{{\hat{\beta }}_{0}}\,\!</math> and <math>{{\hat{\beta }}_{1}}\,\!</math>, are obtained using the method of least squares, which minimizes the sum of the squared vertical deviations of the observations from the fitted line:

::<math>{{\hat{\beta }}_{1}}=\frac{\sum_{i=1}^{n}{{x}_{i}}{{y}_{i}}-\frac{\left( \sum_{i=1}^{n}{{x}_{i}} \right)\left( \sum_{i=1}^{n}{{y}_{i}} \right)}{n}}{\sum_{i=1}^{n}x_{i}^{2}-\frac{{{\left( \sum_{i=1}^{n}{{x}_{i}} \right)}^{2}}}{n}}\,\!</math>

::<math>{{\hat{\beta }}_{0}}=\bar{y}-{{\hat{\beta }}_{1}}\bar{x}\,\!</math>

where <math>\bar{x}\,\!</math> and <math>\bar{y}\,\!</math> are the means of the observed values of the predictor variable and the response, respectively:

::<math>\bar{x}=\frac{1}{n}\sum_{i=1}^{n}{{x}_{i}}\text{ and }\bar{y}=\frac{1}{n}\sum_{i=1}^{n}{{y}_{i}}\,\!</math>

Once <math>{{\hat{\beta }}_{0}}\,\!</math> and <math>{{\hat{\beta }}_{1}}\,\!</math> are known, the fitted regression line can be written as:

::<math>\hat{y}={{\hat{\beta }}_{0}}+{{\hat{\beta }}_{1}}x\,\!</math>

where <math>\hat{y}\,\!</math> is the fitted value corresponding to any value of the predictor variable, <math>x\,\!</math>. The fitted value is an estimate of the mean value of <math>Y\,\!</math> at that value of <math>x\,\!</math>.
Calculation of the Fitted Line Using Least Square Estimates
The least square estimates of the regression coefficients can be obtained for the data in the preceding table by substituting the observed values into the two estimating equations given above. Knowing the estimates <math>{{\hat{\beta }}_{0}}\,\!</math> and <math>{{\hat{\beta }}_{1}}\,\!</math>, the fitted regression line is <math>\hat{y}={{\hat{\beta }}_{0}}+{{\hat{\beta }}_{1}}x\,\!</math>.
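The least square calculation above can be sketched in a few lines of Python. The data below is made up for illustration (the chapter's actual yield table is not reproduced here), so the resulting coefficients are not those of the chapter's example:

```python
# Least square estimates for simple linear regression,
# beta1-hat = Sxy / Sxx and beta0-hat = y_bar - beta1-hat * x_bar.
def least_squares(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Corrected sums of cross products and squares.
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    b1 = sxy / sxx           # slope estimate
    b0 = y_bar - b1 * x_bar  # intercept estimate
    return b0, b1

# Hypothetical temperature/yield observations (not the chapter's data).
x = [50, 60, 70, 80, 90]
y = [122, 143, 159, 183, 201]
b0, b1 = least_squares(x, y)  # fitted line: y-hat = b0 + b1 * x
```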
This line is shown in the figure below.
[Image: fitted regression line for the observed data.]
Once the fitted regression line is known, the fitted value of the response corresponding to any observed data point, <math>{{x}_{i}}\,\!</math>, can be calculated as <math>{{\hat{y}}_{i}}={{\hat{\beta }}_{0}}+{{\hat{\beta }}_{1}}{{x}_{i}}\,\!</math>. The observed response at this point is <math>{{y}_{i}}\,\!</math>. The difference between the observed value and the fitted value is called the residual, <math>{{e}_{i}}\,\!</math>:

::<math>{{e}_{i}}={{y}_{i}}-{{\hat{y}}_{i}}\,\!</math>
In DOE folios, fitted values and residuals can be calculated. The values are shown in the figure below.
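The same fitted values and residuals can be computed by hand. This sketch reuses the hypothetical data and the coefficients it produces (not the chapter's example):

```python
# Fitted values y-hat_i = b0 + b1 * x_i and residuals e_i = y_i - y-hat_i.
# Coefficients correspond to the illustrative data x = [50..90] used earlier.
b0, b1 = 23.0, 1.98
x = [50, 60, 70, 80, 90]
y = [122, 143, 159, 183, 201]

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
# With an intercept in the model, least squares residuals sum to
# (numerically) zero -- a quick sanity check on the fit.
```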
[[Image:doe4_5.png|center|880px|Fitted values and residuals for the data.|link=]]
Hypothesis Tests in Simple Linear Regression
The following sections discuss hypothesis tests on the regression coefficients in simple linear regression. These tests can be carried out if it can be assumed that the random error term, <math>\epsilon \,\!</math>, is normally and independently distributed with a mean of zero and variance of <math>{{\sigma }^{2}}\,\!</math>.
t Tests
The <math>t\,\!</math> test is used to check the significance of individual regression coefficients. The hypothesis statements to test the significance of the slope coefficient, <math>{{\beta }_{1}}\,\!</math>, are:

::<math>{{H}_{0}}:{{\beta }_{1}}=0\,\!</math>
::<math>{{H}_{1}}:{{\beta }_{1}}\ne 0\,\!</math>

The test statistic used for this test is:

::<math>{{T}_{0}}=\frac{{{\hat{\beta }}_{1}}}{se({{\hat{\beta }}_{1}})}\,\!</math>

where <math>{{\hat{\beta }}_{1}}\,\!</math> is the least square estimate of <math>{{\beta }_{1}}\,\!</math> and <math>se({{\hat{\beta }}_{1}})\,\!</math> is its standard error.

The test statistic, <math>{{T}_{0}}\,\!</math>, follows a <math>t\,\!</math> distribution with <math>(n-2)\,\!</math> degrees of freedom. The analyst fails to reject the null hypothesis if the test statistic lies in the acceptance region:

::<math>-{{t}_{\alpha /2,n-2}}<{{T}_{0}}<{{t}_{\alpha /2,n-2}}\,\!</math>

where <math>{{t}_{\alpha /2,n-2}}\,\!</math> is the critical value corresponding to the significance level <math>\alpha \,\!</math>.

If the null hypothesis, <math>{{H}_{0}}:{{\beta }_{1}}=0\,\!</math>, fails to be rejected, it indicates that no linear relationship exists between <math>x\,\!</math> and <math>y\,\!</math>, as illustrated in plots (a) and (b) of the following figure. Rejection of <math>{{H}_{0}}\,\!</math> indicates that a linear relationship does exist, as in plots (c) and (d).
[[Image:doe4.6.png|center|Possible scatter plots of <math>y\,\!</math> against <math>x\,\!</math>. Plots (a) and (b) represent cases when <math>{{H}_{0}}:{{\beta }_{1}}=0\,\!</math> is not rejected. Plots (c) and (d) represent cases when <math>{{H}_{0}}:{{\beta }_{1}}=0\,\!</math> is rejected.|link=]]
A similar procedure can be used to test the hypothesis on the intercept, <math>{{\beta }_{0}}\,\!</math>. The test statistic used in this case is:

::<math>{{T}_{0}}=\frac{{{\hat{\beta }}_{0}}}{se({{\hat{\beta }}_{0}})}\,\!</math>

where <math>se({{\hat{\beta }}_{0}})\,\!</math> is the standard error of the intercept estimate, <math>{{\hat{\beta }}_{0}}\,\!</math>.
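The test statistic for the slope can be sketched as follows. The data is the same hypothetical set used in the earlier sketches (not the chapter's yield table), and <math>se({{\hat{\beta }}_{1}})=\sqrt{M{{S}_{E}}/{{S}_{xx}}}\,\!</math> is used for the standard error:

```python
import math

# T0 = beta1-hat / se(beta1-hat) for H0: beta1 = 0.
x = [50, 60, 70, 80, 90]
y = [122, 143, 159, 183, 201]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)            # error mean square, estimates sigma^2
se_b1 = math.sqrt(mse / sxx)   # standard error of the slope estimate
t0 = b1 / se_b1
# Compare |t0| against t_{alpha/2, n-2}; e.g. for alpha = 0.1 and
# n - 2 = 3 degrees of freedom the tabulated value is about 2.353.
```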
Example
The test for the significance of regression for the data in the preceding table is illustrated in this example. The test is carried out using the <math>t\,\!</math> test on the slope coefficient, with the hypothesis being tested as <math>{{H}_{0}}:{{\beta }_{1}}=0\,\!</math>. The standard error of <math>{{\hat{\beta }}_{1}}\,\!</math> is obtained first. Then, the test statistic can be calculated using the following equation:

::<math>{{T}_{0}}=\frac{{{\hat{\beta }}_{1}}}{se({{\hat{\beta }}_{1}})}\,\!</math>

The <math>p\,\!</math> value corresponding to this test statistic is obtained from the <math>t\,\!</math> distribution with <math>(n-2)\,\!</math> degrees of freedom.
Assuming that the desired significance level is 0.1, since the <math>p\,\!</math> value < 0.1, <math>{{H}_{0}}:{{\beta }_{1}}=0\,\!</math> is rejected, indicating that a relation exists between temperature and yield for the data in the preceding table. Using this result along with the scatter plot, it can be concluded that the relationship between temperature and yield is linear.
In Weibull++ DOE folios, information related to the <math>t\,\!</math> test is displayed in the Regression Information table as shown in the following figure. In this table the <math>t\,\!</math> test for <math>{{\beta }_{1}}\,\!</math> is displayed in the row for the term Temperature because <math>{{\beta }_{1}}\,\!</math> is the coefficient that represents the variable temperature in the regression model. The columns labeled Standard Error, T Value and P Value represent the standard error, the test statistic for the test and the <math>p\,\!</math> value for the <math>t\,\!</math> test, respectively. These values have been calculated for <math>{{\beta }_{1}}\,\!</math> in this example. The Coefficient column represents the estimate of regression coefficients. The Effect column represents values obtained by multiplying the coefficients by a factor of 2. This value is useful in the case of two factor experiments and is explained in Two Level Factorial Experiments. Columns Low Confidence and High Confidence represent the limits of the confidence intervals for the regression coefficients and are explained in Confidence Interval on Regression Coefficients.
[[Image:doe4_7.png|center|826px|Regression results for the data.|link=]]
Analysis of Variance Approach to Test the Significance of Regression
The analysis of variance (ANOVA) is another method to test for the significance of regression. As the name implies, this approach uses the variance of the observed data to determine if a regression model can be applied to the observed data. The observed variance is partitioned into components that are then used in the test for significance of regression.
Sum of Squares
The total variance (i.e., the variance of all of the observed data) is estimated using the observed data. As mentioned in Statistical Background, the variance of a population can be estimated using the sample variance, which is calculated using the following relationship:

::<math>{{s}^{2}}=\frac{\sum_{i=1}^{n}{{({{y}_{i}}-\bar{y})}^{2}}}{n-1}\,\!</math>

The quantity in the numerator of the previous equation is called the sum of squares. It is the sum of the squares of the deviations of all the observations, <math>{{y}_{i}}\,\!</math>, from their mean, <math>\bar{y}\,\!</math>. In the context of ANOVA this quantity is called the total sum of squares (abbreviated <math>S{{S}_{T}}\,\!</math>) because it relates to the total variance of the observations. Thus:

::<math>S{{S}_{T}}=\sum_{i=1}^{n}{{({{y}_{i}}-\bar{y})}^{2}}\,\!</math>

The denominator in the relationship of the sample variance is the number of degrees of freedom associated with the sample variance. Therefore, the number of degrees of freedom associated with <math>S{{S}_{T}}\,\!</math>, <math>dof(S{{S}_{T}})\,\!</math>, is <math>(n-1)\,\!</math>.
When you attempt to fit a regression model to the observations, you are trying to explain some of the variation of the observations using this model. If the regression model is such that the resulting fitted regression line passes through all of the observations, then you would have a "perfect" model (see (a) of the figure below). In this case the model would explain all of the variability of the observations. Therefore, the model sum of squares (also referred to as the regression sum of squares and abbreviated <math>S{{S}_{R}}\,\!</math>) would equal the total sum of squares.

For the perfect model, the regression sum of squares, <math>S{{S}_{R}}\,\!</math>, equals the total sum of squares, <math>S{{S}_{T}}\,\!</math>, because all fitted values, <math>{{\hat{y}}_{i}}\,\!</math>, equal the corresponding observations, <math>{{y}_{i}}\,\!</math>. In general, <math>S{{S}_{R}}\,\!</math> is calculated using the deviations of the fitted values from the mean:

::<math>S{{S}_{R}}=\sum_{i=1}^{n}{{({{\hat{y}}_{i}}-\bar{y})}^{2}}\,\!</math>

The number of degrees of freedom associated with <math>S{{S}_{R}}\,\!</math>, <math>dof(S{{S}_{R}})\,\!</math>, is 1, because a single coefficient, <math>{{\hat{\beta }}_{1}}\,\!</math>, is used to explain this variation.
Based on the preceding discussion of ANOVA, a perfect regression model exists when the fitted regression line passes through all observed points. However, this is not usually the case, as seen in (b) of the following figure.

In both of these plots, a number of points do not follow the fitted regression line. This indicates that a part of the total variability of the observed data still remains unexplained. This portion of the total variability, or the total sum of squares, that is not explained by the model is called the residual sum of squares or the error sum of squares (abbreviated <math>S{{S}_{E}}\,\!</math>):

::<math>S{{S}_{E}}=\sum_{i=1}^{n}{{({{y}_{i}}-{{\hat{y}}_{i}})}^{2}}\,\!</math>

The number of degrees of freedom associated with <math>S{{S}_{E}}\,\!</math>, <math>dof(S{{S}_{E}})\,\!</math>, is <math>(n-2)\,\!</math>. The total variability of the observed data (i.e., the total sum of squares, <math>S{{S}_{T}}\,\!</math>) can now be written as:

::<math>S{{S}_{T}}=S{{S}_{R}}+S{{S}_{E}}\,\!</math>

The above equation is also referred to as the analysis of variance identity and can be expanded as follows:

::<math>\sum_{i=1}^{n}{{({{y}_{i}}-\bar{y})}^{2}}=\sum_{i=1}^{n}{{({{\hat{y}}_{i}}-\bar{y})}^{2}}+\sum_{i=1}^{n}{{({{y}_{i}}-{{\hat{y}}_{i}})}^{2}}\,\!</math>
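The analysis of variance identity can be verified numerically from the definitions. The data here is the same hypothetical set used in the earlier sketches, not the chapter's yield table:

```python
# Verify SS_T = SS_R + SS_E directly from the definitions.
x = [50, 60, 70, 80, 90]
y = [122, 143, 159, 183, 201]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

ss_t = sum((yi - y_bar) ** 2 for yi in y)                # total, n-1 dof
ss_r = sum((fi - y_bar) ** 2 for fi in fitted)           # regression, 1 dof
ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error, n-2 dof
# ss_t equals ss_r + ss_e up to floating point rounding.
```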
[[Image:doe4.9.png|center|600px|Scatter plots showing the deviations for the sums of squares used in ANOVA. (a) shows deviations for <math>S{{S}_{T}}\,\!</math>, (b) shows deviations for <math>S{{S}_{R}}\,\!</math>, and (c) shows deviations for <math>S{{S}_{E}}\,\!</math>.|link=]]
Mean Squares
As mentioned previously, mean squares are obtained by dividing the sum of squares by the respective degrees of freedom. For example, the error mean square, <math>M{{S}_{E}}\,\!</math>, is obtained as:

::<math>M{{S}_{E}}=\frac{S{{S}_{E}}}{dof(S{{S}_{E}})}=\frac{S{{S}_{E}}}{n-2}\,\!</math>

The error mean square is an estimate of the variance, <math>{{\sigma }^{2}}\,\!</math>, of the random error term, <math>\epsilon \,\!</math>:

::<math>{{\hat{\sigma }}^{2}}=M{{S}_{E}}\,\!</math>

Similarly, the regression mean square, <math>M{{S}_{R}}\,\!</math>, is obtained by dividing the regression sum of squares by the respective degrees of freedom:

::<math>M{{S}_{R}}=\frac{S{{S}_{R}}}{dof(S{{S}_{R}})}=\frac{S{{S}_{R}}}{1}\,\!</math>
F Test
To test the hypothesis <math>{{H}_{0}}:{{\beta }_{1}}=0\,\!</math>, the statistic used is based on the <math>F\,\!</math> distribution. It can be shown that if <math>{{H}_{0}}\,\!</math> is true, then the statistic:

::<math>{{F}_{0}}=\frac{M{{S}_{R}}}{M{{S}_{E}}}\,\!</math>

follows the <math>F\,\!</math> distribution with 1 degree of freedom in the numerator and <math>(n-2)\,\!</math> degrees of freedom in the denominator. <math>{{H}_{0}}\,\!</math> is rejected if the calculated statistic, <math>{{F}_{0}}\,\!</math>, is such that:

::<math>{{F}_{0}}>{{f}_{\alpha ,1,n-2}}\,\!</math>

where <math>{{f}_{\alpha ,1,n-2}}\,\!</math> is the percentile of the <math>F\,\!</math> distribution corresponding to a cumulative probability of <math>(1-\alpha )\,\!</math>, and <math>\alpha \,\!</math> is the significance level.
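The <math>F\,\!</math> statistic follows directly from the mean squares. This sketch again uses the hypothetical data from the earlier sketches (not the chapter's table); a useful cross-check is that in simple linear regression <math>{{F}_{0}}\,\!</math> equals the square of the <math>t\,\!</math> statistic for <math>{{\beta }_{1}}\,\!</math>:

```python
# F0 = MS_R / MS_E for the significance-of-regression test.
x = [50, 60, 70, 80, 90]
y = [122, 143, 159, 183, 201]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

ss_r = sum((fi - y_bar) ** 2 for fi in fitted)
ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ms_r = ss_r / 1        # regression mean square (1 dof)
ms_e = ss_e / (n - 2)  # error mean square (n-2 dof)
f0 = ms_r / ms_e
# Here f0 equals the square of the t statistic T0 for the same data.
```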
Example
The analysis of variance approach to test the significance of regression can be applied to the yield data in the preceding table. To calculate the statistic, <math>{{F}_{0}}=M{{S}_{R}}/M{{S}_{E}}\,\!</math>, the sums of squares must be obtained first.

The regression sum of squares can be calculated as:

::<math>S{{S}_{R}}=\sum_{i=1}^{n}{{({{\hat{y}}_{i}}-\bar{y})}^{2}}\,\!</math>

The error sum of squares can be calculated as:

::<math>S{{S}_{E}}=\sum_{i=1}^{n}{{({{y}_{i}}-{{\hat{y}}_{i}})}^{2}}\,\!</math>

Knowing the sums of squares, the statistic to test <math>{{H}_{0}}:{{\beta }_{1}}=0\,\!</math> can be calculated as:

::<math>{{F}_{0}}=\frac{M{{S}_{R}}}{M{{S}_{E}}}\,\!</math>

The critical value at a significance level of 0.1 is <math>{{f}_{0.1,1,n-2}}\,\!</math>.
Assuming that the desired significance is 0.1, since the <math>p\,\!</math> value < 0.1, then <math>{{H}_{0}}:{{\beta }_{1}}=0\,\!</math> is rejected, implying that a relation does exist between temperature and yield for the data in the preceding table. Using this result along with the scatter plot of the above figure, it can be concluded that the relationship that exists between temperature and yield is linear. This result is displayed in the ANOVA table as shown in the following figure. Note that this is the same result that was obtained from the <math>t\,\!</math> test in the section t Tests. The ANOVA and Regression Information tables in Weibull++ DOE folios represent two different ways to test for the significance of the regression model. In the case of multiple linear regression models these tables are expanded to allow tests on individual variables used in the model. This is done using extra sum of squares. Multiple linear regression models and the application of extra sum of squares in the analysis of these models are discussed in Multiple Linear Regression Analysis.
[[Image:doe4_10.png|center|747px|ANOVA table for the data.|link=]]
Confidence Intervals in Simple Linear Regression
A confidence interval represents a closed interval where a certain percentage of the population is likely to lie. For example, a 90% confidence interval with a lower limit of <math>A\,\!</math> and an upper limit of <math>B\,\!</math> implies that 90% of the population lies between the values of <math>A\,\!</math> and <math>B\,\!</math>. Out of the remaining 10% of the population, 5% is less than <math>A\,\!</math> and 5% is greater than <math>B\,\!</math>. (For details refer to the Life data analysis reference.) This section discusses confidence intervals used in simple linear regression analysis.
Confidence Interval on Regression Coefficients
A 100(<math>1-\alpha \,\!</math>) percent confidence interval on the slope, <math>{{\beta }_{1}}\,\!</math>, is obtained as follows:

::<math>{{\hat{\beta }}_{1}}\pm {{t}_{\alpha /2,n-2}}\cdot se({{\hat{\beta }}_{1}})\,\!</math>

Similarly, a 100(<math>1-\alpha \,\!</math>) percent confidence interval on the intercept, <math>{{\beta }_{0}}\,\!</math>, is obtained as:

::<math>{{\hat{\beta }}_{0}}\pm {{t}_{\alpha /2,n-2}}\cdot se({{\hat{\beta }}_{0}})\,\!</math>
Confidence Interval on Fitted Values
A 100(<math>1-\alpha \,\!</math>) percent confidence interval on any fitted value, <math>{{\hat{y}}_{i}}\,\!</math>, is obtained as follows:

::<math>{{\hat{y}}_{i}}\pm {{t}_{\alpha /2,n-2}}\sqrt{{{{\hat{\sigma }}}^{2}}\left[ \frac{1}{n}+\frac{{{({{x}_{i}}-\bar{x})}^{2}}}{{{S}_{xx}}} \right]}\,\!</math>

where <math>{{S}_{xx}}=\sum_{i=1}^{n}{{({{x}_{i}}-\bar{x})}^{2}}\,\!</math>.

It can be seen that the width of the confidence interval depends on the value of <math>{{x}_{i}}\,\!</math>: the interval is narrowest at <math>{{x}_{i}}=\bar{x}\,\!</math> and widens as <math>{{x}_{i}}\,\!</math> moves away from <math>\bar{x}\,\!</math>.
Confidence Interval on New Observations
For the data in the preceding table, assume that a new value of the yield is observed after the regression model is fit to the data. This new observation is independent of the observations used to obtain the regression model. If <math>{{x}_{p}}\,\!</math> is the level of the predictor variable at which the new observation is taken, then the estimate for this new value, <math>{{\hat{y}}_{p}}\,\!</math>, is:

::<math>{{\hat{y}}_{p}}={{\hat{\beta }}_{0}}+{{\hat{\beta }}_{1}}{{x}_{p}}\,\!</math>

If a confidence interval needs to be obtained on this new observation, then the interval must include both the uncertainty of the fitted model and the variability of a single future observation, because the new observation is independent of the data used to fit the model. Such an interval is called a prediction interval, and a 100(<math>1-\alpha \,\!</math>) percent prediction interval on the new observation is:

::<math>{{\hat{y}}_{p}}\pm {{t}_{\alpha /2,n-2}}\sqrt{{{{\hat{\sigma }}}^{2}}\left[ 1+\frac{1}{n}+\frac{{{({{x}_{p}}-\bar{x})}^{2}}}{{{S}_{xx}}} \right]}\,\!</math>
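The difference between the two intervals is just the extra "1+" term under the square root. A minimal sketch, assuming summary values (n, mean, Sxx, MS_E) from the illustrative data used earlier and a tabulated critical value, none of which come from the chapter's example:

```python
import math

# Half-widths of the 95% confidence interval on the mean response and the
# 95% prediction interval on a new observation, both at x_p.
n, x_bar, sxx, ms_e = 5, 70.0, 1000.0, 3.6  # from the illustrative data
t_crit = 3.182   # tabulated t_{0.025, 3} for n - 2 = 3 degrees of freedom
x_p = 75.0       # predictor level at which the intervals are wanted

d = (x_p - x_bar) ** 2 / sxx
ci_half = t_crit * math.sqrt(ms_e * (1.0 / n + d))        # mean response
pi_half = t_crit * math.sqrt(ms_e * (1.0 + 1.0 / n + d))  # new observation
# The prediction interval is always wider than the confidence interval,
# since it adds the variance of a single future observation.
```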
Example
To illustrate the calculation of confidence intervals, the 95% confidence intervals on the response at a selected value of temperature are obtained in this example.

The fitted value, <math>{{\hat{y}}_{p}}\,\!</math>, corresponding to this temperature is first calculated from the fitted regression line.

The 95% confidence interval on the mean response at this temperature is then obtained using the expression given in Confidence Interval on Fitted Values.

The 95% limits obtained in this way apply to the mean response at the selected temperature.

The 95% prediction interval on a new observation at this temperature is obtained using the expression given in Confidence Interval on New Observations, which additionally accounts for the variability of a single future observation.
The 95% limits on <math>{{\hat{y}}_{p}}\,\!</math> are 189.9 and 207.2, respectively. In Weibull++ DOE folios, confidence and prediction intervals can be calculated from the control panel. The prediction interval values calculated in this example are shown in the figure below as Low Prediction Interval and High Prediction Interval, respectively. The columns labeled Mean Predicted and Standard Error represent the values of <math>{{\hat{y}}_{p}}\,\!</math> and the standard error used in the calculations.
[[Image:doe4_11.png|center|786px|Calculation of prediction intervals in Weibull++.|link=]]
Measures of Model Adequacy
It is important to analyze the regression model before inferences based on the model are undertaken. The following sections present some techniques that can be used to check the appropriateness of the model for the given data. These techniques help to determine if any of the model assumptions have been violated.
Coefficient of Determination (<math>{{R}^{2}}\,\!</math>)
The coefficient of determination is a measure of the amount of variability in the data accounted for by the regression model. As mentioned previously, the total variability of the data is measured by the total sum of squares, <math>S{{S}_{T}}\,\!</math>. The amount of this variability explained by the regression model is the regression sum of squares, <math>S{{S}_{R}}\,\!</math>. The coefficient of determination, <math>{{R}^{2}}\,\!</math>, is the ratio of the regression sum of squares to the total sum of squares:

::<math>{{R}^{2}}=\frac{S{{S}_{R}}}{S{{S}_{T}}}\,\!</math>

For the yield data, <math>{{R}^{2}}=0.98\,\!</math>.
Therefore, 98% of the variability in the yield data is explained by the regression model, indicating a very good fit of the model. It may appear that larger values of <math>{{R}^{2}}\,\!</math> indicate a better fitting regression model. However, <math>{{R}^{2}}\,\!</math> should be used cautiously as this is not always the case. The value of <math>{{R}^{2}}\,\!</math> increases as more terms are added to the model, even if the new term does not contribute significantly to the model. Therefore, an increase in the value of <math>{{R}^{2}}\,\!</math> cannot be taken as a sign to conclude that the new model is superior to the older model. Adding a new term may make the regression model worse if the error mean square, <math>M{{S}_{E}}\,\!</math>, for the new model is larger than the <math>M{{S}_{E}}\,\!</math> of the older model, even though the new model will show an increased value of <math>{{R}^{2}}\,\!</math>. In the results obtained from the DOE folio, <math>{{R}^{2}}\,\!</math> is displayed as R-sq under the ANOVA table (as shown in the figure below), which displays the complete analysis sheet for the data in the preceding table.
The other values displayed along with R-sq are S, R-sq(adj), PRESS and R-sq(pred). These values measure different aspects of the adequacy of the regression model. For example, the value of S is the square root of the error mean square, <math>M{{S}_{E}}\,\!</math>, and represents the "standard error of the model." A lower value of S indicates a better fitting model. The values of S, R-sq and R-sq(adj) indicate how well the model fits the observed data. The values of PRESS and R-sq(pred) are indicators of how well the regression model predicts new observations. R-sq(adj), PRESS and R-sq(pred) are explained in Multiple Linear Regression Analysis.
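The ratio <math>{{R}^{2}}=S{{S}_{R}}/S{{S}_{T}}\,\!</math> can be sketched directly. The data is the hypothetical set used in the earlier sketches, so the resulting value is not the chapter's 0.98:

```python
# Coefficient of determination R^2 = SS_R / SS_T.
x = [50, 60, 70, 80, 90]
y = [122, 143, 159, 183, 201]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

ss_t = sum((yi - y_bar) ** 2 for yi in y)
ss_r = sum((fi - y_bar) ** 2 for fi in fitted)
r2 = ss_r / ss_t  # fraction of total variability explained by the line
```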
[[Image:doe4_12.png|center|874px|Complete analysis for the data.|link=]]
Residual Analysis
In the simple linear regression model the true error terms, <math>{{\epsilon }_{i}}\,\!</math>, are never known. The residuals, <math>{{e}_{i}}\,\!</math>, may be thought of as the observed counterparts of the true error terms. Since the true error terms are assumed to be normally and independently distributed with a mean of zero and a constant variance, residual plots can be used to check the following assumptions:
- 1. Residuals follow the normal distribution.
- 2. Residuals have a constant variance.
- 3. Regression function is linear.
- 4. A pattern does not exist when residuals are plotted in a time or run-order sequence.
- 5. There are no outliers.
Examples of residual plots are shown in the following figure. (a) is a satisfactory plot with the residuals falling in a horizontal band with no systematic pattern. Such a plot indicates an appropriate regression model. (b) shows residuals falling in a funnel shape. Such a plot indicates an increase in the variance of the residuals, and the assumption of constant variance is violated here. Transformations on the response, <math>y\,\!</math> (discussed in Transformations), may be used to deal with this phenomenon.
[[Image:doe4.13.png|center|550px|Possible residual plots (against fitted values, time or run-order) that can be obtained from simple linear regression analysis.|link=]]
Example
Residual plots for the data of the preceding table are shown in the following figures. One of the following figures is the normal probability plot. It can be observed that the residuals follow the normal distribution and the assumption of normality is valid here. In one of the following figures the residuals are plotted against the fitted values, <math>{{\hat{y}}_{i}}\,\!</math>, and in another the residuals are plotted against run order. Both of these plots show the residuals falling in a horizontal band with no systematic pattern, indicating that the assumptions of constant variance and independence are not violated.
[[Image:doe4_14.png|center|650px|Normal probability plot of residuals for the data.|link=]]

[[Image:doe4_15.png|center|650px|Plot of residuals against fitted values for the data.|link=]]

[[Image:doe4_16.png|center|650px|Plot of residuals against run order for the data.|link=]]
Lack-of-Fit Test
As mentioned in Analysis of Variance Approach, ANOVA, a perfect regression model results in a fitted line that passes exactly through all observed data points. This perfect model will give us a zero error sum of squares (<math>S{{S}_{E}}=0\,\!</math>). In practice, <math>S{{S}_{E}}\,\!</math> is almost never zero. The deviations of the observations from the fitted line may be due purely to random variation (pure error), or they may arise because the chosen regression model does not fit the data (lack of fit). The lack-of-fit test is used to check whether the regression model fits the data adequately. Repeated observations of the response at one or more levels of the predictor variable are required for this test.

Assume that there are <math>n\,\!</math> observations in total, taken at <math>m\,\!</math> distinct levels of the predictor variable, <math>{{x}_{1}},{{x}_{2}},...,{{x}_{m}}\,\!</math>, with <math>{{n}_{i}}\,\!</math> repeated observations at the <math>i\,\!</math>th level. Let <math>{{y}_{ij}}\,\!</math> denote the <math>j\,\!</math>th observation at <math>{{x}_{i}}\,\!</math>.

The sum of squares of the deviations from the mean of the observations at the <math>i\,\!</math>th level of <math>x\,\!</math> can be calculated as:

::<math>\sum_{j=1}^{{{n}_{i}}}{{({{y}_{ij}}-{{\bar{y}}_{i}})}^{2}}\,\!</math>

where <math>{{\bar{y}}_{i}}\,\!</math> is the mean of the <math>{{n}_{i}}\,\!</math> repeated observations at <math>{{x}_{i}}\,\!</math>. These deviations measure pure error, because they are unaffected by the choice of regression model.

The total sum of square deviations (or the pure error sum of squares, <math>S{{S}_{PE}}\,\!</math>) across all <math>m\,\!</math> levels of <math>x\,\!</math> is:

::<math>S{{S}_{PE}}=\sum_{i=1}^{m}\sum_{j=1}^{{{n}_{i}}}{{({{y}_{ij}}-{{\bar{y}}_{i}})}^{2}}\,\!</math>

The total number of degrees of freedom associated with <math>S{{S}_{PE}}\,\!</math> is:

::<math>\sum_{i=1}^{m}({{n}_{i}}-1)=n-m\,\!</math>

The corresponding mean square in this case will be:

::<math>M{{S}_{PE}}=\frac{S{{S}_{PE}}}{n-m}\,\!</math>
When repeated observations are used for a perfect regression model, the sum of squares due to pure error, <math>S{{S}_{PE}}\,\!</math>, accounts for all of the error sum of squares, <math>S{{S}_{E}}\,\!</math>, because the perfect model has no lack of fit. For any other model, the error sum of squares contains both pure error and lack-of-fit components.

Knowing <math>S{{S}_{E}}\,\!</math> and <math>S{{S}_{PE}}\,\!</math>, the sum of squares due to lack of fit, <math>S{{S}_{LOF}}\,\!</math>, can be obtained as:

::<math>S{{S}_{LOF}}=S{{S}_{E}}-S{{S}_{PE}}\,\!</math>

The degrees of freedom associated with <math>S{{S}_{LOF}}\,\!</math> can be obtained in a similar manner by subtraction.

Since there are <math>m\,\!</math> distinct levels of <math>x\,\!</math> and two regression coefficients are estimated from the data, the error sum of squares has <math>(n-2)\,\!</math> degrees of freedom and the pure error sum of squares has <math>(n-m)\,\!</math> degrees of freedom.

Therefore, the number of degrees of freedom associated with <math>S{{S}_{LOF}}\,\!</math> is:

::<math>(n-2)-(n-m)=m-2\,\!</math>

The corresponding mean square, <math>M{{S}_{LOF}}\,\!</math>, is:

::<math>M{{S}_{LOF}}=\frac{S{{S}_{LOF}}}{m-2}\,\!</math>

The magnitude of <math>M{{S}_{LOF}}\,\!</math> relative to <math>M{{S}_{PE}}\,\!</math> is an indication of the degree of lack of fit of the model. The statistic used for the lack-of-fit test is:

::<math>{{F}_{0}}=\frac{M{{S}_{LOF}}}{M{{S}_{PE}}}\,\!</math>

If the model adequately fits the data, <math>{{F}_{0}}\,\!</math> follows the <math>F\,\!</math> distribution with <math>(m-2)\,\!</math> degrees of freedom in the numerator and <math>(n-m)\,\!</math> degrees of freedom in the denominator. If the calculated statistic exceeds the critical value, that is:

::<math>{{F}_{0}}>{{f}_{\alpha ,m-2,n-m}}\,\!</math>
it will lead to the rejection of the hypothesis that the model adequately fits the data.
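The decomposition into pure error and lack of fit can be sketched with replicated data. The two responses per temperature level below are hypothetical, not the observation sets from the chapter's tables:

```python
# Lack-of-fit test: split SS_E into pure error (SS_PE) and lack of fit (SS_LOF).
levels = [50, 60, 70, 80, 90]
reps = {50: [122, 126], 60: [143, 139], 70: [159, 161],
        80: [183, 179], 90: [201, 203]}

# Flatten to (x, y) pairs and fit the least squares line to all observations.
xs = [x for x in levels for _ in reps[x]]
ys = [y for x in levels for y in reps[x]]
n, m = len(xs), len(levels)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((xi - x_bar) ** 2 for xi in xs)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

ss_e = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(xs, ys))  # n-2 dof
# Pure error: deviations of replicates from their level means, n-m dof.
ss_pe = sum(sum((y - sum(reps[x]) / len(reps[x])) ** 2 for y in reps[x])
            for x in levels)
ss_lof = ss_e - ss_pe       # lack of fit, m-2 dof
ms_pe = ss_pe / (n - m)
ms_lof = ss_lof / (m - 2)
f0 = ms_lof / ms_pe
# Compare f0 to f_{alpha, m-2, n-m}; for alpha = 0.05, f_{0.05,3,5} is about 5.41.
```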
Example
Assume that a second set of observations is taken for the yield data of the preceding table. The resulting observations are recorded in the following table. To conduct a lack-of-fit test on this data, the statistic <math>{{F}_{0}}=M{{S}_{LOF}}/M{{S}_{PE}}\,\!</math> can be calculated as shown next.
[[Image:doet4.2.png|center|436px|Yield data from the first and second observation sets for the chemical process example in the Introduction.|link=]]
Calculation of Least Square Estimates
The parameters of the fitted regression model can be obtained by applying the least square estimate equations to the combined data from both observation sets. Knowing <math>{{\hat{\beta }}_{0}}\,\!</math> and <math>{{\hat{\beta }}_{1}}\,\!</math>, the fitted values, <math>{{\hat{y}}_{i}}\,\!</math>, can be calculated for all observations.
Calculation of the Sum of Squares
Using the fitted values, the sums of squares can be obtained from the relationships given previously.
Calculation of <math>M{{S}_{PE}}\,\!</math> and <math>M{{S}_{LOF}}\,\!</math>

The error sum of squares, <math>S{{S}_{E}}\,\!</math>, and the pure error sum of squares, <math>S{{S}_{PE}}\,\!</math>, are obtained from the fitted values and the repeated observations, respectively.

The number of degrees of freedom associated with <math>S{{S}_{PE}}\,\!</math> is <math>(n-m)\,\!</math>.

The corresponding mean square, <math>M{{S}_{PE}}\,\!</math>, is:

::<math>M{{S}_{PE}}=\frac{S{{S}_{PE}}}{n-m}\,\!</math>

Similarly, the number of degrees of freedom associated with <math>S{{S}_{LOF}}=S{{S}_{E}}-S{{S}_{PE}}\,\!</math> is <math>(m-2)\,\!</math>.

The lack-of-fit mean square is:

::<math>M{{S}_{LOF}}=\frac{S{{S}_{LOF}}}{m-2}\,\!</math>
Calculation of the Test Statistic
The test statistic for the lack-of-fit test can now be calculated as:

::<math>{{F}_{0}}=\frac{M{{S}_{LOF}}}{M{{S}_{PE}}}\,\!</math>

The critical value for this test is:

::<math>{{f}_{0.05,m-2,n-m}}\,\!</math>

Since the calculated test statistic does not exceed the critical value, the hypothesis that the model adequately fits the data fails to be rejected.

Therefore, at a significance level of 0.05 we conclude that the simple linear regression model, <math>\hat{y}={{\hat{\beta }}_{0}}+{{\hat{\beta }}_{1}}x\,\!</math>, is adequate for the observed data.
Transformations
The linear regression model may not be directly applicable to certain data. Non-linearity may be detected from scatter plots or may be known through the underlying theory of the product or process or from past experience. Transformations on either the predictor variable, <math>x\,\!</math>, or the response, <math>y\,\!</math>, may often be sufficient to make the linear regression model appropriate for the transformed data.

Transformations on <math>y\,\!</math> for a few possible scatter plots are illustrated in the following figure.
[[Image:doe4.17.png|center|500px|Transformations on <math>y\,\!</math> for a few possible scatter plots. Plot (a) may require a square root transformation, (b) may require a logarithmic transformation and (c) may require a reciprocal transformation.|link=]]
For the scatter plot labeled (a), a square root transformation (<math>\sqrt{y}\,\!</math>) may be applicable; for the plot labeled (b), a logarithmic transformation (<math>\ln y\,\!</math>) may be applicable; and for the plot labeled (c), a reciprocal transformation (<math>1/y\,\!</math>) may be applicable.
The Box-Cox method may also be used to automatically identify a suitable power transformation for the data based on the relation:

::<math>y_{i}^{(\lambda )}=\frac{y_{i}^{\lambda }-1}{\lambda {{\dot{y}}^{\lambda -1}}}\,\!</math>

where <math>\dot{y}\,\!</math> is the geometric mean of the observed responses. (For <math>\lambda =0\,\!</math>, the limiting form <math>y_{i}^{(0)}=\dot{y}\ln {{y}_{i}}\,\!</math> is used.) Here the parameter <math>\lambda \,\!</math> is determined such that the error sum of squares of the model fitted to the transformed responses is minimized.
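The search for <math>\lambda \,\!</math> can be sketched as a grid search over candidate values, refitting the line to the transformed responses each time. The data below is made up to grow roughly exponentially in <math>x\,\!</math> (not from the chapter), so the log transform should win:

```python
import math

# Box-Cox grid search: pick the lambda that minimizes the error sum of
# squares of a simple linear regression fit to the transformed response.
x = [1, 2, 3, 4, 5]
y = [2.7, 7.4, 20.1, 54.6, 148.4]  # roughly e^x, hypothetical data

def sse_linear(x, y):
    """Error sum of squares of the least squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

g = math.exp(sum(math.log(yi) for yi in y) / len(y))  # geometric mean of y

def box_cox(yi, lam):
    # The g**(lam-1) factor keeps SSE comparable across different lambdas;
    # lam = 0 uses the limiting log form.
    if lam == 0:
        return g * math.log(yi)
    return (yi ** lam - 1) / (lam * g ** (lam - 1))

candidates = [-1.0, -0.5, 0.0, 0.5, 1.0]
best_lambda = min(candidates,
                  key=lambda lam: sse_linear(x, [box_cox(yi, lam) for yi in y]))
```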