
Statistics 2501 (001)
Assignment #4: Solutions



  1. A first-order linear regression model with 6 explanatory variables was fit to a data set of 33 observations. Complete the following ANOVA table:

     SOURCE       DF          SS          MS         F        p
     
     Regression   6          60          10       13.00     0.000 
                 ---       -----                  -----     -----
     
     Error       26          20        0.769 
                ----                   -----   
     
     Total       32          80 
               -----       -----
    
    Reasons:

    \begin{displaymath}
6 = k = \mbox{ Number of explanatory variables}, \quad
32 = n - 1 = 33 - 1, \quad
26 = n - k - 1
\end{displaymath}


    \begin{displaymath}
SS(Model) = 60 = MS(Model)k = 10(6), \quad
SS_{yy} = SS(Model) + SSE = 60 + 20
\end{displaymath}


    \begin{displaymath}
MSE = SSE/(n - k - 1) = 20/26, \quad
F = MS(Model)/MSE = 10/0.769
\end{displaymath}

    Finally, we need P($F > 13.00$) for an F-distribution with 6 and 26 df, which we get from Minitab. Following the instructions on the handout, we find that:
          x    P (X <= x)
      13.00         1.000        # So P(F <= 13.00) = 1 (approx). 
    
    # So P(F > 13.00) = 1 - 1  = 0.000 (approx).
    
    So p-value = 0.000 (approx.).
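The bookkeeping above can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and it assumes the entries supplied in the question were n = 33, k = 6, MS(Model) = 10 and SSE = 20 (the entries that are not underlined in the table).

```python
# Sketch: fill in an ANOVA table from n, k, MS(Model) and SSE.
# Variable names are illustrative; the four inputs are assumed to be the given entries.
n, k = 33, 6                 # observations, explanatory variables
ms_model, sse = 10.0, 20.0   # assumed given: MS(Model) and SSE

# Degrees of freedom
df_reg = k
df_err = n - k - 1
df_tot = n - 1

# Sums of squares, mean squares and the F statistic
ss_model = ms_model * df_reg
ss_total = ss_model + sse
mse = sse / df_err
f_stat = ms_model / mse

print(f"Regression  {df_reg:3d}  {ss_model:6.1f}  {ms_model:7.3f}  {f_stat:6.2f}")
print(f"Error       {df_err:3d}  {sse:6.1f}  {mse:7.3f}")
print(f"Total       {df_tot:3d}  {ss_total:6.1f}")
```

The p-value still has to come from Minitab or an F-table, since the Python standard library has no F-distribution function.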

  2. Refer to Temco company data in Minitab.

    1. Fit the regression equation

      \begin{displaymath}
y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{1}^{2} +
\beta_{3}x_{2} + \beta_{4}x_{2}^{2} + e
\end{displaymath}

      to the data, and report the least squares line.

      Regression Analysis: Salary versus YrsEm, YrsEmsq, Educ, Educsq
       
       
      The regression equation is
      Salary = 24187 + 844 YrsEm - 7.7 YrsEmsq + 1112 Educ + 76.0 Educsq
      

      From the output, we see that

      \begin{displaymath}
\hat{y} =
24187 + 844 x_{1} - 7.7 x_{1}^{2} + 1112 x_{2} + 76.0 x_{2}^{2}
\end{displaymath}

    2. Is the model useful for predicting salary?

      $ H_{o}: \beta_{1} = \beta_{2}= \beta_{3} = \beta_{4} = 0$
      $H_{a}: \mbox{At least 1 } \beta_{i} \neq 0$

       Analysis of Variance
       
      Source            DF          SS          MS         F        P
      Regression         4  4052707001  1013176750     29.85    0.000
      Residual Error    41  1391544402    33940107
      Total             45  5444251403
      

      From the output

      \begin{displaymath}
F_{obs} = MS(Model)/MSE = 29.85, \quad
\mbox{p-value } = P(F \geq 29.85) \approx 0
\end{displaymath}

      using $k = 4$ and $n - k -1 = 41$ df.

      Therefore we have very strong evidence against $H_{o}$.

      So it appears that the model is useful in predicting the salary.
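As a cross-check on the Minitab output, the overall F statistic can be recomputed from the sums of squares in the ANOVA table. The sketch below also computes it from $R^{2} = SS(Model)/SS_{yy}$, which gives the same value; the variable names are illustrative and $R^{2}$ is not part of the output shown above.

```python
# Sketch: overall F statistic for the Temco model, computed two ways.
ss_reg = 4052707001   # SS(Model) from the ANOVA table
ss_err = 1391544402   # SSE
ss_tot = 5444251403   # SS_yy
k, df_err = 4, 41     # numerator and denominator df

# Way 1: ratio of mean squares
f_from_ms = (ss_reg / k) / (ss_err / df_err)

# Way 2: from R^2 (algebraically the same, since SS_yy = SS(Model) + SSE)
r_sq = ss_reg / ss_tot
f_from_r_sq = (r_sq / k) / ((1 - r_sq) / df_err)

print(round(f_from_ms, 2), round(r_sq, 3), round(f_from_r_sq, 2))
```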

    3. A 90% PI for salary when $x_{1} = 8$, $x_{2} = 4$.
      Predicted Values for New Observations
      
      New Obs     Fit     SE Fit         90.0% CI             90.0% PI
      1         36111       1135   (   34201,   38021)  (   26123,   46100)   
      
      Values of Predictors for New Observations
      
      New Obs     YrsEm     yrssq      Educ    educsq
      1            8.00      64.0      4.00      16.0
      
      From the output, we see that

      \begin{displaymath}
\hat{y} \pm t_{0.05} \sqrt{\hat{\sigma}^{2} + s^{2}_{\hat{y}}} =
(26123, 46100)
\end{displaymath}
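The interval formula can be checked by hand against the Minitab output. The sketch below is an illustration only: it takes Fit and SE Fit from the output above and MSE from the ANOVA table of the complete model, and it assumes a t-table value of $t_{0.05} \approx 1.683$ for 41 df (that value is not printed in the output).

```python
import math

# Sketch: rebuild the 90% CI and PI from Fit, SE Fit and MSE.
fit = 36111.0
se_fit = 1135.0
mse = 33940107.0   # estimate of sigma^2 from the complete model
t_crit = 1.683     # ASSUMPTION: t_{0.05} with 41 df, read from a t-table

half_ci = t_crit * se_fit                          # CI uses SE Fit only
half_pi = t_crit * math.sqrt(mse + se_fit ** 2)    # PI adds the error variance

ci = (fit - half_ci, fit + half_ci)
pi = (fit - half_pi, fit + half_pi)

print("90% CI:", tuple(round(v) for v in ci))
print("90% PI:", tuple(round(v) for v in pi))
```

Small differences from Minitab's endpoints are expected, because Fit, SE Fit and the t-value are all rounded here.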

    4. Residual plots, which are attached.

      NOTE: I will use the standardized residuals in my plots. It is perfectly fine to use the regular residuals, since the shape of the plots will be exactly the same. Therefore, full credit will be given if the regular residuals are used in the plots.

      First, the residuals vs. the $\hat{y}$ values. There does seem to be a pattern in this plot; as the fitted values ($\hat{y}$ values) increase, the residuals are getting further from 0. This indicates one of our model assumptions is probably violated. In particular, it suggests that our assumption that the errors all have the same standard deviation ($\sigma$) may not hold.

      In terms of potential outliers, there appear to be one or two, as we see one standardized residual around -3, and another around -2.5.

      Next, the QQ-plot to assess normality. A linear pattern in this plot may seem reasonable, if we ignore the 2 points to the far left of the plot (the potential outliers). If we do this, then the assumption of normality in the errors seems reasonable.

    5. Test at $\alpha = 0.1$ if we can drop $x_{2}$ and $x_{2}^{2}$ from the model.

      We are testing if two terms can be dropped. This is a partial F-test, and should not be done as two separate t-tests, one on each term. Three (3) points will be deducted if this is done.

      $H_{o}: \beta_{3} = \beta_{4} = 0$
      $H_{a}: \mbox{At least 1 of } \beta_{3}, \beta_{4} \neq 0$

      There are 2 ways to find $SSE_{R} - SSE_{C}$. FULL CREDIT will be given to either approach.

      Method 1: Use the output from the complete model used in (a), where we have included the variables we want to drop last in the regress command. The required portion of the output is below:

       
      Source       DF      Seq SS
      YrsEm         1  3187557844
      YrsEmsq       1    12205007
      Educ          1   828603692
      Educsq        1    24340458
      
      Then

      \begin{displaymath}
SSE_{R} - SSE_{C} = 828603692 + 24340458 = 852944150
\end{displaymath}

      Method 2: From the output of the complete model in (a), we see that $SSE_{C} = 1391544402$. Now we use Minitab to fit the reduced model:

      \begin{displaymath}
y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{1}^{2} + e
\end{displaymath}

       
      Regression Analysis: Salary versus YrsEm, YrsEmsq  
      
      Analysis of Variance
       
      Source            DF          SS          MS         F        P
      Regression         2  3199762850  1599881425     30.65    0.000
      Residual Error    43  2244488552    52197408
      Total             45  5444251403
      
      From this output, $SSE_{R} = 2244488552$, so

      \begin{displaymath}
SSE_{R} - SSE_{C} = 2244488552 - 1391544402 = 852944150
\end{displaymath}

      Then

      \begin{displaymath}
F_{obs} = \frac{ (SSE_{R} - SSE_{C})/(k - g) }
{ MSE_{C}}
=\frac{852944150/2}{33940107} = 12.57
\end{displaymath}

      Test at $\alpha = 0.1$: Reject $H_{o}$ if $ 12.57 > F_{0.1}$, using $(k - g) = 2$ (since there are 2 $\beta$'s in $H_{o}$) and $(n -k -1)= 41$ df.

      From the F-table: $F_{0.1} = 2.44$, using 2 and 40 df, since 41 df is not in the table.

      Since $2.44 < 12.57$, we reject $H_{o}$.

      Therefore it appears that we cannot drop the education terms.
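The partial F-test arithmetic can be reproduced as follows. This is a sketch with illustrative names. The two helper functions rely on a closed form that holds only when the numerator has exactly 2 df, namely $P(F > f) = (1 + 2f/m)^{-m/2}$ for 2 and $m$ df; they stand in for the F-table and are not how Minitab computes these values.

```python
# Sketch: partial F-test for dropping Educ and Educsq (2 terms).
seq_ss_educ, seq_ss_educsq = 828603692, 24340458     # Method 1: Seq SS
sse_reduced, sse_complete = 2244488552, 1391544402   # Method 2: SSE_R, SSE_C
mse_complete = 33940107                              # MSE of the complete model
num_df, den_df = 2, 41                               # (k - g) and (n - k - 1)

diff_method1 = seq_ss_educ + seq_ss_educsq
diff_method2 = sse_reduced - sse_complete

f_obs = (diff_method2 / num_df) / mse_complete


def f2_upper_tail(f, m):
    """P(F > f) for 2 and m df. Closed form valid ONLY for 2 numerator df."""
    return (1 + 2 * f / m) ** (-m / 2)


def f2_critical(alpha, m):
    """Upper-alpha critical value for 2 and m df (inverse of f2_upper_tail)."""
    return (m / 2) * (alpha ** (-2 / m) - 1)


print(diff_method1 == diff_method2)        # both methods agree
print(round(f_obs, 2))                     # observed partial F
print(round(f2_critical(0.1, 40), 2))      # table value used above (2 and 40 df)
print(f2_upper_tail(f_obs, den_df))        # p-value with 2 and 41 df
```

Using 41 df instead of 40 changes the critical value only slightly, so the conclusion is the same either way.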

  3. Thiamin content and cereal grain.

    1. What is the least squares line for predicting thiamin content?

      The model is

      \begin{displaymath}
y = \beta_{0} + \beta_{1}x_{1} +\beta_{2}x_{2} + \beta_{3}x_{3} + e
\end{displaymath}

      where

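      \begin{displaymath}
x_{1} = \left\{ \begin{array}{ll}
1 & \mbox{ if wheat} \\
0 & \mbox{ otherwise}
\end{array} \right. \quad
x_{2} = \left\{ \begin{array}{ll}
1 & \mbox{ if barley} \\
0 & \mbox{ otherwise}
\end{array} \right. \quad
x_{3} = \left\{ \begin{array}{ll}
1 & \mbox{ if maize} \\
0 & \mbox{ otherwise}
\end{array} \right.
\end{displaymath}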

      Regression Analysis: Yield versus Wheat, Barley, Maize
      
       
      The regression equation is
      Yield = 6.98 - 1.27 Wheat - 0.383 Barley - 1.48 Maize
      

      From the output, the least squares line is

      \begin{displaymath}
\hat{y} = 6.98 -1.27x_{1} -0.383x_{2} - 1.48x_{3}
\end{displaymath}

    2. Show how $\hat{\beta}$ values are related to the average thiamin contents observed.

      NOTE: If someone just states $\beta_{0} = \mu_{4}$, $\beta_{1} = \mu_{1} - \mu_{4}$, etc., without showing any numerical calculations, take off one point for this question.

      I've used Minitab to find the 4 averages. However, it's fine if a calculator was used, so no Minitab output is shown.

      From the output below, we see that

      Wheat: $\bar{y}_{1} = 5.717$,
      Barley: $\bar{y}_{2} = 6.600$,
      Maize: $\bar{y}_{3} = 5.500$,
      Oats: $\bar{y}_{4} = 6.983$,

      and

      $ \hat{\beta}_{0} = 6.983 = \bar{y}_{4}$ (Oats)
      $ \hat{\beta}_{1} = -1.266 = \bar{y}_{1} -\bar{y}_{4}$ (Wheat - Oats)
      $ \hat{\beta}_{2} = -0.383 = \bar{y}_{2} -\bar{y}_{4}$ (Barley - Oats)
      $ \hat{\beta}_{3} = -1.483 = \bar{y}_{3} -\bar{y}_{4}$ (Maize - Oats)

      Predictor        Coef     SE Coef          T        P
      Constant       6.9833      0.3552      19.66    0.000
      Wheat         -1.2667      0.5023      -2.52    0.020
      Barley        -0.3833      0.5023      -0.76    0.454
      Maize         -1.4833      0.5023      -2.95    0.008
      
      
      
       
       Variable        Type     N     Mean   
       
       yield          Wheat     6    5.717   
                     Barley     6    6.600    
                      Maize     6    5.500    
                       Oats     6    6.983
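The relationship between the coefficients and the group means is simple enough to verify directly. The sketch below (names are illustrative) treats oats as the baseline level, as in the model above, and rebuilds each coefficient from the four averages.

```python
# Sketch: dummy-variable coefficients as differences of group means.
means = {"Wheat": 5.717, "Barley": 6.600, "Maize": 5.500, "Oats": 6.983}
baseline = "Oats"   # the grain with all three dummy variables equal to 0

# Intercept is the baseline mean; each slope is (group mean - baseline mean).
coef = {"Constant": means[baseline]}
for grain, mean in means.items():
    if grain != baseline:
        coef[grain] = mean - means[baseline]

for name, b in coef.items():
    print(f"{name:10s}{b:8.3f}")
```

The results agree with the Coef column to about three decimals; the last digit can differ because the means are rounded.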
      

    3. Test if the model is useful. In addition, state the hypotheses in terms of the $\beta$'s and the $\mu$'s.

      $H_{o}: \beta_{1} = \beta_{2} = \beta_{3} = 0
\mbox{ (model not useful)}$
      $H_{a}: \mbox{ at least 1 } \beta_{i} \neq 0$

      Now we must express these in terms of the mean thiamin contents for the four types of grain. Although I will rewrite both $H_{o}$ and $H_{a}$, only the rewriting of $H_{o}$ will be graded.

      Suppose we let $\mu_{1} = $ mean content for wheat, $\ldots$, $\mu_{4} = $ mean content for oats. Then we can write
      $H_{o}: \mu_{1} = \mu_{2} = \mu_{3} = \mu_{4}
\mbox{ (all mean thiamin contents equal)}$
      $H_{a}$: at least 2 $\mu$'s differ.

      From the output,

      \begin{displaymath}
F_{obs} = 3.96, \quad
\mbox{p-value } = P(F \geq 3.96) = 0.023
\end{displaymath}

      using the F-distribution with $k = 3$ and $(n-k-1) = 20$ df.

      Test at $\alpha = 0.05$: Since p-value = $0.023 < 0.05$, we reject $H_{o}$.

      So the model is useful (at least 2 of the mean thiamin contents differ).

      NOTE: Give full credit if F-table is used to find $F_{obs} > F_{.05}$.

      Analysis of Variance
       
      Source            DF          SS          MS         F        P
      Regression         3      8.9833      2.9944      3.96    0.023
      Residual Error    20     15.1367      0.7568
      Total             23     24.1200
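To see why the two sets of hypotheses are equivalent, note that with only dummy variables in the model, SS(Model) is the between-group sum of squares from a one-way layout. The sketch below recomputes F from the ANOVA table and then rebuilds SS(Model) from the four group averages. Names are illustrative, and the rebuilt value is only approximate because the averages are rounded to three decimals.

```python
# Sketch: F statistic from the ANOVA table, plus a between-group SS check.
ss_reg, ss_err = 8.9833, 15.1367
k, df_err = 3, 20

f_obs = (ss_reg / k) / (ss_err / df_err)

# Between-group sum of squares from the (rounded) group averages.
means = [5.717, 6.600, 5.500, 6.983]   # wheat, barley, maize, oats
n_per_group = 6
grand_mean = sum(means) / len(means)   # plain average is valid: equal group sizes
ss_between = n_per_group * sum((m - grand_mean) ** 2 for m in means)

print(round(f_obs, 2))
print(round(ss_between, 2), "vs", ss_reg)
```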
      




Gary Sneddon 2003-11-06