
Statistics 2501 (001)
Assignment #3: Solutions



  1. Regression model to relate university GPA ($y$) to high school GPA, SAT score and hours of extracurricular activities.

    1. Find the least squares regression line.

      From the output below, we see the least squares line is

      \begin{displaymath}
\hat{y} = 0.72 + 0.611 x_{1} + 0.00271 x_{2} + 0.0463 x_{3}
\end{displaymath}

      Regression Analysis: Univ GPA versus HS GPA, SAT, Activiti
      
      The regression equation is
      Univ GPA = 0.72 + 0.611 HS GPA + 0.00271 SAT + 0.0463 Activiti
      

    2. Interpret $\hat{\beta}_{2}$.

      $\hat{\beta}_{2} = 0.00271$. This means that, for each 1 point increase in SAT score, we predict that university GPA will increase by 0.00271 points, with high school GPA and hours of extracurricular activities held fixed.
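
      To see where this number comes from, compare the fitted values at SAT scores of $x_{2}$ and $x_{2} + 1$, with high school GPA ($x_{1}$) and activities ($x_{3}$) held at the same values:

      \begin{displaymath}
\hat{y}(x_{2} + 1) - \hat{y}(x_{2}) = 0.00271 (x_{2} + 1) - 0.00271 x_{2} = 0.00271
\end{displaymath}

      The intercept and the $x_{1}$ and $x_{3}$ terms cancel, so, for example, a 100 point increase in SAT corresponds to a predicted increase of $100 \times 0.00271 = 0.271$ GPA points.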

    3. Test if model is useful.

      $H_{o}: \beta_{1} = \beta_{2} = \beta_{3} = 0
\mbox{ (Model not useful)}$
      $H_{a}: \mbox{at least 1 } \beta_{i} \neq 0 $

      From the Minitab output below

      \begin{displaymath}
F_{obs} = \frac{MS(model)}{MSE} = 12.96, \quad
\mbox{p-value} = P(F > 12.96) \approx 0
\end{displaymath}

      using an F-distribution with $k = 3$ and $(n - k - 1) = 96$ df.

      So we have very strong evidence against $H_{o}$.

      Model appears useful in predicting university GPA.

      Analysis of Variance
      
      Source            DF          SS          MS         F        P
      Regression         3     160.237      53.412     12.96    0.000
      Residual Error    96     395.697       4.122
      Total             99     555.934
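
      As a check on the table above, the mean squares and $F_{obs}$ can be recomputed from the sums of squares. This is a minimal Python sketch; the function name anova_f is illustrative, and $n = 100$ is inferred from the total df of 99.

```python
def anova_f(ss_model, ss_error, k, n):
    """Return (MS(model), MSE, F) for a regression with k predictors and n observations."""
    ms_model = ss_model / k            # regression mean square, k df
    mse = ss_error / (n - k - 1)       # error mean square, n - k - 1 df
    return ms_model, mse, ms_model / mse


# Sums of squares copied from the Analysis of Variance table.
ms_model, mse, f_obs = anova_f(160.237, 395.697, k=3, n=100)
print(round(ms_model, 3), round(mse, 3), round(f_obs, 2))
```

      The printed values should agree with the MS and F columns of the table up to rounding.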
      

    4. Test at $\alpha = 0.01$ if extracurricular activities should be dropped.

      $H_{o}: \beta_{3} = 0$
      $H_{a}: \beta_{3} \neq 0$

      From the output below: $t_{obs} = 0.72$, p-value = $2P(t > \vert.72\vert) = 0.472$ using T-distribution with $(n - k - 1) = 96$ df.

      Since $0.472 > \alpha = 0.01$, we do not reject $H_{o}$; there is very little evidence against $H_{o}$.

      So activities can probably be dropped from the model.

      Predictor        Coef     SE Coef          T        P
      Constant        0.721       1.870       0.39    0.701
      HS GPA         0.6109      0.1007       6.06    0.000
      SAT          0.002708    0.002873       0.94    0.348
      Activiti      0.04625     0.06405       0.72    0.472
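
      The T and P columns can be reproduced from Coef and SE Coef. The sketch below is not how Minitab computes them; it is a standard-library approximation that integrates the t density numerically with Simpson's rule. The names t_pdf and two_sided_p are illustrative.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + x * x / df))


def two_sided_p(t_obs, df, steps=2000):
    """Approximate 2 * P(T > |t_obs|) by Simpson's rule on [0, |t_obs|]."""
    b = abs(t_obs)
    h = b / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # P(0 < T < |t_obs|)
    return 1 - 2 * area


# Activiti row: t = Coef / SE Coef, with n - k - 1 = 96 df.
t_obs = 0.04625 / 0.06405
p_value = two_sided_p(t_obs, df=96)
print(round(t_obs, 2), round(p_value, 3))
```

      The same two lines at the bottom can be rerun with any other row of the table by changing the coefficient, standard error and df.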
      

    5. Find 90% PI for $y$ if SAT = 550, high school GPA = 9 and Activities = 8.

      From the Minitab output below, the 90% PI is

      \begin{displaymath}
\hat{y} \pm t_{(.1/2)} \sqrt{\hat{\sigma}^{2} +
s^{2}_{\hat{y}} } =
(4.669, 11.488).
\end{displaymath}

      NOTE: I suspect there may be cases where a very strange PI is found. That is probably because the high school GPA and SAT were put in backwards, i.e. 550 was put in for $x_{1}$, not $x_{2}$. If that has happened, just take off 0.5 points.

      Predicted Values for New Observations
      
      New Obs     Fit     SE Fit         90.0% CI             90.0% PI
      1         8.079      0.303   (   7.575,   8.582)  (   4.669,  11.488)   
      
      Values of Predictors for New Observations
      
      New Obs    HS GPA       SAT  Activiti
      1            9.00       550      8.00
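
      The fit and the 90% PI can be checked by hand from the coefficient table, MSE and SE Fit. In this sketch the critical value 1.661 is an assumption: it is $t_{0.05}$ with 96 df as read from a t-table, not something shown in the Minitab output. The function names are illustrative.

```python
import math

# Coefficients from the Minitab coefficient table (Constant, HS GPA, SAT, Activiti).
B0, B_HS, B_SAT, B_ACT = 0.721, 0.6109, 0.002708, 0.04625


def fitted_value(hs_gpa, sat, activities):
    """Point prediction of university GPA from the fitted regression."""
    return B0 + B_HS * hs_gpa + B_SAT * sat + B_ACT * activities


def prediction_interval(fit, se_fit, mse, t_crit):
    """fit +/- t * sqrt(MSE + SE_fit^2): interval for a single new observation."""
    half_width = t_crit * math.sqrt(mse + se_fit ** 2)
    return fit - half_width, fit + half_width


T_CRIT = 1.661  # assumed table value of t_{0.05} with 96 df
fit = fitted_value(hs_gpa=9, sat=550, activities=8)
lower, upper = prediction_interval(fit, se_fit=0.303, mse=4.122, t_crit=T_CRIT)
print(round(fit, 3), round(lower, 3), round(upper, 3))
```

      Using keyword arguments here also guards against the swap described in the note above: passing 550 as the high school GPA gives a fitted value in the hundreds rather than near 8.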
      

    6. Residual plots. The plots are attached.

      NOTE: The plots can use either the standardized residuals or the residuals.

      The plot of the residuals vs. the $\hat{y}$ values does not show any pattern, nor do there appear to be any outliers. Therefore, most of our model assumptions seem satisfied.

      The QQ-plot appears linear, suggesting that it is safe to assume that the errors are normally distributed.

    7. See if interaction term can be dropped from new model.

      $H_{o}: \beta_{4} = 0$
      $H_{a}: \beta_{4} \neq 0$

      From the Minitab output below: $t_{obs} = 0.33$, p-value = $2P(t > \vert.33\vert) = 0.741$ using T-distribution with $(n - k - 1) = 95$ df.

      Very little evidence against $H_{o}$.

      So we don't need the interaction term.

      NOTE: You may have noticed a strange thing in this problem. Without the interaction term, it appears that one variable (high school GPA) is useful. However, when a (high school GPA $\times$ SAT) interaction is included, no variables appear useful.

      To understand bizarre results like this, you'll definitely want to attend Stats 3521 in the Winter :-)

      Predictor        Coef     SE Coef          T        P
      Constant        2.738       6.375       0.43    0.669
      HS GPA         0.3519      0.7888       0.45    0.657
      SAT          -0.00104     0.01167      -0.09    0.929
      Activiti      0.04886     0.06483       0.75    0.453
      gpasat       0.000480    0.001449       0.33    0.741
      

  2. Drywall sales data.

    1. See if apartment vacancy rate can be dropped.

      $H_{o}: \beta_{3} = 0$
      $H_{a}: \beta_{3} \neq 0$

      From the output below: $t_{obs} = -1.36$, p-value = $2P(t > \vert-1.36\vert) = 0.194$ using T-distribution with $(n - k - 1) = 14$ df.

      Little evidence against $H_{o}$.

      There is little evidence that apt. vacancy rate is needed.

      The regression equation is
      Drywall = - 138 + 4.97 Permits + 20.7 Mortgage - 10.9 A Vacanc + 0.50 O Vacanc
      
      Predictor        Coef     SE Coef          T        P
      Constant       -137.9       163.0      -0.85    0.412
      Permits        4.9657      0.4869      10.20    0.000
      Mortgage        20.70       19.77       1.05    0.313
      A Vacanc      -10.946       8.019      -1.36    0.194
      O Vacanc        0.504       3.336       0.15    0.882
      
      S = 43.68       R-Sq = 89.6%     R-Sq(adj) = 86.6%
      
      Analysis of Variance
      
      Source            DF          SS          MS         F        P
      Regression         4      229881       57470     30.11    0.000
      Residual Error    14       26717        1908
      Total             18      256599
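
      The summary line S, R-Sq and R-Sq(adj) in the output above follows directly from the ANOVA table. A minimal sketch, assuming $n = 19$ (from the total df of 18) and $k = 4$ predictors; the function name fit_summary is illustrative.

```python
import math


def fit_summary(ss_error, ss_total, n, k):
    """Return (S, R-squared, adjusted R-squared) from ANOVA sums of squares."""
    mse = ss_error / (n - k - 1)
    s = math.sqrt(mse)                          # residual standard deviation
    r2 = 1 - ss_error / ss_total
    r2_adj = 1 - mse / (ss_total / (n - 1))     # penalizes extra predictors
    return s, r2, r2_adj


# Residual Error SS and Total SS from the drywall ANOVA table.
s, r2, r2_adj = fit_summary(26717, 256599, n=19, k=4)
print(round(s, 2), round(100 * r2, 1), round(100 * r2_adj, 1))
```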
      

    2. 99% CI for E($y$).

      From the output below, the 99% CI for mean sales if 40 permits were issued, mortgage rates were 8.5%, and vacancy rates were 2% and 10% respectively is

      \begin{displaymath}
\hat{y} \pm t_{(.01/2)} s_{\hat{y}} = (126.3, 313.4)
\end{displaymath}

      We're 99% confident that average sales would be between 126.3 (x 100) sheets and 313.4 (x 100) sheets.

      Predicted Values for New Observations
      
      New Obs     Fit     SE Fit         99.0% CI             99.0% PI
      1         219.9       31.4   (   126.3,   313.4)  (    59.7,   380.0)   
      
      Values of Predictors for New Observations
      
      New Obs   Permits  Mortgage  A Vacanc  O Vacanc
      1            40.0      8.50      2.00      10.0
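
      The 99% CI can be checked from Fit and SE Fit. The half-width uses the standard error SE Fit itself, not its square. In this sketch the critical value 2.977 is an assumption: it is $t_{0.005}$ with 14 df as read from a t-table. The function name is illustrative.

```python
def mean_response_ci(fit, se_fit, t_crit):
    """fit +/- t * SE_fit: confidence interval for the mean response E(y)."""
    half_width = t_crit * se_fit
    return fit - half_width, fit + half_width


T_CRIT = 2.977  # assumed table value of t_{0.005} with 14 df
lower, upper = mean_response_ci(fit=219.9, se_fit=31.4, t_crit=T_CRIT)
print(round(lower, 1), round(upper, 1))
```

      Because Fit and SE Fit are rounded in the output, the endpoints may differ from Minitab's in the first decimal place.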
      

    3. Residual plots. These are attached. Again, the plots can use either standardized residuals or regular residuals.

      The plot of the residuals vs. the $\hat{y}$ values does not appear to have a pattern, or any noticeable outliers (even though the Minitab output flags one potential outlier). Therefore most of our model assumptions appear satisfied.

      The QQ-plot appears linear, so the assumption of normally distributed errors is reasonable.

  3. Refer to #10.35.

    1. The plot of the data is attached. There does seem to be one unusual point, in the upper-right hand portion of the plot. It has a much larger $x$ and $y$ value than all the other values.

    2. The plot of the least squares line with the data is attached. The value of $R^{2}$ is 0.793, so about 79% of the variation in pay is explained by the linear relationship with performance. This is telling us that our model is doing a reasonable job of predicting pay from performance.

      As an added point (doesn't have to be included in discussion), it looks like the least squares line comes reasonably close to the unusual value on our plot.

    3. The residual plot is attached. There appears to be a pattern in this plot; in particular, the residuals seem to decrease in a linear fashion as the fitted values ($\hat{y}$) increase.

      This pattern implies that one (or more) of our model assumptions is not satisfied.

      Our unusual point also stands out for being far away from the other points, in terms of its $\hat{y}$ value, and because it also has a reasonably large residual.

    4. Remove ``unusual'' point, redo parts (b) and (c).

      (b) The plot of the least squares line on the new data is attached. The $R^{2}$ value is 0.129, which is much smaller than observed with the entire dataset. This is telling us that the model we have is not doing a very good job.

      NOTE: The explanation below is not required.

      What happened? It's difficult to say, to be honest. If we look at this new plot, it seems like the data fall into two different groupings. However, the least squares line goes right between the groupings, so it is a poor description of the relationship. So the low $R^2$ is probably reflecting this.

      However, that doesn't explain why we got a more reasonable value of $R^2$ with the unusual value included.

      I guess this problem is a good illustration of what an influential observation can do, and why we don't always have simple answers to things in statistics.
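
      One way a single point can do this is sketched below with made-up numbers (these are NOT the data from #10.35). A cluster with almost no linear trend gets a large $R^2$ once a single point far out in both $x$ and $y$ is added, because that point accounts for most of the total variation and the line passes close to it.

```python
def r_squared(xs, ys):
    """R^2 of the simple least squares line of ys on xs (squared correlation)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


# Hypothetical cluster with essentially no linear relationship.
x = [1, 2, 3, 4, 5]
y = [3, 1, 4, 2, 3]

r2_without = r_squared(x, y)
# Add one hypothetical point with much larger x and y than the rest.
r2_with = r_squared(x + [15], y + [12])
print(round(r2_without, 3), round(r2_with, 3))
```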

      (c) Residual plot. Although there is no dramatic pattern like before, there appears to be a pattern. In particular, it looks like the residuals are getting further from 0 as $\hat{y}$ increases.

      This suggests that the assumption that the errors have constant variance is violated.

  4. Problem 11.52, p. 590.

    1. Plot of data, by hand. The plot is attached.

      The relationship is definitely not linear. The pattern appears to be curved, with $y$ dropping quickly as $x$ increases to start, then it starts to level off, and perhaps increase slightly.

    2. Use output in book:

      $H_{o}: \beta_{2} = 0$
      $H_{a}: \beta_{2} \neq 0$

      From output: $t_{obs} = 2.69$, p-value = $2P(t > \vert 2.69\vert) = 0.031$.

      Test at $\alpha = 0.1$: Since $0.031 < 0.1$, reject $H_{o}$.

      Equivalently, look up $t_{\alpha/2} = t_{0.05}$ in T-table, and reject since $2.69 > t_{0.05}$.

      In the words of the problem: we need the weight-squared term, or we conclude that there is a quadratic relationship between weight and ENE.




Gary Sneddon 2003-10-25