Stats 2501: Data Analysis Project

Due: Monday, Dec. 1, 2003



You are required to analyze the data set that is described at the end of this handout. While you may discuss aspects of this project with others in the class, and of course with me, each of you must submit your own report containing your own work.

Your project should take the form of a report, organized into sections and written in paragraphs and complete sentences. I will expect to see some computer output included with your report, including plots and results given by Minitab. However, I ask that you submit only relevant output. If Minitab produced output you did not want while you were working on your dataset, please do not submit that material.

There is no required length for your report. The main thing is that you address the issues raised in this handout. This probably can't be done in 2 pages, but it won't take 50 pages, either. I suspect most reports will be in the 8-13 page range.

Your report may be typed or handwritten. It should contain the following sections. Not all of these sections will be of equal length. In fact, some sections may be very brief.



INTRODUCTION

  1. A brief description of the dataset, and why it may be of interest to study.

  2. A bulleted list of variables and their descriptions.

  3. A small number of questions of interest in bulleted list format. These are issues you will want to address when studying your dataset. You don't have to give many questions; 2 or 3 will suffice.

DATA ANALYSIS

  1. Exploratory Data Analysis. First, create a histogram of the response variable, and comment on its shape. Then create a collection of plots of your data: one plot of $y$ against each of your potential $x$ variables. However, I only want you to submit 3 or 4 of these plots with your project. The plots you choose to submit are up to you. Describe any features in the plots you submit, and also briefly mention any patterns of interest in the plots you did not submit.

  2. Describe what types of terms you are allowing in your models, i.e. quadratic terms, cubic terms, interaction terms, etc.

  3. Develop a model that addresses the question of interest, with the explanatory variables used clearly identified. As an aid in this, you should begin by fitting a very large model, i.e. one that includes all the potential $x$ variables, along with the quadratic terms, etc. that you mention in step 2. Then use appropriate methods of statistical inference, as discussed in our course (hypothesis tests, confidence intervals), to work your way toward the final model that you suggest is appropriate.

    You are also encouraged to refer to values such as $R^{2}$ when making general comments about the suitability of a particular model. You should also read section 11.5 of the text, which discusses $R^{2}$, and another measure, called $R_{a}^{2}$, which you may find helpful. The value $R_{a}^{2}$ is labelled R-sq (adj) in Minitab. You should also read about the concept of a parsimonious model in section 11.11.

    For any hypothesis testing that you do, make sure to state your hypotheses and test statistic. Base your conclusions on your p-values whenever possible.

    Make sure to compare the final model you select with your original model, using a partial F-test. However, even if this test suggests you keep the original model, give some argument (which may be non-statistical) to support your choice of model.

  4. Interpret the parameters of the final model in the context of the problem and, if possible, in terms of the questions of interest that you posed.
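
As a rough illustration of the arithmetic behind the partial F-test and R-sq (adj) mentioned in step 3, the sketch below computes both from error sums of squares read off regression output. It is written in Python rather than Minitab, the function names are made up for this example, and the numbers are invented (they are not from the Boston data). It assumes the reduced model is nested in the full model.

```python
def partial_f(sse_reduced, sse_full, n_dropped, n_obs, n_params_full):
    """F statistic for H0: the n_dropped extra coefficients in the full model are all 0.

    n_params_full counts every coefficient in the full model, intercept included.
    Returns the statistic with its numerator and denominator degrees of freedom.
    """
    df_error_full = n_obs - n_params_full
    numerator = (sse_reduced - sse_full) / n_dropped
    denominator = sse_full / df_error_full
    return numerator / denominator, n_dropped, df_error_full


def adjusted_r2(sse, sst, n_obs, n_params):
    """R-sq (adj): R^2 with a penalty for the number of fitted coefficients."""
    return 1.0 - (sse / (n_obs - n_params)) / (sst / (n_obs - 1))


# Invented numbers, for illustration only.
f_stat, df1, df2 = partial_f(sse_reduced=120.0, sse_full=100.0,
                             n_dropped=4, n_obs=56, n_params_full=6)
print(f_stat, df1, df2)  # F = 2.5 on (4, 50) degrees of freedom
```

The p-value comes from comparing the statistic with an F distribution on (df1, df2) degrees of freedom; a small p-value is evidence that at least one of the dropped terms is needed.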

MODEL FIT

  1. List the assumptions that are made when using your multiple linear regression model.

  2. Use appropriate residual plots to assess these assumptions. Be sure to state if any assumptions appear to be violated.

    In addition, if there appear to be any outliers in the dataset, be sure to mention this. If a small number (say 2 or 3) of the residuals appear to be quite far from 0 (i.e. standardized residuals greater than 4 in absolute value), remove these points from your dataset, and re-fit the final model you proposed in the previous section. Have your estimates changed a great deal?
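
The outlier check above can be sketched as follows. To keep the example short it uses a single predictor, for which the leverage has a simple closed form; your own model will have several predictors, and Minitab reports standardized residuals for you. The code is Python, the function names and data are made up for illustration, and the residuals are standardized by the usual definition $e_i / (s\sqrt{1-h_{ii}})$.

```python
import math


def fit_line(x, y):
    """Least-squares (intercept, slope) for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def standardized_residuals(x, y):
    """e_i / (s * sqrt(1 - h_ii)) for a one-predictor fit."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))
    leverage = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return [e / (s * math.sqrt(1.0 - h)) for e, h in zip(resid, leverage)]


# Made-up data: an exact line y = 2 + 3x with one response shifted far off it.
x = [float(i) for i in range(30)]
y = [2.0 + 3.0 * xi for xi in x]
y[15] += 50.0

std_resid = standardized_residuals(x, y)
flagged = [i for i, r in enumerate(std_resid) if abs(r) > 4.0]

# Re-fit without the flagged points and compare the estimates.
x_kept = [xi for i, xi in enumerate(x) if i not in flagged]
y_kept = [yi for i, yi in enumerate(y) if i not in flagged]
before = fit_line(x, y)
after = fit_line(x_kept, y_kept)
print(flagged, before, after)
```

Comparing `before` with `after` is the "have your estimates changed a great deal?" question in miniature.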

CONCLUSION

  1. Answer the questions from your introduction.

  2. Briefly address the following: Can you think of other variables that could have assisted you, but were not measured or reported?

Dataset Details: Boston Housing Prices



Your chief goal is to develop a regression model that predicts the price of homes in the suburbs of Boston based on size and neighborhood information. However, if you can think of other questions of interest that arise from this dataset, please make sure to mention and address those.

The dataset is available on our course website (www.math.mun.ca/~sneddon/st2501) and is also directly accessible from Minitab. To access it in Minitab, ask Minitab to open a worksheet, and look in

Pub on `CS-thebe', then select the directory stat2501, followed by the file boston.MTW

The dataset contains 13 columns, which contain the variables listed below:

  1. CRIM: per capita crime rate by town.

  2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.

  3. INDUS: proportion of non-retail business acres per town.

  4. CHAS: Charles River variable (1 if tract bounds river; 0 otherwise).

  5. NOX: nitric oxides concentration (parts per 10 million).

  6. RM: average number of rooms per dwelling.

  7. AGE: proportion of owner-occupied units built prior to 1940.

  8. DIS: weighted distances to five Boston employment centres.

  9. RAD: index of accessibility to radial highways.

  10. TAX: full-value property-tax rate per $10,000.

  11. PTRATIO: pupil-teacher ratio by town.

  12. LSTAT: percentage of the population that is lower status.

  13. MEDV: median value of owner-occupied homes in $1000's.




Gary Sneddon 2003-11-06