Stats 6590: Assignment #4

Oct. 10, 2002. Due Date: Oct. 23, 2002




In this assignment you will compare two classification methods on a dataset that is studied extensively in multivariate analysis. The data consist of four measurements on 50 flowers from each of 3 species of iris (Setosa, Versicolor and Virginica). The measurements (sepal length and width, and petal length and width) are recorded in centimeters.

By typing

attach("/users/math/faculty/sneddon/DATA/.Data")
in Splus, you can access this data in the file iris.dat.

  1. Create a matrix or data frame for a training sample of the data. For now, this training sample will contain the first 50% of the flowers of each species (25 per species). One way to do this is by typing:

    > iris.train <- rbind(iris.dat[1:25,], iris.dat[51:75,],
                       iris.dat[101:125,])
    

    Now create a validation sample using the rest of the data:

    > iris.val <- rbind(iris.dat[26:50,], iris.dat[76:100,],
                       iris.dat[126:150,])
    

  2. Find the sample mean vectors for each of the 3 flower species in the training sample. As an example, you can do this for the Setosa flowers as follows:

     setosa.mean <-  matrix(apply(iris.train[1:25,2:5],2,mean))
    

  3. Find the sample covariance matrices for each of the 3 flower species in the training sample. As an example, you can do this for the Setosa flowers as follows:

     setosa.var <- var(iris.train[1:25,2:5])
    

    Also, find the pooled covariance matrix estimate, and save it as pooled.var.

    Does it seem reasonable to assume that the population covariance matrices are equal for all 3 species of flowers? Do not perform a formal hypothesis test to answer this question.
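
    The pooled estimate weights each group's covariance matrix by its degrees of freedom, n_i - 1, and divides by the total degrees of freedom. The following is a minimal sketch in R-compatible S code, not the course's own solution: pool.cov is a hypothetical helper name, and the two 2 x 2 matrices are made-up toy values rather than iris results.

    ```r
    # Hypothetical helper (not part of the course files):
    # pooled covariance from a list of group covariance matrices.
    pool.cov <- function(covs, n) {
      # covs: list of p x p sample covariance matrices
      # n:    vector of group sample sizes, in the same order
      num <- 0
      for (i in seq(along = covs)) {
        num <- num + (n[i] - 1) * covs[[i]]   # weight by degrees of freedom
      }
      num / (sum(n) - length(n))              # divide by total degrees of freedom
    }

    # Toy check with two made-up covariance matrices and equal group sizes
    S1 <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
    S2 <- matrix(c(4, 1.5, 1.5, 3), 2, 2)
    pooled.toy <- pool.cov(list(S1, S2), n = c(25, 25))
    ```

    With equal group sizes the pooled estimate reduces to the simple average of the group matrices. For the assignment, the same call would presumably take the three species covariance matrices and n = c(25, 25, 25).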

  4. Use Splus to evaluate the linear discriminant analysis function

    \begin{displaymath}
    D_{i}^{2}({\bf x}_{0}) =
    ({\bf x}_{0} - \bar{\bf x}_{i})^{\prime} {\bf S}_{p}^{-1} ({\bf x}_{0} - \bar{\bf x}_{i}),
    \qquad i = 1, 2, 3
    \end{displaymath}

    where ${\bf x}_{0}$ is the first observation in your validation sample.

    Based on $D_{i}^{2}({\bf x}_{0})$, is this flower classified correctly? Explain.
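
    As a rough illustration of how such squared distances can be evaluated and compared, here is a minimal sketch in R-compatible S code. It uses made-up two-variable group means and a made-up covariance matrix, not the iris data, and sq.dist is a hypothetical helper name. Assuming equal prior probabilities and misclassification costs, an observation is assigned to the group with the smallest squared distance.

    ```r
    # Hypothetical helper: squared distance from x0 to a group mean xbar,
    # using covariance matrix S (all inputs here are toy values).
    sq.dist <- function(x0, xbar, S) {
      d <- matrix(x0 - xbar, ncol = 1)     # difference as a column vector
      as.numeric(t(d) %*% solve(S, d))     # d' S^{-1} d without forming S^{-1}
    }

    S <- matrix(c(2, 0, 0, 0.5), 2, 2)     # toy covariance matrix
    means <- list(a = c(0, 0), b = c(2, 1), c = c(4, 0))   # toy group means
    x0 <- c(1.8, 0.9)                      # toy observation to classify

    # One squared distance per group; the list elements are passed as xbar
    d2 <- sapply(means, sq.dist, x0 = x0, S = S)
    nearest <- names(d2)[d2 == min(d2)]    # group with the smallest distance
    ```

    In R, the built-in mahalanobis(x, center, cov) returns the same squared distance and could be used as a cross-check.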

  5. Run the program prog.discr to classify all 75 flowers in the validation set, using:

     iris.discr <- prog.discr(iris.val, setosa.mean, ver.mean, vir.mean,
    			  pooled.var)
    

    The output shows which species each flower was classified as by $D_{i}^{2}$, the number classified correctly, and the number of observations in the validation sample.

    Report the number of flowers classified correctly, and the percentage misclassified.

  6. Use a classification tree to classify the observations. First, construct the tree using the training sample:

    iris.tree <- tree(iris.train)

    Now, create a plot of the classification tree. You can look at the help file

    help(plot.tree)

    for information on how to do this.

  7. Use the tree you found above to classify the first flower in your validation sample. Does the classification tree classify it as setosa? Explain.

  8. Use the tree to classify all observations in your validation sample using:

    predict.tree(iris.tree,newdata=iris.val,type="class")

    Report the number of flowers classified correctly, and the percentage misclassified. Do the two methods give similar results?
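
    One way to turn a vector of predicted classes into the requested summaries is sketched below in R-compatible S code. The label vectors are made-up toy values; in the assignment, truth would presumably come from the species column of iris.val and pred from the classifier output.

    ```r
    # Toy labels (made up for illustration, not iris results)
    truth <- c("setosa", "setosa", "versicolor", "versicolor",
               "virginica", "virginica")
    pred  <- c("setosa", "setosa", "versicolor", "virginica",
               "virginica", "virginica")

    conf.mat  <- table(truth, pred)          # cross-tabulation of true vs. predicted
    n.correct <- sum(truth == pred)          # number classified correctly
    pct.wrong <- 100 * (1 - n.correct / length(truth))   # percentage misclassified
    ```

    The diagonal of conf.mat counts the correct classifications for each species, so the off-diagonal cells show which species are being confused with each other.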

  9. Repeat your analyses with a training sample of 10 observations of each species and a validation sample of 40 observations of each species. How do the results compare?
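
    To avoid retyping row ranges when the split changes, the row indices can be generated from the training size. This is a minimal sketch in R-compatible S code, assuming (as in the steps above) that the data hold 50 consecutive rows per species; first.rows is a hypothetical helper name.

    ```r
    # Hypothetical helper: rows of the first n.train flowers of each species,
    # assuming n.per consecutive rows per species.
    first.rows <- function(n.train, n.per = 50, n.groups = 3) {
      starts <- (0:(n.groups - 1)) * n.per          # 0, 50, 100
      as.vector(sapply(starts, function(s) s + 1:n.train))
    }

    train.idx <- first.rows(10)                # rows 1-10, 51-60, 101-110
    val.idx   <- setdiff(1:150, train.idx)     # the remaining 120 rows

    # With the course data attached, the samples would then be
    #   iris.train <- iris.dat[train.idx, ]
    #   iris.val   <- iris.dat[val.idx, ]
    ```

    Note that row ranges hard-coded in the earlier steps, such as iris.train[1:25,2:5] for Setosa, would also need to change (to 1:10, 11:20 and 21:30) for the smaller training sample.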




Gary Sneddon 2002-10-10