In this assignment you will compare two classification methods on a dataset that is studied extensively in multivariate analysis. The data consist of 4 measurements on 50 flowers from each of 3 species of iris (Setosa, Versicolor and Virginica). The measurements (sepal length and width, and petal length and width) are recorded in centimeters.
By typing
attach("/users/math/faculty/sneddon/DATA/.Data")
in Splus, you can access these data in the file iris.dat.
First, create a training sample using the first 25 flowers of each species:
> iris.train <- rbind(iris.dat[1:25,], iris.dat[51:75,],
iris.dat[101:125,])
Now create a validation sample using the rest of the data:
> iris.val <- rbind(iris.dat[26:50,], iris.dat[76:100,],
iris.dat[126:150,])
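For each species in the training sample, find the sample mean vector and the sample covariance matrix. For example, for Setosa:
setosa.mean <- matrix(apply(iris.train[1:25,2:5],2,mean))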
setosa.var <- var(iris.train[1:25,2:5])
Also, find the pooled covariance matrix estimate, and save it as pooled.var.
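The pooled estimate weights each species' covariance matrix by its degrees of freedom: S_pooled = [(n1-1)S1 + (n2-1)S2 + (n3-1)S3] / (n1+n2+n3-3). The following is a minimal sketch of that calculation, written in R rather than Splus. It makes several assumptions: R's built-in iris data frame stands in for iris.dat, the species label is taken to sit in column 1 with the measurements in columns 2 to 5 (as the indexing 2:5 above suggests), and the names ver.var and vir.var are hypothetical.

```r
# Stand-in for iris.dat (assumed layout: species in column 1,
# the four measurements in columns 2-5).
iris.dat <- data.frame(Species = iris$Species, iris[, 1:4])
iris.train <- rbind(iris.dat[1:25, ], iris.dat[51:75, ],
                    iris.dat[101:125, ])

# Sample covariance matrix for each species in the training sample.
# ver.var and vir.var are hypothetical names chosen to match setosa.var.
setosa.var <- var(iris.train[1:25, 2:5])
ver.var <- var(iris.train[26:50, 2:5])
vir.var <- var(iris.train[51:75, 2:5])

# Pooled estimate: weight each matrix by (n_i - 1) and divide by
# the total sample size minus the number of groups.
n <- c(25, 25, 25)
pooled.var <- ((n[1] - 1) * setosa.var +
               (n[2] - 1) * ver.var +
               (n[3] - 1) * vir.var) / (sum(n) - length(n))
```

Because the three training groups have the same size here, the pooled matrix reduces to the simple average of the three covariance matrices.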
Does it seem reasonable to assume that the population covariance matrices are equal for all 3 species of flowers? Do not perform a formal hypothesis test to answer this question.
Based on the classification rule, is this flower classified correctly? Explain.
iris.discr <- prog.discr(iris.val, setosa.mean, ver.mean, vir.mean, pooled.var)
The output shows the species to which each flower is assigned, the number classified correctly, and the number of observations in the validation sample.
Report the number of flowers classified correctly, and the percentage misclassified.
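The function prog.discr is supplied with the course and its internals are not shown here. As a rough cross-check, a linear discriminant rule with equal priors can be sketched by assigning each validation flower to the species whose training mean is closest in Mahalanobis distance under the pooled covariance matrix. This is an illustrative sketch in R, not the prog.discr implementation. It assumes R's built-in iris data frame as a stand-in for iris.dat, with the species label in column 1. The names groups, means, covs, d2, predicted, n.correct and pct.misclassified are hypothetical.

```r
# Stand-in for iris.dat (assumed layout: species in column 1).
iris.dat <- data.frame(Species = iris$Species, iris[, 1:4])
iris.train <- rbind(iris.dat[1:25, ], iris.dat[51:75, ],
                    iris.dat[101:125, ])
iris.val <- rbind(iris.dat[26:50, ], iris.dat[76:100, ],
                  iris.dat[126:150, ])

# Training rows belonging to each species.
groups <- list(setosa = 1:25, versicolor = 26:50, virginica = 51:75)

# Mean vector and covariance matrix per species, then the pooled matrix.
means <- lapply(groups, function(idx) colMeans(iris.train[idx, 2:5]))
covs <- lapply(groups, function(idx) var(iris.train[idx, 2:5]))
pooled.var <- (24 * covs[[1]] + 24 * covs[[2]] + 24 * covs[[3]]) / 72

# Squared Mahalanobis distance from every validation flower to each mean;
# d2 has one row per flower and one column per species.
x.val <- as.matrix(iris.val[, 2:5])
d2 <- sapply(means, function(m) mahalanobis(x.val, center = m,
                                            cov = pooled.var))

# Assign each flower to the nearest species mean.
predicted <- names(means)[apply(d2, 1, which.min)]

# Number classified correctly and percentage misclassified.
n.val <- nrow(iris.val)
n.correct <- sum(predicted == as.character(iris.val$Species))
pct.misclassified <- 100 * (n.val - n.correct) / n.val
```

With equal priors and a common covariance matrix, choosing the smallest Mahalanobis distance is equivalent to choosing the largest linear discriminant score, which is why the sketch uses distances directly.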
iris.tree <- tree(iris.train)
Now, create a plot of the classification tree. You can look at the help file
help(plot.tree)
for information on how to do this.
predict.tree(iris.tree,newdata=iris.val,type="class")
Report the number of flowers classified correctly, and the percentage misclassified. Do the two methods give similar results?
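To compare the two methods on the same footing, it can help to tabulate predicted against actual species in the same way for each. The helper below is a hypothetical sketch in R (classification.summary is not a course or library function), and the labels in the example are toy values for illustration only, not results from the iris data.

```r
# Hypothetical helper: summarise predicted vs. actual class labels.
classification.summary <- function(predicted, actual) {
  predicted <- as.character(predicted)
  actual <- as.character(actual)
  n <- length(actual)
  n.correct <- sum(predicted == actual)
  list(confusion = table(actual = actual, predicted = predicted),
       n.correct = n.correct,
       pct.misclassified = 100 * (n - n.correct) / n)
}

# Toy labels for illustration only.
actual <- c("setosa", "setosa", "versicolor", "versicolor",
            "virginica", "virginica")
predicted <- c("setosa", "setosa", "versicolor", "virginica",
               "virginica", "virginica")
toy <- classification.summary(predicted, actual)
print(toy$confusion)
```

In the assignment, the predicted classes returned by predict.tree with type="class" could be passed as predicted, with the species column of iris.val (assumed to be column 1) as actual. The off-diagonal cells of the table show which species are confused with each other.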