MthStat 568/768 - Multivariate Statistical Analysis - Spring 2025
Homework 5
Due Wednesday, April 9
1. Consider the spambase data set, where emails are classified as spam or not,
and 57 feature variables are measured on each of them (see full description on
p. 259 of the book).
(a) Split the data set into training and test sets (roughly a 70/30 split).
Compute a logistic classifier using the training data. (There might be perfect
separation between the groups, but that should not matter as long as you
don't get NA coefficients.)
(b) Find the misclassification table for the test data and compute the
misclassification rate.
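As a rough illustration of the bookkeeping in parts (a) and (b), here is a minimal Python sketch using only the standard library. It does not fit the logistic classifier; it assumes the predicted labels for the test set have already been obtained from whatever model you fit. The function names (train_test_split_indices, misclassification_table, misclassification_rate) and the toy labels are hypothetical, chosen only for this example.

```python
import random
from collections import Counter


def train_test_split_indices(n, test_fraction=0.3, seed=0):
    """Shuffle the indices 0..n-1 and split them into (train, test) lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def misclassification_table(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def misclassification_rate(y_true, y_pred):
    """Fraction of observations whose predicted label differs from the truth."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


# Toy labels (1 = spam, 0 = not spam); these are made up, not spambase results.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]

train_idx, test_idx = train_test_split_indices(10)
table = misclassification_table(y_true, y_pred)
rate = misclassification_rate(y_true, y_pred)
```

The off-diagonal entries of the table, such as table[(1, 0)], are the misclassified cases, and the misclassification rate is their total divided by the size of the test set.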
2. Consider the pendigits data set, which contains samples of handwritten digits 0, 1, ..., 9.
The feature variables in this case are the (x, y) coordinates of the pen tip,
discretized at eight time points (see section 7.2.1 of the book for more details).
(a) Split the data set into training and test sets (roughly a 70/30 split).
Compute the multinomial logistic classifier using the training data.
(b) Construct the misclassification table for the test data and compute the
misclassification rate. Which digit seems to be the hardest to classify
correctly?
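One way to read "hardest to classify" in part (b) is as the digit with the largest per-class error rate, that is, the share of test samples of that digit that receive a different predicted label. The Python sketch below works under that assumption and again takes the predicted labels as given. The function names (per_class_error_rates, hardest_class) and the toy labels are hypothetical.

```python
from collections import Counter


def per_class_error_rates(y_true, y_pred):
    """For each true label, the fraction of its samples that were misclassified."""
    totals = Counter(y_true)
    errors = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return {label: errors[label] / totals[label] for label in totals}


def hardest_class(y_true, y_pred):
    """Label with the highest per-class error rate (ties go to the first seen)."""
    rates = per_class_error_rates(y_true, y_pred)
    return max(rates, key=rates.get)


# Toy labels with three digits; these are made up, not pendigits results.
y_true = [0, 0, 1, 1, 1, 7, 7, 7, 7]
y_pred = [0, 0, 1, 7, 1, 7, 1, 1, 7]

rates = per_class_error_rates(y_true, y_pred)
worst = hardest_class(y_true, y_pred)
```

Comparing rates rather than raw error counts matters when the digits are not equally represented in the test set, since a frequent digit can accumulate more errors while still being classified well.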