Homework 2.
Question 1. Decision Tree Classifier [10 Points]
Data: The zip file “hw2.q1.data.zip” contains 3 CSV files:
· “hw2.q1.train.csv” contains 10,000 rows and 26 columns. The first column ‘y’ is the output variable with 2 classes: 0, 1. The remaining 25 columns contain input features: x_1, …, x_25.
· “hw2.q1.test.csv” contains 2,000 rows and 26 columns. The first column ‘y’ is the output variable with 2 classes: 0, 1. The remaining 25 columns contain input features: x_1, …, x_25.
· “hw2.q1.new.csv” contains 30 rows and 26 columns. The first column ‘ID’ is an identifier for 30 unlabeled samples. The remaining 25 columns contain input features: x_1, …, x_25.
Task 1. [4 Points]
Use 5-fold cross-validation with the 10,000 labeled examples from “hw2.q1.train.csv” to determine the fewest rules with which a decision tree classifier can achieve a mean cross-validation accuracy of at least 0.96. Report the number of rules needed, the cross-validation accuracy obtained, and all the hyper-parameter values for the DecisionTreeClassifier.
Fewest number of rules needed: ………………. (to achieve mean cross-validation accuracy of at least 0.96)
Mean cross-validation accuracy: ………………………. (rounded to 4 decimal places)
Non-default hyper-parameter values for selected DecisionTreeClassifier model:
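One possible way to organize the search in Task 1 is sketched below. It is not the required method. It assumes that one "rule" corresponds to one leaf of the tree, so the number of rules is capped with the max_leaf_nodes hyper-parameter; the function name, the upper bound of 64 leaves, and random_state=0 are illustrative choices, not part of the assignment.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def fewest_leaves_for_target(X, y, target=0.96, cv=5, max_leaves=64, random_state=0):
    """Return (n_leaves, mean_cv_accuracy) for the smallest cap that reaches `target`.

    Assumption: one rule == one leaf, so the rule count is limited through
    `max_leaf_nodes`. Returns None if no cap up to `max_leaves` is sufficient.
    """
    for n_leaves in range(2, max_leaves + 1):  # max_leaf_nodes must be at least 2
        model = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=random_state)
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        if scores.mean() >= target:
            return n_leaves, round(float(scores.mean()), 4)
    return None
```

Here X would hold the 25 feature columns and y the ‘y’ column of “hw2.q1.train.csv”. Other hyper-parameters (for example criterion or min_samples_leaf) may change the answer, so the sketch covers only one dimension of the search.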
Task 2. [2 Points]
Train a DecisionTreeClassifier with the hyper-parameter values determined in Task 1 on all 10,000 training samples and use it to predict the output class ‘y’ for the 2,000 examples in “hw2.q1.test.csv”. Report the following:
· Accuracy on 2,000 test examples: …………………… (rounded to 4 decimal places)
· Classification report for the 2,000 test examples:
· Confusion matrix for the 2,000 test examples:
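The three items above can be produced with functions from sklearn.metrics. The following is a minimal sketch under the assumption that the features and labels have already been split into arrays; the helper name evaluate_on_test and the pandas loading lines in the comments are illustrative, not prescribed.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical loading step (assumes pandas is installed):
#   train = pd.read_csv("hw2.q1.train.csv")
#   X_train, y_train = train.drop(columns="y"), train["y"]


def evaluate_on_test(model, X_train, y_train, X_test, y_test):
    """Fit `model` on the full training set and summarize its test performance."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": round(float(accuracy_score(y_test, y_pred)), 4),
        "report": classification_report(y_test, y_pred, digits=4),
        # Rows are true classes, columns are predicted classes.
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
```

The same helper works unchanged for any Scikit-Learn classifier, since it relies only on fit and predict.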
Task 3. [2 Points]
Use the model trained in Task 2 to predict the output class ‘y’ for the 30 examples in “hw2.q1.new.csv”. Specify the predicted classes in the table below:
ID | predicted y
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
21 |
22 |
23 |
24 |
25 |
26 |
27 |
28 |
29 |
30 |
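A small helper such as the one below could fill in the table from a fitted model. It is only a sketch: the name prediction_table is hypothetical, and it assumes the ‘ID’ column has been separated from the feature columns before calling predict.

```python
def prediction_table(model, ids, X_new):
    """Return an 'ID | predicted y' table as text, one row per unlabeled sample."""
    predictions = model.predict(X_new)  # X_new must not contain the ID column
    lines = ["ID | predicted y"]
    for sample_id, label in zip(ids, predictions):
        lines.append(f"{sample_id} | {label}")
    return "\n".join(lines)
```

Passing the ‘ID’ values alongside the features keeps each prediction aligned with its row even if the file is not sorted.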
Task 4. [2 Points]
Of the 25 input variables, which ones are relevant for this classification task?
The following … input variables are relevant for this classification task: …………………
Display your trained decision tree:
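One way to approach Task 4 is sketched below. It assumes that "relevant" can be read as "used in at least one split of the fitted tree", which is an interpretation rather than something the assignment states; the helper names are illustrative. The tree is rendered as text with export_text; sklearn.tree.plot_tree is an alternative if a figure is preferred.

```python
from sklearn.tree import export_text


def relevant_features(fitted_tree, feature_names):
    """Names of the features with non-zero importance, i.e. used in some split."""
    importances = fitted_tree.feature_importances_
    return [name for name, imp in zip(feature_names, importances) if imp > 0]


def tree_as_text(fitted_tree, feature_names):
    """Plain-text rendering of the fitted tree, one line per node."""
    return export_text(fitted_tree, feature_names=list(feature_names))
```

Both helpers expect a DecisionTreeClassifier that has already been fitted, for example the model from Task 2.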
Question 2. Supervised machine learning classifiers [10 Points]
Data: The zip file “hw2.q2.data.zip” contains 3 CSV files:
· “hw2.q2.train.csv” contains 8,000 rows and 11 columns. The first column ‘y’ is the output variable with 4 classes: 0, 1, 2, 3. The remaining 10 columns contain input features: x1, …, x10.
· “hw2.q2.test.csv” contains 2,000 rows and 11 columns. The first column ‘y’ is the output variable with 4 classes: 0, 1, 2, 3. The remaining 10 columns contain input features: x1, …, x10.
· “hw2.q2.new.csv” contains 30 rows and 11 columns. The first column ‘ID’ is an identifier for 30 unlabeled samples. The remaining 10 columns contain input features: x1, …, x10.
Task 1. [6 Points]
Use 4-fold cross-validation with the 8,000 labeled examples from “hw2.q2.train.csv” to identify a classifier that achieves a mean cross-validation accuracy of at least 0.96. You should try several Scikit-Learn classifiers, including: GaussianNB, DecisionTreeClassifier, RandomForestClassifier, ExtraTreesClassifier, KNeighborsClassifier, LogisticRegression, SVC, and MLPClassifier. Try different hyper-parameter values for the better performing classifiers to obtain a good set of hyper-parameter values. Then select the best performing model. Report the following:
Selected model with hyper-parameter values:
Mean cross-validation accuracy: ………………………. (rounded to 4 decimal places)
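A first pass over the listed classifiers might look like the sketch below. It is a starting point, not the expected solution. The default hyper-parameters, the StandardScaler in front of the scale-sensitive models, the raised max_iter values, and the function names are all assumptions made for illustration.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def candidate_models(random_state=0):
    """The eight classifiers named in the task, mostly with default settings."""
    return {
        "GaussianNB": GaussianNB(),
        "DecisionTreeClassifier": DecisionTreeClassifier(random_state=random_state),
        "RandomForestClassifier": RandomForestClassifier(random_state=random_state),
        "ExtraTreesClassifier": ExtraTreesClassifier(random_state=random_state),
        # Scaling is an added choice for the distance- and gradient-based models.
        "KNeighborsClassifier": make_pipeline(StandardScaler(), KNeighborsClassifier()),
        "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "SVC": make_pipeline(StandardScaler(), SVC()),
        "MLPClassifier": make_pipeline(
            StandardScaler(), MLPClassifier(max_iter=500, random_state=random_state)
        ),
    }


def rank_by_cv_accuracy(models, X, y, cv=4):
    """Return (name, mean CV accuracy) pairs sorted from best to worst."""
    results = {
        name: float(cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean())
        for name, model in models.items()
    }
    return sorted(results.items(), key=lambda item: item[1], reverse=True)
```

The ranking only indicates which classifiers deserve further tuning. For those, GridSearchCV with cv=4 is one option for trying different hyper-parameter values.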
Task 2. [2 Points]
Train the classifier with the hyper-parameter values determined in Task 1 on all 8,000 training samples and use it to predict the output class ‘y’ for the 2,000 examples in “hw2.q2.test.csv”. Report the following:
· Accuracy on 2,000 test examples: …………………… (rounded to 4 decimal places)
· Classification report for the 2,000 test examples:
· Confusion matrix for the 2,000 test examples:
Task 3. [2 Points]
Use the model trained in Task 2 to predict the output class ‘y’ for the 30 examples in “hw2.q2.new.csv”. Specify the predicted classes in the table below:
ID | predicted y
ID_001 |
ID_002 |
ID_003 |
ID_004 |
ID_005 |
ID_006 |
ID_007 |
ID_008 |
ID_009 |
ID_010 |
ID_011 |
ID_012 |
ID_013 |
ID_014 |
ID_015 |
ID_016 |
ID_017 |
ID_018 |
ID_019 |
ID_020 |
ID_021 |
ID_022 |
ID_023 |
ID_024 |
ID_025 |
ID_026 |
ID_027 |
ID_028 |
ID_029 |
ID_030 |