Q1. K-Nearest Neighbors
1. Implement a KNN based classifier to predict digits from images of handwritten
digits in the dataset.
2. Featurize the images as vectors that can be used for classification.
3. Experiment with different values of K (number of neighbors).
4. Experiment with different distance measures, such as Euclidean distance and
Manhattan distance.
5. Report accuracy score, F1-score, confusion matrix and any other metrics you find
useful.
6. Implement baselines such as random guessing/majority voting and compare
performance. Also, report the performance of scikit-learn’s kNN classifier. Report your
findings.
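As a starting point for parts 1, 3 and 4, the sketch below shows one possible shape for a from-scratch KNN classifier with a pluggable distance function. The function names (`euclidean`, `manhattan`, `knn_predict`) and the toy data are illustrative assumptions, not part of the provided material; real inputs would be the flattened image vectors from part 2.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    """Return the majority label among the k training points nearest to query."""
    # Sort training indices by distance to the query and keep the k closest.
    order = sorted(range(len(train_X)), key=lambda i: distance(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D vectors standing in for flattened pixel vectors (hypothetical data).
train_X = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 8], [8, 9]]
train_y = [0, 0, 0, 1, 1, 1]
pred_l2 = knn_predict(train_X, train_y, [1, 1], k=3, distance=euclidean)
pred_l1 = knn_predict(train_X, train_y, [8, 8], k=3, distance=manhattan)
```

Passing the distance as an argument makes it easy to rerun the same experiment for each value of K and each distance measure.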
Q2. K-Nearest Neighbors
1. Implement a KNN based classifier to classify the given set of features in the Mushroom
Database. Missing data (denoted by "?") must be handled appropriately.
2. Choose an appropriate distance measure for categorical features.
3. Experiment with different values of K (number of neighbors).
4. Report accuracy score, F1-score, confusion matrix and any other metrics you find
useful.
5. Implement baselines such as random guessing/majority voting and compare
performance. Also, report the performance of scikit-learn’s kNN classifier. Report your
findings.
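One possible way to approach parts 1 and 2 is sketched below: missing values are replaced by the most frequent value in their column, and Hamming distance (the number of mismatching attributes) is used for categorical features. Both choices are assumptions made for illustration; other imputation schemes and distance measures are equally valid. The helper names and the toy rows are hypothetical.

```python
from collections import Counter

MISSING = "?"


def impute_mode(rows):
    """Replace each "?" with the most frequent observed value in its column."""
    modes = []
    for col in zip(*rows):
        observed = [v for v in col if v != MISSING]
        # Fall back to the marker itself if a column has no observed values.
        modes.append(Counter(observed).most_common(1)[0][0] if observed else MISSING)
    return [[modes[j] if v == MISSING else v for j, v in enumerate(row)] for row in rows]


def hamming(a, b):
    """Number of positions at which two categorical vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy rows with single-letter category codes (hypothetical data).
rows = [["x", "s", "?"], ["x", "y", "c"], ["b", "s", "c"]]
filled = impute_mode(rows)
d = hamming(filled[0], filled[1])
```

In a real experiment the column modes would be computed on the training split only and then reused for the test split.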
Q3. Decision Tree
1. Implement a decision tree to predict housing prices for the given dataset using the
available features.
2. The various attributes of the data are explained in the file data description.txt.
Note that some attributes are categorical while others are continuous.
3. Feel free to use Python libraries such as binarytree or any other library in Python
to implement the binary tree. However, you cannot use libraries like scikit-learn
which automatically create the decision tree for you.
4. Use variance reduction as the criterion for choosing the split in the decision tree.
Experiment with different approaches to decide when to terminate the tree.
5. Report metrics such as Mean Squared Error (MSE) and Mean Absolute Error (MAE)
along with any other metrics that you feel may be useful.
6. For feature engineering, you may consider normalizing/standardizing the data.
SMAI (CSE/ECE 478)
7. Implement simple baselines such as always predicting the mean/median of the
training data. Also, compare the performance against scikit-learn’s decision tree. Report
your findings.
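The variance-reduction criterion from part 4 can be illustrated with a small sketch for a single continuous feature: the gain of a split is the parent variance minus the size-weighted variance of the two children. The function names and the toy data are hypothetical, and trying midpoints between sorted unique values is only one of several reasonable ways to enumerate candidate thresholds.

```python
def variance(values):
    """Population variance; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - weighted


def best_threshold(xs, ys):
    """Try midpoints between sorted unique x values; return (threshold, gain)."""
    best = (None, 0.0)
    unique = sorted(set(xs))
    for lo, hi in zip(unique, unique[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        gain = variance_reduction(ys, left, right)
        if gain > best[1]:
            best = (t, gain)
    return best


# Toy feature and target values (hypothetical data).
xs = [1, 2, 3, 10, 11, 12]
ys = [100.0, 100.0, 100.0, 200.0, 200.0, 200.0]
threshold, gain = best_threshold(xs, ys)
```

A termination rule (maximum depth, minimum samples per leaf, or a minimum gain) can then be expressed as a check on the returned gain and on the sizes of the two children.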
Q4. Gaussian Mixture Models Clustering
1. You are given 3 data files (dataset1.pkl, dataset2.pkl, dataset3.pkl) and 1 code file,
gmm.py. The code consists of:
(a) A function to load a dataset.
(b) A function to save a dataset.
(c) A class GMM1D, which consists of multiple functions.
2. Load the dataset.
3. Use inbuilt sklearn functions to cluster (GMM clustering) the points and plot them.
Also report the number of iterations taken to converge.
4. In GMM1D, fill in the blanks with code and cluster the points. Plot the result for
each iteration.
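For orientation, the sketch below shows one expectation-maximization (EM) iteration for a 1-D Gaussian mixture, written with NumPy. It does not reproduce the structure of the provided GMM1D class, which is not shown here; `gaussian_pdf`, `em_step`, the synthetic data and the initial parameters are all assumptions made for illustration. For part 3, scikit-learn's `GaussianMixture` exposes the number of EM iterations used through its `n_iter_` attribute after fitting.

```python
import numpy as np


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at each element of x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def em_step(x, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture; returns updated parameters."""
    # E-step: responsibilities resp[i, k] = P(component k | x_i).
    dens = np.stack(
        [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)],
        axis=1,
    )
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate each component from its soft assignments.
    nk = resp.sum(axis=0)
    new_weights = nk / len(x)
    new_means = (resp * x[:, None]).sum(axis=0) / nk
    new_vars = (resp * (x[:, None] - new_means) ** 2).sum(axis=0) / nk
    return new_weights, new_means, new_vars


# Synthetic two-cluster data standing in for one of the .pkl datasets.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-5.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])
weights = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
variances = np.array([1.0, 1.0])
for _ in range(20):
    weights, means, variances = em_step(x, weights, means, variances)
```

Plotting the fitted densities inside the loop would give the per-iteration plots asked for in part 4.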
Q5. Linear Regression
1. You are given a NASA data set, obtained from a series of aerodynamic and acoustic tests
of two and three-dimensional airfoil blade sections. Implement a linear regression
model from scratch using gradient descent to predict the scaled sound pressure level.
The various attributes of the data are explained in the file description.txt.
2. Using an appropriate plot, show how the number of iterations affects the mean squared
error for the above model under the following conditions:
(a) Using 3 different initial regression coefficients (weights) for a fixed value of the
learning parameter (all 3 in a single plot).
SMAI (CSE/ECE 471)
Assignment 2 - Page 3 of 4
Posted: 16/02/2020
(b) Using 3 different learning parameters for some fixed initial regression
coefficients (all 3 in a single plot).
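A minimal sketch of batch gradient descent that records the mean squared error at every iteration is given below, assuming that "learning parameter" refers to the learning rate. The function name `gradient_descent`, the synthetic data and the three learning rates are illustrative assumptions; the real inputs would be the airfoil features.

```python
import numpy as np


def gradient_descent(X, y, w0, lr, n_iters):
    """Batch gradient descent on MSE; returns final weights and the MSE per iteration."""
    w = np.asarray(w0, dtype=float).copy()
    history = []
    for _ in range(n_iters):
        residual = X @ w - y
        history.append(float(np.mean(residual ** 2)))
        # Gradient of mean((Xw - y)^2) with respect to w.
        w -= lr * (2.0 / len(y)) * (X.T @ residual)
    return w, history


# Synthetic data standing in for the real features: y = 1 + 2*x, no noise.
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])  # first column is the bias term
y = 1.0 + 2.0 * x

# One MSE curve per learning rate, all starting from the same initial weights.
curves = {
    lr: gradient_descent(X, y, w0=[0.0, 0.0], lr=lr, n_iters=2000)[1]
    for lr in (0.01, 0.1, 0.5)
}
w_final, _ = gradient_descent(X, y, w0=[0.0, 0.0], lr=0.5, n_iters=2000)
```

Each list in `curves` can be drawn against the iteration index on a single set of axes; the same loop with a fixed learning rate and three different `w0` values covers condition (a).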
Q6. Linear Regression
1. Given a dataset containing historical weather information for a certain area,
implement a linear regression model from scratch using gradient descent to predict the
apparent temperature. The various attributes of the data are explained in the file
description.txt. Note that the attributes are text, categorical as well as continuous.
Note: Test data will have 10 columns. The apparent temperature column, which lies
between the other columns, will be missing.
2. Compare the performance of different error functions (mean squared error, mean
absolute error, mean absolute percentage error) and explain the reasons for the
observed behaviour.
3. Analyse and report the behaviour of the regression coefficients (for example: sign
of coefficients, value of coefficients, etc.) and support it with appropriate plots as
necessary.
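The three error functions named in part 2 can be written down as follows. This is a sketch with hypothetical function names and toy values; the `eps` guard in `mape` is an assumption added here because the percentage error is unstable when a true value is close to zero, which can matter for temperature targets.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: penalizes large residuals quadratically."""
    return float(np.mean((y_true - y_pred) ** 2))


def mae(y_true, y_pred):
    """Mean absolute error: penalizes all residuals linearly."""
    return float(np.mean(np.abs(y_true - y_pred)))


def mape(y_true, y_pred, eps=1e-8):
    """Mean absolute percentage error in percent; eps avoids division by zero."""
    denom = np.maximum(np.abs(y_true), eps)
    return float(np.mean(np.abs(y_true - y_pred) / denom) * 100.0)


# Toy targets and predictions (hypothetical values).
y_true = np.array([10.0, 20.0, 40.0])
y_pred = np.array([12.0, 18.0, 40.0])
scores = {"mse": mse(y_true, y_pred), "mae": mae(y_true, y_pred), "mape": mape(y_true, y_pred)}
```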
Q7. Support Vector Machine
1. Given a dataset which contains excerpts of text written by some author and the
corresponding author tag, implement an SVM classifier to predict the author tag
of the test text excerpts.
2. For the feature extraction of the text segments, either use Vectorizers provided in
sklearn or use pre-trained word embedding models. (A code snippet for the usage of
word embedding models is given here.)
3. Visualize the feature vectors and see if you could find some pattern.
4. Tweak different parameters of the Linear SVM and report the results.
5. Experiment with different kernels for classification and report the results.
6. Report accuracy score, F1-score, confusion matrix and any other metrics you find
useful.
7. (Bonus-20 points) You may do some pre-processing on the textual data to improve
your classifier. Explain why the score has improved, if it did.
8. Link to the dataset has been provided in the common link.
9. You can use inbuilt functions for SVM.
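One way the pieces of this question could fit together is sketched below with scikit-learn's `TfidfVectorizer` and `LinearSVC` chained in a pipeline. The excerpts and author tags are made up purely for illustration, and TF-IDF is only one of the vectorizer options the question allows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up excerpts and author tags, purely for illustration.
texts = [
    "the raven croaked upon the midnight dreary",
    "a dreary midnight and a croaking raven",
    "the ship sailed across the calm blue sea",
    "calm sea and a ship with white sails",
]
authors = ["A", "A", "B", "B"]

# TF-IDF turns each excerpt into a sparse vector; LinearSVC fits a linear SVM on it.
model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
model.fit(texts, authors)
predicted = model.predict(["the raven at midnight", "a ship on the sea"])
```

Swapping `LinearSVC` for `sklearn.svm.SVC(kernel=...)` in the same pipeline is one way to run the kernel experiments from part 5.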
Q8. Clustering
1. Given a dataset of documents with content from 5 different fields (namely business,
entertainment, politics, sport, and tech), cluster them using any clustering
algorithm of your choice.
2. Do not use any libraries for this part. You are expected to code your clustering
algorithm from scratch.
3. For feature extraction, you can use the vectorizers provided by sklearn or the
pre-trained embeddings. (A code snippet for the usage of these embeddings has
been provided in the previous question.)
4. You might have to perform some pre-processing on the raw documents before you
apply your algorithm.
5. We have provided ground truth document tags for the documents. Report accuracy
score on these documents.
6. We will test your score on the documents for which the tags have not been provided.
7. In the dataset, the number after the ’ ’ symbol in the file name denotes the cluster
label.
8. The code file must be a Python (.py) file. You are expected to define a class for each
question which is compatible with the test.py file provided here. Make sure your
code can be run by "python test.py". Double check this.
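As one example of a clustering algorithm written from scratch, the sketch below implements k-means (Lloyd's algorithm). It assumes that NumPy is acceptable for array arithmetic even though clustering libraries are not; the function name `kmeans` and the toy points are hypothetical, and real inputs would be the document vectors from part 3.

```python
import numpy as np


def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Toy 2-D points standing in for document vectors (hypothetical data).
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans(X, k=2)
```

Because cluster indices are arbitrary, computing accuracy against the ground-truth tags requires first mapping each cluster to a tag, for example the most common tag among its members.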
Q9. Image Classification
1. Given the CIFAR-10 dataset, implement a linear SVM classifier to predict the classes
of the test images.
2. Featurize the images as vectors that can be used for classification.
3. Report your observations for different values of C. Explain the significance of C.
4. Compare and contrast the classifier with the KNN classifier built in the previous
assignment.
5. Report accuracy score, F1-score, confusion matrix and any other metrics you find
useful.
6. Report the support vector images in each case.
7. (Bonus-20 points) You may do some processing on the train set to improve your
scores on linear SVM. Report your changes clearly.
8. You can use inbuilt functions for SVM.
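For part 2, the simplest featurization is to flatten each 32x32 RGB image into a 3072-dimensional vector and rescale the pixel values. The sketch below assumes the images have already been loaded into an array of shape (N, 32, 32, 3) with uint8 values; the function name `featurize` and the random batch are hypothetical.

```python
import numpy as np


def featurize(images):
    """Flatten (N, 32, 32, 3) uint8 images to (N, 3072) floats in [0, 1]."""
    images = np.asarray(images)
    return images.reshape(len(images), -1).astype(np.float64) / 255.0


# Random pixels standing in for a batch of CIFAR-10 images.
rng = np.random.default_rng(0)
fake_batch = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
features = featurize(fake_batch)
```

The resulting matrix can be passed directly to an inbuilt linear SVM and refit for each value of C.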