CSE 472: Machine Learning Sessional

This is a course from my L4T2 (final term). As the name suggests, it is a course on machine learning, in which we implement different machine learning algorithms from scratch.

Assignment 1: Decision Tree and AdaBoost for Classification

  • Decision Tree classifier
  • Ensemble learning with the AdaBoost algorithm using decision stumps (sketched below)
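
A minimal sketch of the AdaBoost loop over one-feature threshold stumps is below. It assumes labels in $\{-1, +1\}$; the function names and data layout are illustrative choices for this README, not the assignment's actual interface.

```python
import numpy as np

def adaboost_fit(X, y, n_rounds=20):
    """Boost one-feature threshold stumps; y must contain only -1 and +1."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                      # uniform sample weights
    stumps = []                                  # (feature, threshold, polarity, alpha)
    for _ in range(n_rounds):
        best = None
        # Exhaustively search (feature, threshold, polarity) for the lowest weighted error.
        for j in range(d):
            for t in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = np.where(X[:, j] <= t, s, -s)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, s)
        err, j, t, s = best
        err = max(err, 1e-12)                    # guard against division by zero
        alpha = 0.5 * np.log((1 - err) / err)    # vote weight of this stump
        pred = np.where(X[:, j] <= t, s, -s)
        w *= np.exp(-alpha * y * pred)           # up-weight misclassified samples
        w /= w.sum()
        stumps.append((j, t, s, alpha))
    return stumps

def adaboost_predict(stumps, X):
    """Weighted vote of all stumps; the sign gives the predicted class."""
    score = sum(a * np.where(X[:, j] <= t, s, -s) for j, t, s, a in stumps)
    return np.sign(score)
```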

Language Used: Python

Assignment 2: k-Nearest Neighbor and Naive Bayes for Document Classification

  • k-NN algorithm for text classification

    • Hamming distance: each document is represented as a boolean vector, where each bit represents whether the corresponding word appears in the document.
    • Euclidean distance: each document is represented as a numeric vector, where each number represents how many times the corresponding word appears in the document.
    • Cosine similarity with TF-IDF weights: each document is represented by a numeric vector as in the Euclidean case, but now each number is the TF-IDF (Term Frequency–Inverse Document Frequency) weight of the corresponding word. The similarity between two documents is the dot product of their vectors divided by the product of their norms.

    Experimented with $k=1,3,5$ and the different distance metrics above (a combined sketch follows).
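
As a rough illustration, the three representations and metrics can be sketched as follows; `knn_predict` and the vector layouts are assumptions made for this example, not the assignment's code.

```python
import numpy as np

def hamming(a, b):
    # a, b: boolean presence/absence vectors
    return np.sum(a != b)

def euclidean(a, b):
    # a, b: raw term-count vectors
    return np.linalg.norm(a - b)

def cosine_sim(a, b):
    # a, b: TF-IDF weight vectors; larger means more similar
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def knn_predict(train_X, train_y, x, k=3, metric=euclidean, similarity=False):
    scores = np.array([metric(x, t) for t in train_X])
    order = np.argsort(-scores if similarity else scores)  # nearest first
    top = [train_y[i] for i in order[:k]]
    return max(set(top), key=top.count)                    # majority vote
```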

  • Naive Bayes for text classification

    • Treated all the words of a document as independent, computed the probability of the document belonging to each topic, and picked the topic with the highest probability score.
    • Tried $10$ different smoothing factors and calculated the accuracy for each to find the best-performing one (see the sketch below).
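
A minimal sketch of multinomial Naive Bayes with a tunable smoothing factor is given below; the (documents × vocabulary) count-matrix layout and function names are assumptions made for the example.

```python
import numpy as np

def nb_train(counts, labels, n_classes, alpha=1.0):
    """counts: (N, V) term-count matrix; alpha: the smoothing factor."""
    V = counts.shape[1]
    log_prior = np.zeros(n_classes)
    log_like = np.zeros((n_classes, V))
    for c in range(n_classes):
        rows = counts[labels == c]
        log_prior[c] = np.log(len(rows) / len(counts))
        totals = rows.sum(axis=0)
        # Smoothed word probabilities: (count + alpha) / (total + alpha * V)
        log_like[c] = np.log((totals + alpha) / (totals.sum() + alpha * V))
    return log_prior, log_like

def nb_predict(log_prior, log_like, doc_counts):
    # Word independence: the document score is a sum of per-word log probabilities.
    return int(np.argmax(log_prior + log_like @ doc_counts))
```
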
  • T-test for comparison

    • Ran $50$ iterations with test documents.
    • Compared k-NN and Naive Bayes using a paired t-test at significance levels $\alpha = 0.005, 0.01, 0.05$ (sketched below).
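
The comparison itself can be done with SciPy's paired t-test; the accuracy arrays below are synthetic placeholders standing in for the $50$ per-iteration accuracies of each classifier.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
acc_knn = rng.normal(0.80, 0.02, 50)   # placeholder: 50 k-NN accuracies
acc_nb = rng.normal(0.82, 0.02, 50)    # placeholder: 50 Naive Bayes accuracies

t_stat, p_value = stats.ttest_rel(acc_knn, acc_nb)  # paired t-test
for alpha in (0.005, 0.01, 0.05):
    verdict = "significant" if p_value < alpha else "not significant"
    print(f"alpha={alpha}: p={p_value:.4f} -> difference is {verdict}")
```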

Language Used: Python

Assignment 3: Dimensionality Reduction using Principal Component Analysis and Clustering using Expectation-maximization Algorithm

  • Principal Component Analysis (PCA) implementation: let $X$ be an $N \times D$ data matrix, where $D$ is the number of dimensions and $N$ is the number of instances.

    • Standardize the data.

    • Construct the covariance matrix.

    • Compute the eigenvectors and eigenvalues of the covariance matrix.

    • Project the data along the two eigenvectors corresponding to the two largest eigenvalues (see the sketch after this list).
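
The four steps compress to a few lines of NumPy; the helper name `pca_2d` is mine, not the assignment's.

```python
import numpy as np

def pca_2d(X):
    """Project an N x D data matrix onto its top-2 principal components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize
    cov = np.cov(Z, rowvar=False)                      # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)             # eigh: covariance is symmetric
    top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]   # two largest eigenvalues
    return Z @ top2                                    # N x 2 projection
```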

  • Expectation-maximization (EM) algorithm implementation: we then cluster the two-dimensional data, assuming a Gaussian mixture model, using the EM algorithm. A vector $x$ of dimension $D$ can be generated from any one of $K$ Gaussian distributions, where the probability of selecting Gaussian distribution $k$ is the mixing coefficient $w_k$, with

$$\sum_{k=1}^{K} w_k = 1, \qquad 0 \le w_k \le 1,$$

and the probability of generating $x$ from Gaussian distribution $k$ is

$$\mathcal{N}(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_k|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k)\right).$$

To learn a Gaussian mixture model using the EM algorithm, we need to maximize the likelihood function with respect to the parameters. The steps are given below; a code sketch follows the list.

  1. Initialize the means, covariances, and mixing coefficients, and evaluate the initial value of the log likelihood.
  2. E step: evaluate the conditional distribution of the latent factors (the responsibilities) using the current parameter values,

$$\gamma(z_{nk}) = \frac{w_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} w_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.$$

  3. M step: re-estimate the parameters using the responsibilities,

$$N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad w_k = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^{\top}.$$

  4. Evaluate the log likelihood,

$$\ln p(X \mid w, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} w_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k),$$

and check for its convergence. If the convergence criterion is not satisfied, return to step 2.
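
Putting the four steps together, here is a hedged NumPy/SciPy sketch of EM for a Gaussian mixture; the initialization (random means drawn from the data, shared sample covariance) is an assumption, not necessarily the assignment's exact scheme.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    """Fit a K-component Gaussian mixture to N x D data with EM."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.full(K, 1.0 / K)                           # mixing coefficients
    mu = X[rng.choice(N, size=K, replace=False)]      # means from random data points
    sigma = np.array([np.cov(X, rowvar=False)] * K)   # shared initial covariance
    log_lik = -np.inf
    for _ in range(n_iter):
        # E step: responsibilities gamma[n, k] = p(component k | x_n)
        pdf = np.column_stack([w[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                               for k in range(K)])
        gamma = pdf / pdf.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the responsibilities
        Nk = gamma.sum(axis=0)
        w = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        # Convergence check on the log likelihood
        new_log_lik = np.log(pdf.sum(axis=1)).sum()
        if abs(new_log_lik - log_lik) < tol:
            break
        log_lik = new_log_lik
    return w, mu, sigma
```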