# handwritten-digits-recognition

Image recognition of handwritten digits [MNIST]


Machine learning semester project for the Statistical Learning course at the Aristotle University of Thessaloniki. The task of the project was to apply machine learning algorithms to the MNIST benchmark dataset in order to recognize handwritten digit images. MNIST was introduced by Yann LeCun and contains 70,000 images of 28x28 pixels each, giving a feature vector of 784 dimensions. The training set comprises the first 60,000 images and the test set the last 10,000 images.

I performed classification, clustering, dimensionality reduction and embedding. At best, SVM achieved a 1.8% error rate.

#### Dependencies

* Python 2.7+
* scikit-learn
* Matplotlib
* NumPy

#### Classification

Running svm_mnist.py executes the SVM classification code. The code first loads the dataset via the helper function provided by scikit-learn, then normalizes each pixel to [0, 1]:

```python
X_train, y_train = np.float32(mnist.data[:60000]) / 255., np.float32(mnist.target[:60000])
```

In order to be able to run this task on a regular machine, we reduce the dimensionality from 784 to 90 with PCA. That way, we keep around 91% of the original variance. PICTURE
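The PCA step can be sketched as below. To keep the example small and fast, it uses scikit-learn's built-in 8x8 digits dataset (64 features) as a stand-in for MNIST; on the full 784-dimensional MNIST data one would set `n_components=90` as described above. The `n_components=30` value here is only illustrative for the smaller dataset.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in for MNIST: sklearn's small 8x8 digits dataset (1797 samples, 64 features).
X, y = load_digits(return_X_y=True)
X = np.float32(X) / 16.  # digits pixels range 0..16, so this maps them to [0, 1]

# Reduce dimensionality; on MNIST we would use n_components=90 instead.
pca = PCA(n_components=30)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
print("explained variance kept: %.2f" % pca.explained_variance_ratio_.sum())
```

`explained_variance_ratio_.sum()` reports the fraction of variance retained, which is how the ~91% figure above is computed.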

After dimensionality reduction, we train SVMs with various kernels and hyperparameters. The following accuracy results were obtained with 5-fold cross-validation. PICTURE
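A minimal sketch of the PCA + SVM pipeline with 5-fold cross-validation is shown below. It again uses the small built-in digits dataset rather than MNIST, and the RBF kernel with `C=10` is just one illustrative hyperparameter choice, not necessarily the setting that produced the results above.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# PCA followed by an RBF-kernel SVM; C and gamma here are illustrative.
clf = make_pipeline(PCA(n_components=30), SVC(kernel="rbf", C=10, gamma="scale"))

# 5-fold cross-validation, as in the experiments above.
scores = cross_val_score(clf, X, y, cv=5)
print("5-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Wrapping both steps in a pipeline ensures PCA is re-fit on each training fold, so no information leaks from the validation folds.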

Some correct and incorrect classification examples are shown below. In MNIST, the digit "9" is sometimes confused with "4".

| Correct prediction | False prediction |
| --- | --- |
| PICTURE | PICTURE |

#### Dimensionality Reduction

Running kpca_mnist.py executes the LDA + kernel PCA code. With the reduced dimensions, we run kNN and NearestCentroid classifiers. Please note that kernel PCA is a memory-intensive process, so we limit the training set to 15,000 samples. The following table presents the classification accuracy after reducing the dimensionality down to 9. PICTURE
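The LDA + kernel PCA pipeline can be sketched as follows, again on the small built-in digits dataset rather than the 15,000 MNIST samples. LDA can project onto at most n_classes - 1 = 9 dimensions for 10 digit classes, which matches the 9 final dimensions above; the RBF kernel for kernel PCA is an assumption, since the original kernel choice is not stated.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LDA: supervised projection onto n_classes - 1 = 9 dimensions.
lda = LinearDiscriminantAnalysis(n_components=9)
Z_train = lda.fit_transform(X_train, y_train)
Z_test = lda.transform(X_test)

# Kernel PCA on the LDA-reduced data (RBF kernel is an assumption here).
kpca = KernelPCA(n_components=9, kernel="rbf")
K_train = kpca.fit_transform(Z_train)
K_test = kpca.transform(Z_test)

# Classify in the final 9-dimensional space with kNN and NearestCentroid.
scores = {}
for clf in (KNeighborsClassifier(n_neighbors=5), NearestCentroid()):
    clf.fit(K_train, y_train)
    scores[type(clf).__name__] = clf.score(K_test, y_test)
    print(type(clf).__name__, scores[type(clf).__name__])
```

The memory cost noted above comes from kernel PCA building an n_samples x n_samples kernel matrix, which is why the training set is capped at 15,000 samples.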

#### Embedding Projections & Clustering

Finally, we run cluster_mnist.py to project the dataset into two-dimensional space using Spectral and Isomap embeddings. Keeping 5,000 samples for visualization, we perform spectral clustering. To evaluate the clustering effectiveness, we compute the cluster completeness score, which is below 0.5 in both cases. The following scatterplots display the embeddings.
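The embed-then-cluster step can be sketched as below, using the Isomap variant on 500 built-in digits samples instead of 5,000 MNIST samples. The nearest-neighbors affinity for spectral clustering is an assumption; swapping `Isomap` for `sklearn.manifold.SpectralEmbedding` gives the spectral-embedding variant.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap
from sklearn.metrics import completeness_score

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]  # subsample, mirroring the 5,000-sample setup above

# Project to 2D with Isomap, then run spectral clustering with 10 clusters.
emb = Isomap(n_components=2).fit_transform(X)
labels = SpectralClustering(n_clusters=10, affinity="nearest_neighbors",
                            random_state=0).fit_predict(emb)

# Completeness is 1.0 when all members of a true class land in one cluster.
print("completeness: %.3f" % completeness_score(y, labels))
```

Completeness only checks that each true class stays together in a single cluster, which is why it is reported here rather than raw accuracy: cluster labels have no fixed correspondence to digit labels.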

| Isomap | Spectral |
| --- | --- |
| PICTURE | PICTURE |