Digit Recognition

Contributors: Haoyu Li, Tong Xie, Judy Zhu

Table of Contents
  1. Overview
  2. Data Exploration
  3. Model Training
  4. Performance Analysis
  5. Efficiency Analysis
  6. Comparison between Handwritten and Package Function
  7. Scope and Limitations
  8. License
  9. Reference and Acknowledgments
  10. Demo Video

Overview

The objective of this project is to classify handwritten digits from 0 to 9 using various machine learning models and compare their efficiency. Four models are selected for analysis: Logistic Regression, Multilayer Perceptron (MLP), Random Forest Classifier, and K-nearest Neighbors (KNN). A manual implementation of logistic regression for binary classification is also included and compared against the package logistic regression function.

Requirements

Dataset

Data Exploration

The dataset is a collection of 70,000 handwritten digits ranging from 0 to 9.

Each data point is an image 28 pixels wide and 28 pixels tall, for a total of 28 x 28 = 784 pixels. These 784 pixels make up the 784 features of a data point, and each feature stores a value between 0 and 1 indicating the intensity of the handwriting at that pixel.

Examples of data point images

Splitting the dataset into features X and labels y
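A minimal sketch of this step, assuming the data is fetched from OpenML via sklearn's fetch_openml (the actual loading code is not shown in this README, so the source name and scaling are assumptions):

    from sklearn.datasets import fetch_openml

    # Fetch the 70,000-sample handwritten-digit dataset (assumed source: OpenML "mnist_784").
    dataset = fetch_openml("mnist_784", version=1, as_frame=False)

    X = dataset.data / 255.0        # scale raw 0-255 pixel values into the 0-1 range
    y = dataset.target.astype(int)  # digit labels 0-9

    print(X.shape, y.shape)         # (70000, 784) (70000,)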

Model Training

A 70/30 train/test split is performed with train_test_split before fitting the models.

X_train preview

y_train preview

Shape of X_train and y_train
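A minimal sketch of the 70/30 split, using the X and y arrays from the previous section (random_state is an illustrative choice, not taken from the project):

    from sklearn.model_selection import train_test_split

    # Hold out 30% of the data for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    print(X_train.shape, y_train.shape)  # (49000, 784) (49000,)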

Four classification models (Logistic Regression, MLP, Random Forest Classifier, and KNN) are trained using model.fit(X_train, y_train).
The corresponding accuracy scores are then calculated using model.score(X_test, y_test).
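A sketch of how the four classifiers might be constructed, trained, and scored; the hyperparameters are illustrative defaults and not necessarily the ones used in this project:

    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier

    # Instantiate the four classifiers (illustrative default hyperparameters).
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Multilayer Perceptron": MLPClassifier(),
        "Random Forest Classifier": RandomForestClassifier(),
        "K-nearest Neighbors": KNeighborsClassifier(),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)          # train on the 70% split
        score = model.score(X_test, y_test)  # mean accuracy on the 30% split
        print(f"{name}: {score:.5f}")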

Performance Analysis

Accuracy Scores

  1. Logistic Regression: 0.91614
  2. Multilayer Perceptron: 0.97452
  3. Random Forest Classifier: 0.96795
  4. K-nearest Neighbors: 0.97086

Confusion Matrices

The confusion matrices below visualize and compare the performance of the four models.

Confusion matrices for the four classification models
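One possible way to produce such plots with sklearn's ConfusionMatrixDisplay; the project's actual plotting code is not shown, so this is only a sketch:

    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay

    # Draw one confusion matrix per fitted model, side by side.
    fig, axes = plt.subplots(1, 4, figsize=(22, 5))
    for ax, (name, model) in zip(axes, models.items()):
        ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, ax=ax, colorbar=False)
        ax.set_title(name)
    plt.tight_layout()
    plt.show()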

How to interpret these matrices?

  1. Entry (i, j) of a matrix gives the number of test samples with predicted label i and actual label j.
  2. As the color bar on the right-hand side shows, darker blue indicates a larger count, so most samples are predicted correctly by all of the models.
  3. One particularly interesting observation is that the models struggle with different digits: for example, logistic regression often confuses 1 and 8, while random forest and KNN distinguish them well.

Overall, all four classification models show lower accuracy when predicting the digits 2, 3, and 8.

Efficiency Analysis

This section analyzes the efficiency of the four models by comparing their accuracy scores and execution times.
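A sketch of one way to measure execution time per model, timing fit plus score with time.perf_counter; the project's exact timing code is not shown, so treat this as an assumption about the approach:

    import time

    # (Re)fit each model and measure the combined fit + score time.
    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)
        elapsed = time.perf_counter() - start
        results[name] = (score, elapsed)
        print(f"{name}: score={score:.5f}, time={elapsed:.1f}s")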

Execution Times (Seconds)

  1. Logistic Regression: 54.3764
  2. Multilayer Perceptron: 51.386
  3. Random Forest Classifier: 21.981
  4. K-nearest Neighbors: 36.4333

Histograms of accuracy score and execution time for the four models

How to interpret these histograms?

  1. Logistic regression performs poorly on this dataset: it is the least accurate of the four models and also takes the longest to run.
  2. The multilayer perceptron achieves high accuracy but is also quite time-consuming, which is typical of neural-network models that demand more computational resources.
  3. KNN and Random Forest appear to be the best algorithms to employ here: both achieve high accuracy (KNN is slightly higher, but within the margin of error) and both are comparatively resource-friendly.

Efficiency Plot

Scatter plot of accuracy score versus execution time for the four models
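A sketch of how such a plot could be drawn from the results dictionary in the timing sketch above; the axis orientation (score on x, time on y) is an assumption chosen to match the "bottom right is best" reading below:

    import matplotlib.pyplot as plt

    for name, (score, elapsed) in results.items():
        plt.scatter(score, elapsed, label=name)
    plt.xlabel("Accuracy score")
    plt.ylabel("Execution time (s)")
    plt.title("Efficiency of the four models")
    plt.legend()
    plt.show()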

How to interpret this scatter plot?

  1. This graph plots the single-run behavior (accuracy score vs. execution time) of all four classification models.
  2. Points closer to the bottom right have higher efficiency (high accuracy score and short execution time).

According to the metrics of accuracy score and execution time, K-nearest Neighbors and Random Forest Classifier appear to be the best-performing models out of the four utilized in this project. To further extend the study, other prevailing models for image classification such as Convolutional Neural Network (CNN) may also be included for analysis.

Comparison between Handwritten and Package Function

We implemented a binary logistic regression classifier based on gradient descent, then compared the performance of this handwritten class with sklearn's package logistic regression function using the liblinear solver.
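A minimal sketch of a gradient-descent binary logistic regression classifier of the kind described here; the class name, learning rate, and iteration count are illustrative and not taken from the project's implementation:

    import numpy as np

    class HandwrittenLogisticRegression:
        """Binary logistic regression trained with batch gradient descent (sketch)."""

        def __init__(self, lr=0.1, n_iters=1000):
            self.lr = lr
            self.n_iters = n_iters

        @staticmethod
        def _sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def fit(self, X, y):
            # X: (n_samples, n_features), y: 0/1 labels for the binary task.
            n_samples, n_features = X.shape
            self.w = np.zeros(n_features)
            self.b = 0.0
            for _ in range(self.n_iters):
                p = self._sigmoid(X @ self.w + self.b)  # predicted probabilities
                # Gradient step on the average cross-entropy loss.
                self.w -= self.lr * (X.T @ (p - y)) / n_samples
                self.b -= self.lr * np.mean(p - y)
            return self

        def predict(self, X):
            return (self._sigmoid(X @ self.w + self.b) >= 0.5).astype(int)

Such a class can then be fitted on the same binary subset of the digits as sklearn's LogisticRegression(solver="liblinear") and the two compared.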

Accuracy Matrices

Handwritten implementation

Package (sklearn) implementation

How to interpret these two matrices?

  1. Overall, the handwritten model predicts the data fairly well; the accuracy achieved by the library function is only slightly higher.
  2. Interestingly, both the handwritten model and the package model are weak at distinguishing certain pairs of digits, such as 3 and 5.
  3. However, a major drawback of the naive handwritten version is that its execution time is much longer than that of the library function, which is to be expected.

Scope and Limitations

  1. This method of digit recognition makes errors and might produce incorrect predictions of handwriting compared to classification by humans. Such misclassifications may cause issues in real-life applications that require highly accurate data recognition.
  2. The current dataset is limited to the handwriting styles it was collected from, so the models may not apply to other texts and may produce highly inaccurate predictions on them.
  3. To enhance the performance of these models, a larger and more extensive database containing diverse handwriting styles should be used.
  4. Potential extensions of this project include generalizing the recognition to numbers with more digits and to other handwritten characters, such as alphabets, mathematical symbols, calligraphy, and characters from other languages.

License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Reference and Acknowledgments

Demo Video

The following is the link for our demo video of the project, which consists of a short overview of the project and a presentation of our final result: Project Demo Video

(back to top)