Different machine learning models were built and compared to recognize the handwritten digits.

The train.csv file contains the training data, which consists of 42000 images of 28 x 28 pixels each. Each value in the training data represents the brightness of a particular pixel and ranges from 0 to 255, with 255 being the brightest. The first column holds the target variable (the digit), while the remaining columns hold the feature variables (the pixel values).

The test.csv file contains the testing data, which consists of 28000 images of 28 x 28 pixels each. Each value in the testing data likewise represents the brightness of a particular pixel, ranging from 0 to 255.
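A minimal sketch of loading and preparing this data (assuming both files sit in the working directory and the target column is named "label", as in the Kaggle Digit Recognizer dataset):

```python
import pandas as pd

train = pd.read_csv("train.csv")   # 42000 rows; first column is the target digit
test = pd.read_csv("test.csv")     # 28000 rows; pixel columns only

y_train = train["label"].values                                    # target digits 0-9
X_train = train.drop(columns="label").values.reshape(-1, 28, 28)   # 28 x 28 images
X_test = test.values.reshape(-1, 28, 28)

# Scale pixel brightness from the 0-255 range down to 0-1
X_train = X_train / 255.0
X_test = X_test / 255.0
```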
| Model | Accuracy Score |
|---|---|
| Logistic Regression | 0.8 (80%) |
| Random Forest Classifier | 0.8 (80%) |
| Decision Tree Classifier | 0.8 (80%) |
| Naive Bayes Classifier | 0.8 (80%) |
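A rough sketch of how such a comparison could be run with scikit-learn (assuming the data was loaded as in the sketch above; the hyperparameters here are assumptions, not the exact settings behind the scores in the table):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Flatten each 28 x 28 image into a 784-value row and hold out a validation split
X = X_train.reshape(len(X_train), -1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y_train, test_size=0.2, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest Classifier": RandomForestClassifier(),
    "Decision Tree Classifier": DecisionTreeClassifier(),
    "Naive Bayes Classifier": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_val, model.predict(X_val)))
```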
Compared with the models in the table above, the Convolutional Neural Network model had the best performance in recognizing handwritten digits. Let us now look at how a convolutional neural network works.
Convolutional Neural Networks are a branch of deep learning that has proven very effective in media processing tasks such as image recognition and audio/video recognition. The dimensionality of the image is reduced before it is fed to a fully connected neural network, in such a way that all the important features of the image are retained. The important processes in a Convolutional Neural Network for image processing are as follows.
# Convolution
In the process of Convolution, a Kernel (a matrix) used to extract features from the image moves over the input image and performs a dot product with the corresponding sub-region of that image. The output of Convolution is a matrix of these dot products. The Kernel is moved across the image from left to right and top to bottom, in steps given by the Stride value. The dimensions of the image after Convolution are given by

O = ((I - K) / S) + 1

where

- O stands for the Output image dimension
- I stands for the Input image dimension
- K stands for the Kernel size
- S stands for the Stride
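A minimal sketch of this operation in plain NumPy (assuming a square image and Kernel; this is illustrative, not the implementation used in the models above):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image and take the dot product at each position."""
    I, K, S = image.shape[0], kernel.shape[0], stride
    O = (I - K) // S + 1                           # O = ((I - K) / S) + 1
    out = np.zeros((O, O))
    for i in range(O):
        for j in range(O):
            patch = image[i*S:i*S+K, j*S:j*S+K]    # sub-region under the Kernel
            out[i, j] = np.sum(patch * kernel)     # dot product
    return out

image = np.random.rand(28, 28)                     # a 28 x 28 input, like one digit image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])                    # a 3 x 3 vertical-edge Kernel
print(conv2d(image, kernel).shape)                 # (26, 26): ((28 - 3) / 1) + 1 = 26
```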
# Padding

Padding is the approach where an extra layer of pixels is added around the image, copying the pixels from the edge of the image, in order to make convolution effective at the edge pixels. It resolves the Border Effect that arises when the edge pixels are not fully covered during convolution. The dimensions of the image after convolution with Padding are given by

O = ((I - K + 2P) / S) + 1

where

- O stands for the Output image dimension
- I stands for the Input image dimension
- K stands for the Kernel size
- P stands for the Padding size
- S stands for the Stride
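A short sketch of padding using NumPy's edge mode, which copies the border pixels outward as described above (the sizes here are illustrative assumptions):

```python
import numpy as np

P = 1                                              # Padding size
image = np.random.rand(28, 28)
padded = np.pad(image, pad_width=P, mode="edge")   # copy the edge pixels outward
print(padded.shape)                                # (30, 30)

# Convolving the padded image with a 3 x 3 Kernel and Stride 1:
I, K, S = 28, 3, 1
O = (I - K + 2 * P) // S + 1
print(O)                                           # 28, so the original 28 x 28 size is preserved
```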
# Pooling

Pooling is used to scale down the dimensions of the image while retaining the important features in the feature map. The features are scaled down by summarizing the presence of features in patches of the feature map; these patches are generally known as Pool Windows.

The common Pooling methods are:

- Max Pooling: summarizes the most activated presence of a feature in the Pool Window
- Min Pooling: summarizes the least activated presence of a feature in the Pool Window
- Average Pooling: summarizes the average presence of a feature in the Pool Window
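A minimal sketch of Max Pooling over a feature map (replacing .max() with .min() or .mean() would give Min or Average Pooling); the window and stride values are assumptions:

```python
import numpy as np

def max_pool(feature_map, window=2, stride=2):
    """Keep only the strongest activation in each Pool Window."""
    I, W, S = feature_map.shape[0], window, stride
    O = (I - W) // S + 1
    out = np.zeros((O, O))
    for i in range(O):
        for j in range(O):
            out[i, j] = feature_map[i*S:i*S+W, j*S:j*S+W].max()
    return out

feature_map = np.random.rand(26, 26)   # e.g. the 26 x 26 output of the convolution step
print(max_pool(feature_map).shape)     # (13, 13)
```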
# Flattening
Once the pooled feature map is obtained, the next step is to flatten it into a single column before feeding it to the neural network. This step makes the computation of the neural network more efficient and less expensive.
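A small sketch of this step (the 13 x 13 size simply carries on from the pooling example above):

```python
import numpy as np

pooled = np.random.rand(13, 13)        # e.g. a 13 x 13 pooled feature map
flat = pooled.flatten()                # a single column of values
print(flat.shape)                      # (169,) -> fed to the fully connected layers
```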