/Document-Classification

Deals with document image classification into 16 categories.


Scanned Document Classification

We take many scanned images of documents of various types: some are captured on handheld devices, some with scanners, and so on. It is therefore increasingly important to organize these scanned documents, which requires reliable, high-quality classification of the document images into categories such as letter, form, etc.

This work is part of the IndoML22 (Indian Symposium on Machine Learning 2022) Datathon Challenge.

Data

The training and validation data provided in the Datathon are a subset of 16,000 grayscale images from the RVL-CDIP dataset, with 1,000 images belonging to each of the 16 categories. The competition and the data are released through its Kaggle competition.

The images span 16 categories, listed below with their corresponding labels:

Letter (0), Form (1), Email (2), Handwritten (3)
Advertisement (4), Scientific Report (5), Scientific Publication (6), Specification (7)
File Folder (8), News Article (9), Budget (10), Invoice (11)
Presentation (12), Questionnaire (13), Resume (14), Memo (15)

A discussion of the data, with a few more images from both the training and validation sets, can be found in the data overview notebook.

Task

The task is to build a model that classifies each image into its respective category. Performance is evaluated using the Mean F1-Score. The F1 score, commonly used in information retrieval, measures accuracy using the statistics precision $(\text{p})$ and recall $(\text{r})$.

Precision is the ratio of true positives $(\text{tp})$ to all predicted positives $(\text{tp} + \text{fp})$. Recall is the ratio of true positives $(\text{tp})$ to all actual positives $(\text{tp} + \text{fn})$. The F1 score is given by:

$$ \text{F1} = 2\frac{\text{p} \cdot \text{r}}{\text{p}+\text{r}}\ \ \mathrm{where}\ \ \text{p} = \frac{\text{tp}}{\text{tp}+\text{fp}},\ \ \text{r} = \frac{\text{tp}}{\text{tp}+\text{fn}} $$

The F1 metric weights recall and precision equally, and a good retrieval algorithm will maximize both precision and recall simultaneously. Thus, moderately good performance on both will be favored over extremely good performance on one and poor performance on the other.
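As an illustration of the formula above, here is a minimal sketch of a Mean F1 computation. It assumes that "Mean F1" means the unweighted average of the per-class F1 scores (macro averaging); the source does not spell out the averaging scheme. The function names `per_class_f1` and `mean_f1` are hypothetical and are not taken from the repository.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, num_classes=16):
    """Return a list with the F1 score of each class label 0..num_classes-1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # class p was predicted but is wrong
            fn[t] += 1   # class t was missed
    scores = []
    for c in range(num_classes):
        # F1 = 2pr / (p + r) simplifies to 2tp / (2tp + fp + fn).
        denom = 2 * tp[c] + fp[c] + fn[c]
        # Assumption: a class that never appears in labels or predictions scores 0.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return scores


def mean_f1(y_true, y_pred, num_classes=16):
    """Unweighted mean of per-class F1 scores (macro average, an assumption)."""
    scores = per_class_f1(y_true, y_pred, num_classes)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy example with 3 classes instead of 16.
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    print(per_class_f1(y_true, y_pred, num_classes=3))
    print(mean_f1(y_true, y_pred, num_classes=3))
```

In the toy example, class 0 has one true positive, one false positive and one false negative, so its F1 is 0.5, which shows how a single confusion lowers precision and recall at the same time.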

Method

Several methods based on visual feature extraction were applied, using pretrained models such as EfficientNetV2L (trained on ImageNet). Four of them are:

  • EfficientNet followed by FFN (EffNet)
  • Partitioned Image based EfficientNet followed by FFN (EffNet-4Piece)
  • InceptionResNetV2 along with RoI based Vision Transformer Network (IncResNet-RoI-ViT) [Model Report]
  • ResNet-VGG-InceptionResNetV2 along with PCA followed by FFN (ResVGGInc-PCA-4Piece) [Model Report]
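The "4Piece" variants in the list above operate on a partitioned image. The source does not describe how the image is split, so the following is only a sketch under the assumption that each image is cut into a 2x2 grid of quadrants, each of which could then be passed to the feature extractor. The function name `split_into_quadrants` is hypothetical.

```python
import numpy as np


def split_into_quadrants(image):
    """Split an image array of shape (H, W) or (H, W, C) into four pieces.

    Assumption: a 2x2 grid; the repository's actual partitioning may differ.
    Returns [top-left, top-right, bottom-left, bottom-right].
    """
    height, width = image.shape[:2]
    mid_h, mid_w = height // 2, width // 2
    return [
        image[:mid_h, :mid_w],   # top-left
        image[:mid_h, mid_w:],   # top-right
        image[mid_h:, :mid_w],   # bottom-left
        image[mid_h:, mid_w:],   # bottom-right
    ]


if __name__ == "__main__":
    # Stand-in for a grayscale scan; real images would be loaded from disk.
    page = np.zeros((1000, 754), dtype=np.uint8)
    for piece in split_into_quadrants(page):
        print(piece.shape)
```

With odd dimensions the right and bottom pieces are one pixel larger, so a resize step would be needed before feeding the pieces to a network that expects a fixed input size.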

The clustering results for the learned penultimate-layer feature vectors of the above models on the training set are shown below, with the Mean F1 score of each model:

EffNet (Mean-F1: 0.6), EffNet-4Piece (Mean-F1: 0.68)
IncResNet-RoI-ViT (Mean-F1: 0.755), ResVGGInc-PCA-4Piece (Mean-F1: 0.785)

Usage

  • Refer to the IndoML22 folder. It contains a README.txt file with all the information on how to train the ViT model using train.ipynb and how to run inference with the trained model using test.ipynb.
  • Colab Notebooks: train notebook and test.ipynb. Reading README.txt, as mentioned above, will help in understanding the directory structure.
  • Link to the Pretrained Model to be updated.