CNN-handwritten-classifier-improvement

Code for paper, offline writer identification on character level


Automatic customization of offline handwritten text classifier to individual users

Code for the master's thesis and paper, focused on offline writer identification at the character level.
The proposed method improves a neural network classifier by up to +2.7%.
The method achieves state-of-the-art results.

papers

Contains source PDFs for several datasets, etc.

src

Nist

The source code consists of several separate files:

  • NIST_Data_Util.ipynb - prepare dataset
  • NIST_Data_Util_Advanced.ipynb - create ImageDisk and InfoDisk files
  • NIST_Baseline_Training.ipynb - train baseline network and save weights and models (baseline+finder)
  • NIST_Clustering_Knn_Evaluation.ipynb - cluster characters on train+val, create knn method and evaluate on final 200 writers (+2.36%)

ETH_Zurich

  • Link_For_Prepare_Dataset_Tools
  • ETH_Zurich_Baseline_Evaluation.ipynb - train baseline network and save weights and models (baseline+finder)
  • ETH_Zurich_Clustering_Knn_Evaluation.ipynb - cluster characters on train+val, create knn method and evaluate on final 25 writers

CVL_Single_Digit_Dataset

  • CVL_Single_Digit_Dataset.ipynb - Full process for one simple dataset. We do better when the baseline is weakly trained; if the baseline is already too good, we do worse than it.

Weights

Nist weights

Weights are stored in a separate folder, NIST_weights (baseline+finder).

ETH_Zurich weights

Weights are stored in a separate folder, ETH_Zurich_weights (baseline+finder).

DATASETS

Parse UBuffalo inkmls (todo)

The UBuffalo inkmls include writer identification; the plan was to parse them with the CROHME tool. HM: "Yes, the segGenerator.py script from crohmelib can do that. This is what we use to generate the isolated symbols from the full expressions (with ground-truth)."

  • Can't do that: according to the mail conversation with Harold Mouchère, they use a semi-supervised method for character segmentation in CROHME.

Nist Offline Handwritten Database (Manually pre-processed)

ETH_Zurich dataset:

First convert them to:

Then convert them to 28x28 centered, scale-preserved .png images:
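As an illustration only, the centering step can be sketched in plain Python. This is not the conversion tool used for the dataset. It assumes that "scale preserved" means the aspect ratio is kept (one scale factor for both axes, never enlarging), that images are grayscale lists of rows, and that the background value is 0. The function name `center_on_canvas` and its parameters are hypothetical.

```python
def center_on_canvas(image, size=28, background=0):
    """Crop to the ink bounding box, shrink to fit, and center on a size x size canvas.

    `image` is a list of rows of grayscale values (assumed format).
    """
    # Rows and columns that contain at least one non-background pixel.
    rows = [r for r, row in enumerate(image) if any(p != background for p in row)]
    cols = [c for c in range(len(image[0])) if any(row[c] != background for row in image)]

    canvas = [[background] * size for _ in range(size)]
    if not rows:
        return canvas  # blank input: nothing to place

    top, left = rows[0], cols[0]
    height = rows[-1] - top + 1
    width = cols[-1] - left + 1

    # One scale factor for both axes keeps the aspect ratio.
    # Assumption: glyphs smaller than the canvas are not enlarged.
    scale = min(1.0, size / max(height, width))
    new_h = max(1, round(height * scale))
    new_w = max(1, round(width * scale))

    # Offsets that center the resized glyph on the canvas.
    off_y = (size - new_h) // 2
    off_x = (size - new_w) // 2

    # Nearest-neighbor resampling from the cropped region.
    for y in range(new_h):
        src_y = top + min(height - 1, int(y / scale))
        for x in range(new_w):
            src_x = left + min(width - 1, int(x / scale))
            canvas[off_y + y][off_x + x] = image[src_y][src_x]
    return canvas


# Example: a 56x56 image with a 28-pixel-wide, full-height stroke.
tall = [[255 if 10 <= c < 38 else 0 for c in range(56)] for _ in range(56)]
small = center_on_canvas(tall)
print(len(small), len(small[0]))
```

Writing the result to a .png file is left to an imaging library; in practice the resize and paste would more likely be done with such a library than with nested lists.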


Archive contains:

  • 294 writers
  • 369455 images (average: 1256 images/writer)
  • Number of labels: 70
  • Average characters per writer x label: 17.95 (strongly unbalanced)
  • Link to dataset on GoogleDrive: [Link](https://drive.google.com/open?id=1AdX4sndfOcl32B9CSWPn5VNDl4kW1hAP)
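The two averages quoted above follow directly from the totals. A quick check, using only the numbers stated here (the variable names are illustrative):

```python
writers = 294
images = 369455
labels = 70

# Mean number of images contributed by one writer.
images_per_writer = images / writers

# Mean number of images per (writer, label) cell if every writer
# had written every label; real counts vary widely around this.
images_per_writer_label = images / (writers * labels)

print(f"{images_per_writer:.2f} images per writer")
print(f"{images_per_writer_label:.2f} images per writer x label")
```

The first value comes out slightly above the quoted 1256, which is the same figure truncated to a whole number. An average near 18 samples per writer and label, combined with the imbalance noted above, means that some writer and label combinations have very few examples.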