Automatic customization of offline handwritten text classifier to individual users

Code for a master's thesis and paper focused on offline writer identification at the character level.
The proposed method improves a neural network classifier by up to +2.7%.
The method achieves state-of-the-art results.

papers

Contains source PDFs for several datasets and related material.

src

Nist

The source code consists of several separate files:

NIST_Data_Util.ipynb - prepare dataset
NIST_Data_Util_Advanced.ipynb - create ImageDisk and InfoDisk files
NIST_Baseline_Training.ipynb - train baseline network and save weights and models (baseline+finder)
NIST_Clustering_Knn_Evaluation.ipynb - cluster characters on train+val, create knn method and evaluate on final 200 writers (+2.36%)

ETH_Zurich

Link_For_Prepare_Dataset_Tools
ETH_Zurich_Baseline_Evaluation.ipynb - train baseline network and save weights and models (baseline+finder)
ETH_Zurich_Clustering_Knn_Evaluation.ipynb - cluster characters on train+val, create knn method and evaluate on final 25 writers

CVL_Single_Digit_Dataset

CVL_Single_Digit_Dataset.ipynb - full process for one simple dataset. The method improves on a weakly trained baseline; if the baseline is already very good, the method performs worse than it.

Weights

Nist weights

Weights are stored in separate folder NIST_weights (baseline+finder).

ETH_Zurich weights

Weights are stored in separate folder ETH_Zurich_weights (baseline+finder).

DATASETS

Parse UBuffalo inkmls (todo)

The UBuffalo InkML files include writer identification and could be parsed with the CROHME tool. HM: "Yes, the segGenerator.py script from crohmelib can do that. This is what we use to generate the isolated symbols from the full expressions (with ground-truth)."

This is not possible at the moment: according to an email conversation with Harold Mouchère, they use a semi-supervised method for character segmentation in CROHME.

Nist Offline Handwritten Database (Manually pre-processed)

Nist by_class
Nist by_write
Processed ImageDisk (MNIST/EMNIST style preprocessed images, 28x28 pixels, grayscale)
Processed InfoDisk (Ground truth information about images and labels and writer ids)

ETH_Zurich dataset:

deepwriting_training.npz
deepwriting_validation.npz
These files are in NumPy's .npz format, each holding a dictionary of named arrays.
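Before converting, it can help to check which arrays an .npz archive actually holds. The following is a minimal sketch, not the repository's own tooling: `summarize_npz` is a hypothetical helper, and the demo archive with its key names (`samples`, `subject_labels`) is a made-up stand-in, not the real layout of deepwriting_training.npz.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Return {key: (shape, dtype)} for every array stored in an .npz archive."""
    summary = {}
    # allow_pickle=True is only needed when the archive stores Python objects;
    # enable it only for files from a trusted source.
    with np.load(path, allow_pickle=True) as archive:
        for key in archive.files:
            value = archive[key]
            summary[key] = (value.shape, str(value.dtype))
    return summary


# Demo on a tiny stand-in archive (hypothetical keys, not the real dataset).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.npz")
np.savez(
    demo_path,
    samples=np.zeros((3, 5, 2), dtype=np.float32),
    subject_labels=np.array([0, 1, 1]),
)
demo_summary = summarize_npz(demo_path)
for key, (shape, dtype) in demo_summary.items():
    print(key, shape, dtype)
```

Pointing `summarize_npz` at deepwriting_training.npz or deepwriting_validation.npz should list the real keys and array shapes.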

First convert them to:

eth_dataset0.zip
eth_dataset1.zip
eth_dataset2.zip
which contain files in .svg format.

Then convert them to 28x28, centered, scale-preserved .png images:

ETH_png_archive.zip.
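The centering step can be sketched on an already rasterised glyph. This is an illustration under assumptions, not the conversion code used for the archive: it assumes "scale preserved" means the aspect ratio is kept, that ink is nonzero on a zero background, and it uses nearest-neighbour resampling. The function name `center_on_canvas` and the `margin` parameter are hypothetical.

```python
import numpy as np


def center_on_canvas(glyph, size=28, margin=2):
    """Crop to the ink bounding box, rescale with a single factor for both
    axes (aspect ratio kept), and paste centered on a size x size canvas."""
    glyph = np.asarray(glyph, dtype=np.float32)
    canvas = np.zeros((size, size), dtype=np.float32)

    # Rows and columns that contain any ink (nonzero pixels).
    rows = np.flatnonzero(glyph.any(axis=1))
    cols = np.flatnonzero(glyph.any(axis=0))
    if rows.size == 0:
        return canvas  # blank input stays blank

    crop = glyph[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    h, w = crop.shape

    # One scale factor so the longer side fits inside the margins.
    scale = (size - 2 * margin) / max(h, w)
    new_h = max(1, int(round(h * scale)))
    new_w = max(1, int(round(w * scale)))

    # Nearest-neighbour resampling: map each target pixel to a source pixel.
    row_idx = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    col_idx = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = crop[np.ix_(row_idx, col_idx)]

    top = (size - new_h) // 2
    left = (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas


# Demo: a tall 80x20 filled stroke inside a 100x60 image.
demo = np.zeros((100, 60), dtype=np.float32)
demo[10:90, 20:40] = 1.0
centered = center_on_canvas(demo)
print(centered.shape)
```

Because a single scale factor is applied to both axes, the tall stroke in the demo stays tall and narrow instead of being stretched to fill the canvas.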

The archive contains:

294 writers
369455 images (average: 1256 images/writer)
70 labels
Average characters per writer and label: 17.95 (strongly unbalanced)

Link to dataset on Google Drive: [Link](https://drive.google.com/open?id=1AdX4sndfOcl32B9CSWPn5VNDl4kW1hAP)
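The averages quoted for the archive follow directly from the three counts. As a quick check (the variable names are only illustrative):

```python
writers = 294
images = 369455
labels = 70

# Average number of images contributed by each writer.
images_per_writer = images / writers

# Average number of samples per (writer, label) pair.
samples_per_writer_label = images / (writers * labels)

print(int(images_per_writer))               # 1256
print(round(samples_per_writer_label, 2))   # 17.95
```

With fewer than 18 samples on average for each writer and label, and a strongly unbalanced distribution, some writer/label pairs will have very few examples.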

cugurm/CNN-handwritten-classifier-improvement