Code for a master's thesis and paper focused on offline writer identification at the character level.
The proposed method boosts a neural network classifier by up to +2.7%.
The method achieves state-of-the-art results.
Contains source PDFs for several datasets, etc.
The source code consists of several separate files:
- NIST_Data_Util.ipynb - prepare dataset
- NIST_Data_Util_Advanced.ipynb - create ImageDisk and InfoDisk files
- NIST_Baseline_Training.ipynb - train baseline network and save weights and models (baseline+finder)
- NIST_Clustering_Knn_Evaluation.ipynb - cluster characters on train+val, create knn method and evaluate on final 200 writers (+2.36%)
- Link_For_Prepare_Dataset_Tools
- ETH_Zurich_Baseline_Evaluation.ipynb - train baseline network and save weights and models (baseline+finder)
- ETH_Zurich_Clustering_Knn_Evaluation.ipynb - cluster characters on train+val, create knn method and evaluate on final 25 writers
- CVL_Single_Digit_Dataset.ipynb - full process for one simple dataset. We improve on a weakly trained baseline; if the baseline is already too good, our method is worse than it.
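The clustering and kNN notebooks above are described only briefly. As a rough illustration of the general idea (identifying a writer by nearest-neighbour voting over per-character feature vectors), here is a minimal pure-Python sketch. The function names, the feature tuples, and the simple majority vote are assumptions made for illustration; this is not the notebooks' actual implementation.

```python
import math
from collections import Counter


def nearest_writers(query, gallery, k=3):
    """Return the writer ids of the k gallery features closest to `query`.

    `gallery` is a list of (feature_tuple, writer_id) pairs (hypothetical layout).
    """
    ranked = sorted((math.dist(query, feat), writer) for feat, writer in gallery)
    return [writer for _, writer in ranked[:k]]


def identify_writer(char_features, gallery, k=3):
    """Vote over all characters of one sample and return the most common writer."""
    votes = Counter()
    for feat in char_features:
        votes.update(nearest_writers(feat, gallery, k))
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy gallery: two writers with three 2-D character features each.
    gallery = [
        ((0.0, 0.0), "A"), ((0.1, 0.0), "A"), ((0.0, 0.1), "A"),
        ((1.0, 1.0), "B"), ((1.1, 1.0), "B"), ((1.0, 1.1), "B"),
    ]
    print(identify_writer([(0.05, 0.05), (0.0, 0.2)], gallery))
```

In practice the feature vectors would come from a trained network rather than hand-written tuples, and `k` would be tuned on the validation writers.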
Weights for NIST are stored in the separate folder NIST_weights (baseline+finder).
Weights for ETH Zurich are stored in the separate folder ETH_Zurich_weights (baseline+finder).
UBuffalo InkML files with writer identification: parse them with the CROHME tool. HM: "Yes, the segGenerator.py script from crohmelib can do that. This is what we use to generate the isolated symbols from the full expressions (with ground-truth)."
- Can't do that because of the mail conversation with Harold Mouchère: they have a semi-supervised method for character segmentation in CROHME.
- NIST by_class
- NIST by_write
- Processed ImageDisk (MNIST/EMNIST style preprocessed images, 28x28 pixels, grayscale)
- Processed InfoDisk (Ground truth information about images and labels and writer ids)
- deepwriting_training.npz
- deepwriting_validation.npz
are files in a specific dictionary-style NumPy format.
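To see what such a dictionary-style `.npz` archive contains before converting it, a small inspection helper like the one below may be useful. This is only a sketch: `summarize_npz` is a hypothetical helper name, the keys inside the deepwriting files are not listed here, and `allow_pickle=True` is an assumption that is needed only if the arrays hold Python objects (use it only on files you trust).

```python
import numpy as np


def summarize_npz(path):
    """Map each array name in an .npz archive to its (shape, dtype string)."""
    # allow_pickle=True is an assumption: required only for object arrays.
    with np.load(path, allow_pickle=True) as archive:
        return {
            key: (archive[key].shape, str(archive[key].dtype))
            for key in archive.files
        }


if __name__ == "__main__":
    import sys

    # Example: python summarize.py deepwriting_training.npz
    for name, (shape, dtype) in summarize_npz(sys.argv[1]).items():
        print(name, shape, dtype)
```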
First convert them to:
- eth_dataset0.zip
- eth_dataset1.zip
- eth_dataset2.zip
which are files in .svg format
Then convert them to 28x28 centered, scale-preserved .png images. The resulting archive contains:
- 294 writers
- 369455 images (average: 1256 images per writer)
- 70 labels
- on average 17.95 characters per writer and label (strongly unbalanced)
- Link to dataset on Google Drive: [Link](https://drive.google.com/open?id=1AdX4sndfOcl32B9CSWPn5VNDl4kW1hAP)
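The 28x28 centering step is not spelled out here. One possible reading, assuming "scale preserved" means that the aspect ratio of each character is kept, is to shrink the character's bounding box until its longer side fits the canvas and then center it. The helper below only computes that placement; `fit_on_canvas` is a hypothetical name and this is a sketch, not the conversion code used to build the archive.

```python
def fit_on_canvas(width, height, canvas=28):
    """Return (new_width, new_height, offset_x, offset_y) for a glyph box.

    The box is scaled uniformly so its longer side equals `canvas`,
    then centered on a canvas x canvas image.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = canvas / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    off_x = (canvas - new_w) // 2
    off_y = (canvas - new_h) // 2
    return new_w, new_h, off_x, off_y


if __name__ == "__main__":
    # A wide 100x50 glyph fills the width and is centered vertically.
    print(fit_on_canvas(100, 50))
```

The returned size and offsets could then be passed to whatever image library performs the actual resize and paste.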