# Train Calamari models for Upper Sorbian (Fraktur and Antiqua) prints on HPC
Scripts for training Calamari OCR models for Upper Sorbian prints on ZIH's Power9 Nvidia V100 HPC cluster.
The GT data is here for Fraktur and here for Antiqua. Production and rights: Sorbian Institute.
The approach was to fine-tune pretrained models:
- for Fraktur prints (16k lines × 5 kinds of preprocessing):
  - with Calamari 2: `deep3_fraktur19`
  - with Calamari 1: `fraktur_19th_century`
- for Antiqua prints (16k lines × 5 kinds of preprocessing):
  - with Calamari 2: `deep3_lsh4`
  - with Calamari 1: `antiqua_historical`
(We don't want voting during inference, therefore we run `calamari-train` – not `calamari-cross-fold-train` – and pick the first model of the respective pretrained ensemble. We use the Calamari 2.2.2 / Calamari 1.0.5 CLIs, in an attempt to find similar settings for both versions.)
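As an illustration of that warm-start setup, the first ensemble member could be fetched from the upstream pretrained-model collection roughly like this (the repository URL is real, but treating `fraktur_19th_century/0.ckpt.json` as the first member is an assumption about the collection's layout, and the Calamari 2 `deep3_*` models may be distributed separately):

```bash
# Fetch the pretrained Calamari 1 ensembles (directory/file layout of the
# upstream collection is an assumption).
git clone https://github.com/Calamari-OCR/calamari_models.git

# Each pretrained model is a voting ensemble of several folds; we warm-start
# from the first member only, e.g. for the Calamari 1 Fraktur baseline:
PRETRAINED=calamari_models/fraktur_19th_century/0.ckpt.json
```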
This repo provides the Slurm scripts, which (see the minimal sketch below):
- source an environment script `ocrenv.sh`, loading the HPC environment's modules (an Lmod system) and a custom venv (`powerai-kernel2.txt`)
- check whether any checkpoints already exist in the output directory:
  - if yes, resume via `calamari-resume-training`
  - otherwise, start `calamari-train`
- set up all parameters
- wrap the call with Nvidia Nsight for profiling
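A minimal sketch of that control flow, assuming illustrative paths, a simple checkpoint glob, and an assumed argument format for `calamari-resume-training` (the real scripts in this repo assemble the full parameter list):

```bash
#!/bin/bash
#SBATCH --job-name=hsbfraktur-cala      # illustrative job name; resources see below

set -eu

# load the HPC modules (Lmod) and activate the custom venv
source ocrenv.sh

OUTDIR="$HOME/models/hsbfraktur.cala"   # assumed output directory
TRAIN_ARGS=(--output_dir "$OUTDIR")     # placeholder: the real scripts add data,
                                        # warm-start, augmentation, batching, ...

if compgen -G "$OUTDIR/*.ckpt*" > /dev/null; then
    # checkpoints already exist: continue training from the last state
    # (argument format is an assumption; check `calamari-resume-training --help`)
    CMD=(calamari-resume-training "$OUTDIR")
else
    # no checkpoints yet: start training (warm-started from the pretrained weights)
    CMD=(calamari-train "${TRAIN_ARGS[@]}")
fi

# wrap the call with Nvidia Nsight Systems for profiling
nsys profile -o "$OUTDIR/profile" "${CMD[@]}"
```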
For optimal resource allocation (empirically determined via Nsight and the PIKA job monitoring system), we use (see the Slurm sketch below):
- a large batch size (64-80)
- a large number (10) of cores and data workers
- a high amount of RAM (32 GB) per core, without preloading (but with the data on a RAM disk) and with data prefetching (32)
- multiple GPUs (with the `MirroredStrategy` for distributed training) on Calamari 2
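In Slurm terms that allocation looks roughly like this (the partition name, GPU count and walltime are assumptions; the actual values are set in the scripts):

```bash
#SBATCH --partition=ml         # assumed name of the Power9/V100 partition
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10     # 10 cores, matching the number of data workers
#SBATCH --mem-per-cpu=32G      # 32 GB RAM per core
#SBATCH --gres=gpu:4           # multiple V100 GPUs (count is illustrative)
#SBATCH --time=24:00:00        # illustrative walltime
```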
For optimal accuracy, we use (see the sketch after this list):
- re-computing the codec (i.e. keeping only shared codepoints, adding new ones)
- implicit augmentation (5-fold)
- explicit augmentation (by passing raw colors plus multiple binarization variants)
- early stopping (at 10 epochs without improvement)
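A hedged example of how these settings map onto the Calamari 1 CLI (the data globs, the warm-start checkpoint and the output path are placeholders, and the Calamari 2 flags are named differently):

```bash
# Calamari 1 fine-tuning call (illustrative; see the actual Slurm scripts for
# the full parameter set and the Calamari 2 equivalents):
#   --files / --validation : line images with accompanying .gt.txt transcriptions
#   --weights              : warm start from the first pretrained ensemble member
#   --n_augmentations      : implicit (5-fold) data augmentation
#   --early_stopping_nbest : stop after 10 checks without improvement
calamari-train \
    --files "gt/train/*.png" \
    --validation "gt/val/*.png" \
    --weights calamari_models/fraktur_19th_century/0.ckpt.json \
    --n_augmentations 5 \
    --early_stopping_nbest 10 \
    --output_dir models/hsbfraktur.cala1
# The codec is re-computed against the new GT when warm-starting (shared
# codepoints are kept, new ones are added); explicit augmentation simply means
# mixing raw-color and differently binarized variants of the lines into --files.
```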
The models are simply named:
- for Fraktur prints: `hsbfraktur.cala1` (for Calamari 1) and `hsbfraktur.cala` (for Calamari 2)
- for Antiqua prints: `hsblatin.cala1` (for Calamari 1) and `hsblatin.cala` (for Calamari 2)
See release archives for model files.
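For instance, with the Calamari 1 CLI a released model could be applied to segmented line images roughly like this (the checkpoint file name is a placeholder for the file shipped in the release archive; the `.cala` models need the Calamari 2 CLI instead):

```bash
# Recognize segmented text-line images with the Calamari 1 Fraktur model;
# the checkpoint path is a placeholder for the file from the release archive.
calamari-predict \
    --checkpoint hsbfraktur.cala1.ckpt.json \
    --files "lines/*.png"
```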
Note: the models seem to have a soft dependency on (meaning the inference quality will be better with)
- textline segmentation with dewarping or some vertical padding (>4 px)
- binarization with little to no noise (for Antiqua) or raw colors (for Fraktur)

(This needs to be investigated further.)
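For example, tightly cropped line images can be given a few pixels of white vertical padding with ImageMagick before recognition (the 6 px value and the white background just illustrate the >4 px recommendation):

```bash
# add 6 px of padding at the top and bottom (0 px left/right) of a line image
convert line.png -bordercolor white -border 0x6 line_padded.png
```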
...on held-out validation data (used for checkpoint selection, 3.2k / 3.8k lines):
model | CER |
---|---|
hsbfraktur.cala1 | 1.82% |
hsbfraktur.cala | 0.50% |
hsblatin.cala1 | 0.95% |
hsblatin.cala | 0.25% |
...on truly representative extra data (771 / 1640 lines):
model | CER |
---|---|
hsbfraktur.cala1 | 0.45% |
hsbfraktur.cala | 0.47% |
hsblatin.cala1 | 1.23% |
hsblatin.cala | 0.52% |
The authors are grateful to the Center for Information Services and High Performance Computing at TU Dresden for providing its facilities for high-throughput calculations.