/digital_peter_aij2020

Materials of the AI Journey 2020 competition dedicated to the recognition of Peter the Great's manuscripts, https://ai-journey.ru/contest/task01


Digital Peter: recognition of Peter the Great's manuscripts

The Russian version of this document can be found here.

Preprint of the paper

Available at http://arxiv.org/abs/2103.09354

INFO ABOUT DATASET CORRECTIONS

The fixed train dataset can be downloaded here.

Moreover, you can fix the old version of the train dataset (which can still be found here) yourself using the following command:

python checker_train.py 'train/words'

Here checker_train.py is the script that applies the corrections to 'train/words' - the folder with the old versions of the transcribed strings.

A complete list of the names of the fixed files (as well as info about these corrections) can be found here and inside checker_train.py.

Statistics (train dataset):

Number of corrected files = 91
Total number of files = 6196
Percentage of corrected files = 1.47%

Similar fixes have been made to test_public and test_private.

Statistics (test_public dataset):

Number of corrected files = 24
Total number of files = 1527
Percentage of corrected files = 1.57%

Note that the public leaderboard will not be recalculated on the corrected test_public, given how minor the fixes are and how close the competition is to its end.

In contrast, the private leaderboard will be calculated on the corrected test_private.

MAIN DESCRIPTION

Digital Peter is an educational task with a historical slant, created on the basis of several AI technologies (Computer Vision, NLP, and knowledge graphs). The task was prepared jointly with the Saint Petersburg Institute of History (N.P.Lihachov mansion) of the Russian Academy of Sciences, the Federal Archival Agency of Russia, and the Russian State Archive of Ancient Acts.

Description of the task and data

Contestants are invited to create an algorithm for line-by-line recognition of manuscripts written by Peter the Great.

A detailed description of the task (with an immersion into the problem) can be found in desc/detailed_description_of_the_task_en.pdf

The NOT FIXED train dataset can be downloaded here. This dataset was prepared jointly with a working group of researchers from the Saint Petersburg Institute of History (N.P.Lihachov mansion) of the Russian Academy of Sciences - specialists in the history of the Petrine era, as well as in paleography and archeography. The Federal Archival Agency of Russia and the Russian State Archive of Ancient Acts were of great help, providing digital copies of the autographs.

There are 2 folders inside: images and words. The images folder contains jpg files with lines cut from Peter the Great's documents, and the words folder contains txt files (the transcribed versions of the jpg files). The mapping is performed by name.

For example,

the original line image (1_1_10.jpg) and its translation (1_1_10.txt):

                                  зело многа в гафѣ i непърестано выхо
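For example, the image/translation pairs can be collected by matching file names between the two folders. A minimal sketch (the train root path is an assumption; the folder layout follows the description above):

  from pathlib import Path

  # Pair each line image with its transcription by file name.
  # "train/images" and "train/words" follow the dataset layout described
  # above; the "train" root itself is an assumed local path.
  train_root = Path("train")

  pairs = []
  for img_path in sorted((train_root / "images").glob("*.jpg")):
      txt_path = train_root / "words" / (img_path.stem + ".txt")
      if txt_path.exists():
          pairs.append((img_path, txt_path.read_text(encoding="utf-8").strip()))

  print(len(pairs), "image/translation pairs loaded")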

File names have the format x_y_z, where x is the series number (a series is a set of pages with text), y is the page number, and z is the line number on that page. The absolute values of x, y, z carry no meaning in themselves (they are internal numbers); only the order of z matters for a fixed x_y. For example, the files

  987_65_10.jpg
  987_65_11.jpg
  987_65_12.jpg
  987_65_13.jpg
  987_65_14.jpg

contain exactly 5 consecutive lines.

Thus, by choosing specific values of x and y, one can restore the sequence of lines in a particular document: take the numbers z in ascending order for fixed x, y. This fact can be used to further improve recognition quality.
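As an illustration, here is a minimal sketch of such a reconstruction (assuming the train/images path used above):

  from collections import defaultdict
  from pathlib import Path

  # Group line images by (series, page) and sort each group by line number z.
  pages = defaultdict(list)
  for img_path in Path("train/images").glob("*.jpg"):
      x, y, z = map(int, img_path.stem.split("_"))
      pages[(x, y)].append((z, img_path))

  for (x, y), lines in sorted(pages.items()):
      ordered = [path for _, path in sorted(lines)]  # page (x, y), top to bottom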

The file names in the test dataset have the same structure.

The overwhelming majority of the lines were written by the hand of Peter the Great in the period from 1709 to 1713 (there are lines written in 1704, 1707 and 1708, but no more than 150 of them; such lines were included in both the train and the test datasets).

Baseline

Notebook with the baseline solution: baseline.ipynb

For text recognition, the baseline uses the architecture shown in the notebook.

One possible way to improve the performance is to apply a sequence-to-sequence model for post-processing, e.g.:

  • Encoder-Decoder with Bahdanau Attention;
  • Transformer-based sequence-to-sequence model.

A Jupyter notebook with the baseline that incorporates the sequence-to-sequence models: baseline_seq2seq.ipynb
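For intuition only, here is a minimal character-level sketch of such a post-processor in PyTorch. It is not the notebook's implementation; the vocabulary size and dimensions are illustrative:

  import torch.nn as nn

  # Post-processing idea: a seq2seq model that maps a noisy recognized string
  # to a corrected one, character by character. All sizes are illustrative.
  VOCAB, DIM = 100, 256

  class Corrector(nn.Module):
      def __init__(self):
          super().__init__()
          self.embed = nn.Embedding(VOCAB, DIM)
          self.transformer = nn.Transformer(
              d_model=DIM, nhead=8, num_encoder_layers=2,
              num_decoder_layers=2, batch_first=True)
          self.out = nn.Linear(DIM, VOCAB)

      def forward(self, noisy_ids, target_ids):
          # noisy_ids: character ids recognized by the OCR model
          # target_ids: gold character ids, shifted right during training
          tgt_mask = self.transformer.generate_square_subsequent_mask(
              target_ids.size(1))
          hidden = self.transformer(self.embed(noisy_ids),
                                    self.embed(target_ids), tgt_mask=tgt_mask)
          return self.out(hidden)  # per-position logits over the vocabulary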

The data and models are available here.

Description of the metrics

The leaderboard takes into account the following recognition quality metrics (computed on the test dataset):

  • CER - Character Error Rate:

$$\mathrm{CER} = \frac{\sum_{i=1}^{n} \mathrm{dist}_{c}(\mathrm{pred}_{i}, \mathrm{true}_{i})}{\sum_{i=1}^{n} \mathrm{len}_{c}(\mathrm{true}_{i})}$$

where $\mathrm{dist}_{c}$ is the Levenshtein distance calculated for character tokens (including spaces) and $\mathrm{len}_{c}$ is the length of the string in characters.

  • WER - Word Error Rate:

$$\mathrm{WER} = \frac{\sum_{i=1}^{n} \mathrm{dist}_{w}(\mathrm{pred}_{i}, \mathrm{true}_{i})}{\sum_{i=1}^{n} \mathrm{len}_{w}(\mathrm{true}_{i})}$$

where $\mathrm{dist}_{w}$ is the Levenshtein distance calculated for word tokens and $\mathrm{len}_{w}$ is the length of the string in words.

  • String Accuracy - the number of fully matching test strings divided by the total number of test strings:

$$\mathrm{String\ Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \left[\mathrm{pred}_{i} = \mathrm{true}_{i}\right]$$

Here $[\cdot]$ is the Iverson bracket: $[P] = 1$ if $P$ is true, and $0$ otherwise.

In the formulas above, $n$ is the size of the test sample, $\mathrm{pred}_{i}$ is the string of characters that the model recognized in the $i$-th image, and $\mathrm{true}_{i}$ is the true translation of the $i$-th image made by the expert.
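For concreteness, here is a minimal Python sketch of these corpus-level metrics. The reference implementation is eval/evaluate.py; the editdistance package is an assumed stand-in for the Levenshtein distance:

  import editdistance  # pip install editdistance (assumed here)

  def cer(preds, trues):
      # summed character edit distance over summed reference length
      dist = sum(editdistance.eval(p, t) for p, t in zip(preds, trues))
      return dist / sum(len(t) for t in trues)

  def wer(preds, trues):
      # the same ratio computed over word tokens
      dist = sum(editdistance.eval(p.split(), t.split())
                 for p, t in zip(preds, trues))
      return dist / sum(len(t.split()) for t in trues)

  def string_accuracy(preds, trues):
      # the Iverson-bracket average: the share of exact matches
      return sum(p == t for p, t in zip(preds, trues)) / len(trues)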

Follow this link to learn more about the metrics.

You can learn more about how the metrics are computed from the script eval/evaluate.py. It accepts two parameters as input - eval/pred_dir and eval/true_dir. The eval/true_dir folder should contain txt files with the true string translations (the structure is the same as in the words folder), while the eval/pred_dir folder should contain txt files with the strings recognized by the model. The mapping is again done by name, so the lists of file names in eval/true_dir and eval/pred_dir must be identical!
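A quick sanity check of that requirement might look like this (a sketch, using the folder names above):

  from pathlib import Path

  # evaluate.py matches predictions to ground truth by file name, so the
  # two folders must contain identical sets of txt file names.
  pred_names = {p.name for p in Path("eval/pred_dir").glob("*.txt")}
  true_names = {p.name for p in Path("eval/true_dir").glob("*.txt")}
  assert pred_names == true_names, f"name mismatch: {pred_names ^ true_names}"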

The quality can be calculated using the following command (called from the eval folder):

python evaluate.py pred_dir true_dir

The result is displayed as follows:

Ground truth -> Recognized
[ERR:3] "Это соревнование посвящено" -> "Эт срвнование посвящено"
[ERR:3] "распознаванию строк из рукописей" -> "распознаваниюстр ок из рукписей"
[ERR:2] "Петра I" -> "Птра 1"
[OK] "Удачи!" -> "Удачи!"
Character error rate: 11.267606%
Word error rate: 70.000000%
String accuracy: 25.000000%

CER, %, is the key metric used to sort the leaderboard (the lower the better). If two or more contestants have the same CER, they are sorted by WER, % (the lower the better). If both CER and WER match, String Accuracy, % is used (the higher the better). The next metric is Time, sec. - the execution time your model needs to process the test dataset on an NVidia Tesla V100 (the lower the better). If all the metrics match, the solution submitted earlier ranks first (and if everything is still the same, we sort alphabetically by team name).
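Expressed as a sort key, the ordering above would look roughly like this (the field names are illustrative, not taken from the competition code):

  # CER asc, WER asc, String Accuracy desc, Time asc,
  # submission time asc, team name asc.
  def leaderboard(submissions):
      return sorted(submissions,
                    key=lambda s: (s["cer"], s["wer"], -s["acc"],
                                   s["time_sec"], s["submitted_at"], s["team"]))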

The latest version of the model (see baseline.ipynb) has the following values for quality metrics calculated on the public part of the test sample:

CER = 10.526%
WER = 44.432%
String Accuracy = 21.662%
Time = 60 sec

The latest version of the baseline (see baseline_seq2seq.ipynb) achieved the following metrics on the public part of the test sample:

Encoder-Decoder with Bahdanau Attention
CER = 14.957%
WER = 49.716%
String Accuracy = 13.547%
Time = 359 sec

Transformer-based sequence-to-sequence model
CER = 14.489%
WER = 54.974%
String Accuracy = 9.228%
Time = 76 sec

Solution format

The accepted solution is a ZIP archive containing the algorithm (your code) and the entry point to run it. The entry point should be set in the metadata.json file in the root of your solution archive:

{
   "image": "<docker image>",
   "entry_point": "<entry point or sh script>"
}

For example:

{
   "image": "odsai/python-gpu",
   "entry_point": "python predict.py"
}

The data is read from the /data directory, and your predictions should go to /output. For each picture <image_name>.jpg in /data you have to produce the corresponding recognized-text file <image_name>.txt in /output.
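A skeleton of such an entry point might look as follows (a sketch; recognize() is a hypothetical stand-in for your model's inference call):

  from pathlib import Path

  def recognize(image_path: Path) -> str:
      raise NotImplementedError  # hypothetical: plug in your model's inference here

  out_dir = Path("/output")
  out_dir.mkdir(parents=True, exist_ok=True)
  for img in sorted(Path("/data").glob("*.jpg")):
      (out_dir / (img.stem + ".txt")).write_text(recognize(img), encoding="utf-8")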

The solution is run in a Docker container. You can start with the ready-to-go image we prepared: https://hub.docker.com/r/odsai/python-gpu. It contains CUDA 10.1, cuDNN 7.6, and the latest Python libraries. You can also use your own image for the competition; it must be uploaded to https://hub.docker.com, and its name must be specified in the metadata.json mentioned above.

Provided resources:

  • 8 CPU cores
  • 94 GB RAM
  • NVidia Tesla V100 GPU

Restrictions:

  • Up to 5 GB size of the working dir
  • Up to 5 GB size of an archive with the solution
  • 10 minutes calculation time limit

You can download the example solution: submit_example

Here is the .zip file to build a Docker container from the baseline solution that incorporates the transformer-based sequence-to-sequence model.

Leaderboard

The competition is over. Here is the final leaderboard for this competition. Scores are computed on the private set. The baseline solution presented in this GitHub repository has the following metrics: CER = 9.786%, WER = 44.222%, String Accuracy = 21.532%.