This code fine-tunes LayoutLM on the SROIE scanned receipts data, and uses Weights & Biases to log losses and metrics during training, and annotated images with bounding box predictions. Here is the accompanying Report.
First, make sure to install the pipenv environment, using pipenv install
. This requires pipenv to have access to python 3.9. To install and manage different python versions, try out pyenv. All instructions below assume the pipenv
environment is activated; to activate, run pipenv shell
.
The preprocessing for this slightly nonstandard, since the OCR and labels are given in a format that is not consistent with the per-token level classification setup that LayoutLM requires. More details given in this section of the report.
To run the preprocessing step, from the base directory, run
python -m scripts.preprocess
To train, run the following command from base directory
python -m scripts.train
The different objects used in preprocessing the data and training the model are contained in the objects
directory. Below is a rough listing of the files and objects contained
- objects
- constants.py
- config
- task_1_dir
- dataset.py
- SROIE(Dataset)
- model.py
- tokenizer
- model
- trainer.py
- Trainer
- transforms.py
- GetTokenBoxesLabels
- constants.py
Special attention should be brought to the callable class GetTokenBoxesLabels
defined in transforms.py
. This does three main things
- Tracks tokenization of words and appropriately duplicates the bounding boxes accommodate the tokenized sequence.
- Pads the input sequence to the max length allowable by the tokenizer (here it is BERTTokenizer, so 256).
- Normalizes coordinates to be between 0 and 1000. This is required by LayoutLM.
An example of why #1 is necessary might be if the sequence of (word, bbox)
pairs corresponding to a segment of text on a document is
[("I", [100, 100, 120, 150]), ("am", [130, 100, 160, 150]), ("sleeping", [140, 100, 280, 150])]
Here the bounding box coordinates are in the format [x1, y1, x2, y2]
, where x1
and x2
are the left- and right- most coordinates of the bounding box; and similarly y1
and y2
are the top- and bottom- most coordinates. The tokenizer itself operates only on the sequence of words
I am sleeping
and returns the sequence of tokens
I am sleep ##ing
But does not operate on the bounding boxes. For the purposes of LayoutLM, we want the (token, bbox)
sequence to be
[("I", [100, 100, 120, 150]), ("am", [130, 100, 160, 150]), ("sleep", [140, 100, 280, 150]), ("##ing", [140, 100, 280, 150])]
GetTokenBoxesLabels
takes care of this.