We try to reproduce the experiments for fine-tuning LayoutLMv3 on DocVQA using both the extractive and abstractive approaches.
I try to present every single detail within this repository. Note that this is not the official codebase from LayoutLMv3.
Work In Progress
`pip3 install -r requirements.txt`
Some of the code in this repository is adapted from this docvqa repo, which works on "LayoutLMv1 for DocVQA".
Note that the test set from that docvqa repo does not come with the ground-truth answers.
- Download the dataset from the DocVQA Website and put the `docvqa` folder under the `data` folder.
- Run the following command to create the Hugging Face dataset: `python3 -m preprocess.extract_spans`. You will then get a processed dataset called `docvqa_cached_extractive_all_lowercase_True_msr_True`.
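The extractive preprocessing has to locate each ground-truth answer as a contiguous span of OCR words; answers that cannot be matched are the "not found" spans counted below. A minimal sketch of that matching step is shown here. The function and variable names are illustrative assumptions, not the repo's actual API in `preprocess.extract_spans`:

```python
# Hypothetical sketch of answer-span matching: lowercase both the OCR words
# and the answer, then search for a contiguous run of words whose normalized
# forms equal the normalized answer tokens.

def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace (assumed normalization)."""
    return text.lower().strip()

def find_answer_span(words, answer):
    """Return (start, end) word indices of the first exact match, or None."""
    target = normalize(answer).split()
    norm_words = [normalize(w) for w in words]
    n, m = len(norm_words), len(target)
    if m == 0:
        return None
    for start in range(n - m + 1):
        if norm_words[start:start + m] == target:
            return start, start + m - 1
    return None  # counted as a "not found" span in the statistics

# Example: the answer "March 2021" spans words 3-4.
words = ["Invoice", "Date", ":", "March", "2021", "Total"]
print(find_answer_span(words, "March 2021"))  # (3, 4)
```

Answers that are paraphrased or split across non-adjacent OCR words fail this exact match, which is why a few thousand training spans go unmatched.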
For more details about the statistics after preprocessing, check out here. The final statistics on the number of spans found are as follows:

| Split | #questions | #found spans | #not found |
|---|---|---|---|
| Train | 39,643 | 36,759 | 2,704 |
| Validation | 5,349 | 4,950 | 399 |
| Test | 5,188 | - | - |

NOTE: The Microsoft READ API for OCR is not available. Please contact me if you want to use this dataset. (Thanks to @redthing1 for giving me the access.)
- Run `accelerate config` to configure your distributed training environment, then run the experiments with `accelerate launch docvqa_main.py --use_generation=0`. Set `use_generation` to 1 if you want to use the generation model. My distributed training environment: 6 GPUs.
| Model | Preprocessing | OCR Engine | Validation ANLS | Test ANLS |
|---|---|---|---|---|
| LayoutLMv3-base | lowercase inputs | built-in | 68.5% | - |
| LayoutLMv3-base | lowercase inputs | Microsoft READ API | 73.3% | 74.24% |
| LayoutLMv3-base | original cased | Microsoft READ API | 72.7% | - |
| LayoutLMv3-base + Bart Decoder | lowercase inputs | Microsoft READ API | 72.5% | - |
| LayoutLMv3-base + Roberta-base | lowercase inputs | Microsoft READ API | 73.0% | - |
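The ANLS numbers above are Average Normalized Levenshtein Similarity, the standard DocVQA metric. A self-contained sketch of how it is computed (with the benchmark's usual threshold tau = 0.5; this is not the repo's evaluation code):

```python
# ANLS: for each question, take the best similarity against any gold answer;
# similarities below the threshold are zeroed out, then average over questions.

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(predictions, gold_answers, tau=0.5):
    """predictions: list[str]; gold_answers: list[list[str]], one list per question."""
    scores = []
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for gold in golds:
            nl = levenshtein(pred.lower(), gold.lower()) / max(len(pred), len(gold), 1)
            sim = 1 - nl if nl < tau else 0.0  # zero out near-misses past the threshold
            best = max(best, sim)
        scores.append(best)
    return sum(scores) / len(scores)

print(anls(["march 2021"], [["March 2021"]]))  # 1.0 (case-insensitive exact match)
```

The thresholding means a prediction must share more than half of its characters (in edit-distance terms) with some gold answer to score at all, which penalizes OCR-garbled outputs less harshly than exact match would.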
The performance is still far behind what is reported in the paper.
Note: adding a sliding window currently gives performance around 64%. It seems harmful to do so.
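The sliding-window idea is to split long documents into overlapping 512-token chunks so that an answer span beyond the encoder's limit still falls entirely inside some window. A minimal sketch, where the window and overlap sizes are assumptions rather than the repo's actual settings:

```python
# Hypothetical sliding-window chunking: yield overlapping (start, end) token
# ranges so every token, and ideally every answer span, is covered by at
# least one window.

def sliding_windows(num_tokens: int, window: int = 512, overlap: int = 128):
    """Yield (start, end) token ranges covering the full sequence."""
    start = 0
    while True:
        end = min(start + window, num_tokens)
        yield start, end
        if end == num_tokens:
            break
        start += window - overlap  # step forward, keeping `overlap` shared tokens

print(list(sliding_windows(900)))  # [(0, 512), (384, 896), (768, 900)]
```

One likely source of the score drop reported above: each window is scored independently, so windows that do not contain the answer can still produce confident spurious spans, and the predictions then need careful aggregation across windows.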
- Code for tokenization and collating (:white_check_mark:)
- Code for training (:white_check_mark:)
- Further tune the performance via hyperparameters and the casing issue (:white_check_mark:)
- Add a decoder for generation (:white_check_mark:)
- Sliding window to handle matched answers that fall outside the 512-token limit (:white_check_mark:)