This repository contains the code for Field Extraction from Forms with Unlabeled Data.
CUDA="11.0"
CUDNN="8"
UBUNTU="18.04"
bash install.sh
# under our project root folder
python setup.py develop
*We have pre-processed the INV-CDIP test set and placed it under datasets/.
*Download our model, which was pre-trained on the unlabeled INV-CDIP training set.
python main.py \
--model_name_or_path pretrained_model_acl2022 \
--output_dir $OUTPUT_PATH
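If `$OUTPUT_PATH` is unset, the shell passes `--output_dir` with no value. The following is a minimal sketch, not part of this repository, that builds the same command in Python and refuses an empty output directory. The helper name `build_eval_command` is hypothetical; only the two flags shown above are used.

```python
import os
import shlex


def build_eval_command(model_path, output_dir):
    """Return the argument list for the evaluation command shown above.

    Hypothetical helper: it only mirrors the two flags from the README.
    """
    if not output_dir:
        raise ValueError("output_dir is empty; set OUTPUT_PATH first")
    return [
        "python", "main.py",
        "--model_name_or_path", model_path,
        "--output_dir", output_dir,
    ]


if __name__ == "__main__":
    output_dir = os.environ.get("OUTPUT_PATH", "")
    if output_dir:
        cmd = build_eval_command("pretrained_model_acl2022", output_dir)
        # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
    else:
        print("Set OUTPUT_PATH before running the evaluation command.")
```

Building the command as a list avoids quoting problems if the output path contains spaces.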
*Download the images of the INV-CDIP test set and put them under datasets/imgs.
python vis_results.py --pred_path $OUTPUT_PATH/prediction_pairs.pkl
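Before visualizing, it can help to confirm that the prediction file was written and is non-empty. The sketch below is an assumption-laden helper, not part of this repository: it assumes only that `prediction_pairs.pkl` is a standard Python pickle and makes no claim about the structure of the stored object. The name `summarize_pickle` is hypothetical. Load pickle files only from sources you trust, since unpickling can execute arbitrary code.

```python
import os
import pickle


def summarize_pickle(path):
    """Load a pickle file and return (type name, length or None).

    Hypothetical helper: the layout of the stored object is not assumed.
    """
    with open(path, "rb") as f:
        obj = pickle.load(f)
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size


if __name__ == "__main__":
    pred_path = os.path.join(os.environ.get("OUTPUT_PATH", ""), "prediction_pairs.pkl")
    if os.path.exists(pred_path):
        print(summarize_pickle(pred_path))
    else:
        print("No prediction file found at", pred_path)
```

If the file stores objects of classes defined in this project, run the script from the project root so that those classes can be imported during loading.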
If you find this codebase useful, please cite our paper:
@article{gao2021field,
title={Field Extraction from Forms with Unlabeled Data},
author={Gao, Mingfei and Chen, Zeyuan and Naik, Nikhil and Hashimoto, Kazuma and Xiong, Caiming and Xu, Ran},
journal={ACL Spa-NLP Workshop},
year={2022}
}
Our code is released under BSD 3-Clause.
Our pre-trained model is released under CC BY-NC 4.0.
The INV-CDIP dataset is released under CC BY-NC 4.0. The underlying documents to which the dataset refers are from the Tobacco Collections of Industry Documents Library. Please see Copyright and Fair Use for more information.
Please send an email to mingfei.gao@salesforce.com if you have questions.