LASER

This is the PyTorch implementation for the ACL 2022 Findings paper Towards Few-shot Entity Recognition in Document Images: A Label-aware Sequence-to-Sequence Framework.

Quick Start

See requirements.txt (generated by pipreqs). If you run into any issues, please contact us :)

Pretrained Weight

Please download the pre-trained weights of LayoutReader from here or here, and copy pytorch_model.bin into ./weights/layoutreader/.

Train & Decode & Evaluate FUNSD

git clone https://github.com/zlwang-cs/LASER-release.git
cd LASER-release
mkdir outputs
cd shell_scripts
sh run_few_shot_FUNSD.sh 0

The digit at the end of the last line is the ID of the GPU you want to use.


Your Custom Dataset

Required Format

Each dataset consists of three files; please put them in a folder under /data:

  1. meta.json
  2. train/test-text-s2s.jsons
  3. train/test-layout-s2s.jsons

Dataset Meta (A json file describing the dataset information)

Contains the following attributes:

  1. labels: The entity types
  2. words: The words used in the labels
  3. tokens: The tokens used in the labels (For better performance, please use simple label words so that each label maps to a single token)
  4. token_dict: A dictionary mapping the token to the token index
  5. next_token_dict: A dictionary mapping each token to the tokens that may follow it (Refer to the provided dataset, FUNSD, for an example)
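As an illustrative sketch of these attributes, a meta.json could be assembled as below. The label words, token strings, and token indices are hypothetical placeholders, not the real FUNSD values; check the provided FUNSD meta.json for the actual content.

```python
import json

# Hypothetical meta.json content; all labels/tokens/indices are placeholders.
meta = {
    "labels": ["question", "answer", "header", "other"],   # entity types
    "words": ["question", "answer", "header", "other"],    # words used in the labels
    "tokens": ["question", "answer", "header", "other"],   # one token per label
    "token_dict": {"question": 0, "answer": 1, "header": 2, "other": 3},
    # maps each token to the tokens allowed to follow it (placeholder transitions)
    "next_token_dict": {"question": ["answer"], "answer": ["question"],
                        "header": [], "other": []},
}

meta_json = json.dumps(meta, indent=2)
```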

File for text data (train-text-s2s.jsons and test-text-s2s.jsons for train/test respectively).

Each line is a json object which has 4 attributes:

  1. src: The input text
  2. tgt: The input text embedded with tags: <BEGIN> Sender <END> question <TAG_END>
  3. filename: Name of the file
  4. part_idx: When a file is too long or is augmented into multiple samples, several input pieces come from the same file; we number them sequentially.
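One line of train-text-s2s.jsons can be sketched as follows. The src/tgt strings extend the <BEGIN> Sender <END> question <TAG_END> example above, and the filename is hypothetical:

```python
import json

# One JSON line of train-text-s2s.jsons; src/tgt content and filename are illustrative.
record = {
    "src": "Sender John Smith",
    # input text with each entity wrapped in <BEGIN>/<END> and followed by its label
    "tgt": "<BEGIN> Sender <END> question <TAG_END> <BEGIN> John Smith <END> answer <TAG_END>",
    "filename": "form_0001",   # hypothetical filename
    "part_idx": 0,             # piece index when one file yields several samples
}
line = json.dumps(record)
```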

File for layout data (train-layout-s2s.jsons and test-layout-s2s.jsons for train/test respectively).

Each line is a json object which has 4 attributes:

  1. src: A list of normalized bounding boxes. Each box corresponds to a word in the src of the text part. [[335, 154, 389, 169],...]
  2. tgt: A list of normalized bounding boxes. Each box corresponds to a word in the tgt of the text part. The special tags are as follows:
  • <BEGIN>: [1001, 1001, 1001, 1001]
  • <END>: [1002, 1002, 1002, 1002]
  • <TAG_END>: [1003, 1003, 1003, 1003]
  • the entity type labels: [1004, 1004, 1004, 1004], ...
  3. w: the original width of the page
  4. h: the original height of the page
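Putting the above together, a single layout line might be built as below. The special-tag boxes follow the constants listed above; the word box, page size, and the assumption that further label boxes continue as [1005, ...], [1006, ...] are illustrative:

```python
import json

# Special-tag boxes as listed above.
TAG_BOX = {"<BEGIN>": [1001] * 4, "<END>": [1002] * 4, "<TAG_END>": [1003] * 4}
# Entity-type label boxes start at 1004; later labels continuing as 1005, 1006, ...
# is an assumption for this sketch.
LABEL_BOX = {"question": [1004] * 4}

# One JSON line of train-layout-s2s.jsons, paired word-for-word with the text file.
word_box = [335, 154, 389, 169]          # normalized box for one src word (illustrative)
record = {
    "src": [word_box],
    "tgt": [TAG_BOX["<BEGIN>"], word_box, TAG_BOX["<END>"],
            LABEL_BOX["question"], TAG_BOX["<TAG_END>"]],
    "w": 762,    # original page width (illustrative)
    "h": 1000,   # original page height (illustrative)
}
line = json.dumps(record)
```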

Few-shot Info

A json file under data_utils

{
  "1": {                // the number of shots
    "1":                // the random seed used to generate this few-shot list
      [ "A" ],          // the exact filenames in this list
    "2": 
      [ "B" ],
    ...
  }
}
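The nested structure above can be generated programmatically; a minimal sketch (the shot counts, seeds, and filenames are placeholders) looks like this:

```python
import json

# Hypothetical few-shot info: for each shot count, and for each random seed,
# the exact filenames selected for that few-shot training list.
few_shot_info = {
    "1": {            # number of shots
        "1": ["A"],   # seed "1" -> selected filenames
        "2": ["B"],   # seed "2" -> a different selection
    },
    "2": {
        "1": ["A", "B"],
    },
}
info_json = json.dumps(few_shot_info, indent=2)
```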

Shell Script to Run the Experiments

See shell_scripts/run_few_shot_CUSTOM.sh


Collect Results

See collect_results.ipynb


Citation

If you find the project useful, please cite our paper:

@inproceedings{wang2022towards,
  title={Towards Few-shot Entity Recognition in Document Images: A Label-aware Sequence-to-Sequence Framework},
  author={Wang, Zilong and Shang, Jingbo},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
  pages={4174--4186},
  year={2022}
}