LASER

This is the PyTorch implementation for the ACL 2022 Findings paper Towards Few-shot Entity Recognition in Document Images: A Label-aware Sequence-to-Sequence Framework.

Quick Start

See requirements.txt (generated by pipreqs). If you run into any issues, please contact us :)

Pretrained Weight

Please download the pre-trained weights of LayoutReader from here or here, and copy pytorch_model.bin into ./weights/layoutreader/.

Train & Decode & Evaluate FUNSD

git clone https://github.com/zlwang-cs/LASER-release.git
cd LASER-release
mkdir outputs
cd shell_scripts
sh run_few_shot_FUNSD.sh 0

The digit at the end of the last line is the ID of the GPU you want to use.


Your Custom Dataset

Required Format

Each dataset consists of three files; please put them in a folder under /data:

  1. meta.json
  2. train/test-text-s2s.jsons
  3. train/test-layout-s2s.jsons

Dataset Meta (A json file describing the dataset information)

Contains the following attributes:

  1. labels: The entity types
  2. words: The words used in the labels
  3. tokens: The tokens used in the labels (For better performance, please use simple label words so that each label maps to a single token)
  4. token_dict: A dictionary mapping the token to the token index
  5. next_token_dict: A dictionary mapping each token to the tokens that may follow it (Refer to the provided dataset, FUNSD, for an example)
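As an illustrative sketch of these attributes, a meta.json could be assembled as below. The label words, token strings, and token indices are hypothetical placeholders, not the real FUNSD values; check the provided FUNSD meta.json for the actual content.

```python
import json

# Hypothetical meta.json content; all labels/tokens/indices are placeholders.
meta = {
    "labels": ["question", "answer", "header", "other"],   # entity types
    "words": ["question", "answer", "header", "other"],    # words used in the labels
    "tokens": ["question", "answer", "header", "other"],   # one token per label
    "token_dict": {"question": 0, "answer": 1, "header": 2, "other": 3},
    # maps each token to the tokens allowed to follow it (placeholder transitions)
    "next_token_dict": {"question": ["answer"], "answer": ["question"],
                        "header": [], "other": []},
}

meta_json = json.dumps(meta, indent=2)
```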

File for text data (train-text-s2s.jsons and test-text-s2s.jsons for train/test respectively).

Each line is a json object which has 4 attributes:

  1. src: The input text
  2. tgt: The input text embedded with tags: <BEGIN> Sender <END> question <TAG_END>
  3. filename: Name of the file
  4. part_idx: When a file is too long or is augmented into multiple samples, several input pieces come from the same file; we number them sequentially.
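One line of train-text-s2s.jsons can be sketched as follows. The src/tgt strings extend the <BEGIN> Sender <END> question <TAG_END> example above, and the filename is hypothetical:

```python
import json

# One JSON line of train-text-s2s.jsons; src/tgt content and filename are illustrative.
record = {
    "src": "Sender John Smith",
    # input text with each entity wrapped in <BEGIN>/<END> and followed by its label
    "tgt": "<BEGIN> Sender <END> question <TAG_END> <BEGIN> John Smith <END> answer <TAG_END>",
    "filename": "form_0001",   # hypothetical filename
    "part_idx": 0,             # piece index when one file yields several samples
}
line = json.dumps(record)
```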

File for layout data (train-layout-s2s.jsons and test-layout-s2s.jsons for train/test respectively).

Each line is a json object which has 4 attributes:

  1. src: A list of normalized bounding boxes. Each box corresponds to a word in the src of the text part. [[335, 154, 389, 169],...]
  2. tgt: A list of normalized bounding boxes. Each box corresponds to a word in the tgt of the text part. The special tags are as follows:
  • <BEGIN>: [1001, 1001, 1001, 1001]
  • <END>: [1002, 1002, 1002, 1002]
  • <TAG_END>: [1003, 1003, 1003, 1003]
  • the entity type labels: [1004, 1004, 1004, 1004], ...
  3. w: the original width of the page
  4. h: the original height of the page
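Putting the above together, a single layout line might be built as below. The special-tag boxes follow the constants listed above; the word box, page size, and the assumption that further label boxes continue as [1005, ...], [1006, ...] are illustrative:

```python
import json

# Special-tag boxes as listed above.
TAG_BOX = {"<BEGIN>": [1001] * 4, "<END>": [1002] * 4, "<TAG_END>": [1003] * 4}
# Entity-type label boxes start at 1004; later labels continuing as 1005, 1006, ...
# is an assumption for this sketch.
LABEL_BOX = {"question": [1004] * 4}

# One JSON line of train-layout-s2s.jsons, paired word-for-word with the text file.
word_box = [335, 154, 389, 169]          # normalized box for one src word (illustrative)
record = {
    "src": [word_box],
    "tgt": [TAG_BOX["<BEGIN>"], word_box, TAG_BOX["<END>"],
            LABEL_BOX["question"], TAG_BOX["<TAG_END>"]],
    "w": 762,    # original page width (illustrative)
    "h": 1000,   # original page height (illustrative)
}
line = json.dumps(record)
```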

Few-shot Info

A json file under data_utils

{
  "1": {                // the number of shots
    "1":                // the random seed used to generate this few-shot list
      [ "A" ],          // the exact filenames in this list
    "2": 
      [ "B" ],
    ...
  }
}
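The nested structure above can be generated programmatically; a minimal sketch (the shot counts, seeds, and filenames are placeholders) looks like this:

```python
import json

# Hypothetical few-shot info: for each shot count, and for each random seed,
# the exact filenames selected for that few-shot training list.
few_shot_info = {
    "1": {            # number of shots
        "1": ["A"],   # seed "1" -> selected filenames
        "2": ["B"],   # seed "2" -> a different selection
    },
    "2": {
        "1": ["A", "B"],
    },
}
info_json = json.dumps(few_shot_info, indent=2)
```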

Shell Script to Run the Experiments

See shell_scripts/run_few_shot_CUSTOM.sh


Collect Results

See collect_results.ipynb


Citation

If you find the project useful, please cite our paper:

@inproceedings{wang2022towards,
  title={Towards Few-shot Entity Recognition in Document Images: A Label-aware Sequence-to-Sequence Framework},
  author={Wang, Zilong and Shang, Jingbo},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
  pages={4174--4186},
  year={2022}
}