Repo for the paper "Retrieval Enhanced Model for Commonsense Generation" (Findings of ACL-IJCNLP 2021).
You can get the CommonGen dataset from its official website or from the Hugging Face Datasets Hub.
We provide our external corpora, consisting of VATEX, ActivityNet, SNLI, and MNLI, in the data folder.
In addition, you can build your own external corpora.
- Download the entity vocabulary of ConceptNet here.
- Use the external corpora to construct a pre-training dataset in the CommonGen format, following the example:
DATA_PATH=/your/path/to/data
OUTPUT_PATH=/your/path/to/output
CONCEPT_PATH=/your/path/to/concept
TEST_SRC_PATH=/your/path/to/commongen_test_src_path
TEST_TGT_PATH=/your/path/to/commongen_test_tgt_path
python data_augmentation.py \
--data_path $DATA_PATH \
--output_path $OUTPUT_PATH \
--concept_path $CONCEPT_PATH \
--commongen_test_src_path $TEST_SRC_PATH \
--commongen_test_tgt_path $TEST_TGT_PATH
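The augmentation step pairs each corpus sentence with the concepts it mentions, producing CommonGen-style (concept set, sentence) examples. A minimal sketch of the idea (a hypothetical helper, not the repo's data_augmentation.py; the stopword list is a stand-in for filtering against the ConceptNet vocabulary):

```python
import re

# Illustrative stopword list; the actual script matches tokens against the
# downloaded ConceptNet entity vocabulary instead.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "were", "with",
             "and", "or", "of", "in", "on", "for", "to", "at", "by", "after"}

def sentence_to_example(sentence, concept_vocab, max_concepts=5):
    """Turn one corpus sentence into a CommonGen-style (concepts, target) pair."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    concepts = []
    for tok in tokens:
        # keep in-vocabulary content words, preserving first-occurrence order
        if tok in concept_vocab and tok not in STOPWORDS and tok not in concepts:
            concepts.append(tok)
    return {"src": " ".join(concepts[:max_concepts]), "tgt": sentence}
```

For example, `sentence_to_example("A dog runs after the ball in the park", {"dog", "ball", "park"})` yields the source string "dog ball park" with the original sentence as the target.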
- Use the Matching Retriever to retrieve prototype sentences for the pre-training data, following the example:
DATA_PATH=/your/path/to/data
CORPUS_PATH=/your/path/to/corpus
REF_PATH=/your/path/to/reference
OUTPUT_PATH=/your/path/to/output
python matching_retriever.py \
--data_path $DATA_PATH \
--corpus_path $CORPUS_PATH \
--reference_path $REF_PATH \
--output_path $OUTPUT_PATH
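The Matching Retriever scores each corpus sentence by how many of the query concepts it covers and keeps the best matches. Conceptually (a simplified sketch; the actual matching_retriever.py may use different tokenization and scoring):

```python
import heapq

def match_score(concepts, sentence_tokens):
    """Fraction of query concepts covered by the candidate sentence."""
    return len(set(concepts) & set(sentence_tokens)) / len(concepts)

def retrieve(concepts, corpus, k=2):
    """Return the top-k corpus sentences by concept coverage."""
    scored = ((match_score(concepts, s.lower().split()), s) for s in corpus)
    return [s for _, s in heapq.nlargest(k, scored)]
```

The retrieved sentences are then concatenated with the concept set as prototypes for the generator's input.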
You can pretrain RE-T5 following the example:
export PYTHONPATH="../":"${PYTHONPATH}"
DATA_DIR=/your/path/to/data
OUTPUT_DIR=/your/path/to/output
MODEL_NAME=t5-base
DATASET_MODE=shuffle # shuffle or original
python finetune.py \
--gpus=4 \
--gradient_accumulation_steps=4 \
--warmup_steps=5000 \
--data_dir=$DATA_DIR \
--model_name_or_path=$MODEL_NAME \
--num_train_epochs=5 \
--learning_rate=2e-6 \
--weight_decay=0.01 \
--adam_epsilon=1e-6 \
--train_batch_size=16 \
--eval_batch_size=16 \
--max_source_length=256 \
--max_target_length=256 \
--val_max_target_length=256 \
--test_max_target_length=256 \
--val_check_interval=1.0 --n_val=-1 \
--output_dir=$OUTPUT_DIR \
--do_train \
--eval_beam=5 \
--overwrite_output_dir \
--dataset_mode=$DATASET_MODE \
"$@"
Note: DATASET_MODE determines whether the input concepts are shuffled.
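With DATASET_MODE=shuffle, the order of the input concepts is randomized so the model does not rely on a fixed concept order. A sketch of the idea (not the repo's dataloader):

```python
import random

def maybe_shuffle_concepts(src_line, mode, rng=random):
    """Shuffle space-separated input concepts when mode == 'shuffle'."""
    concepts = src_line.split()
    if mode == "shuffle":
        rng.shuffle(concepts)  # in-place permutation of the concept order
    return " ".join(concepts)
```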
- Prepare the training and validation sets. For each concept set in the CommonGen training set, we use its paired sentence as a positive example, and we randomly sample another sentence, also from the training set, as a negative example. Each line of the data file should be a JSON object in the following format:
{"sentence1": "chicken cheese pizza broccoli", "sentence2": "A grilled pizza with chicken, broccoli and cheese.", "label": 1}
{"sentence1": "chicken cheese pizza broccoli", "sentence2": "The pan pizza topped with broccoli, chicken, and cheese, is ready for the oven", "label": 1}
{"sentence1": "chicken cheese pizza broccoli", "sentence2": "model released portrait of an old boy opening the film character presents in holidays", "label": 0}
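Building such a file amounts to emitting one positive and one randomly sampled negative per training example. A minimal sketch (a hypothetical helper; assumes at least two examples so a negative can always be drawn):

```python
import json
import random

def build_pairs(examples, rng=random):
    """examples: list of (concept_str, sentence) tuples.
    Returns JSON lines with one positive and one sampled negative each."""
    lines = []
    for i, (concepts, sent) in enumerate(examples):
        lines.append(json.dumps(
            {"sentence1": concepts, "sentence2": sent, "label": 1}))
        # draw a sentence paired with a *different* concept set as the negative
        j = rng.randrange(len(examples))
        while j == i:
            j = rng.randrange(len(examples))
        lines.append(json.dumps(
            {"sentence1": concepts, "sentence2": examples[j][1], "label": 0}))
    return lines
```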
- Train the Trainable Retriever following the example:
TRAIN_DATA_PATH=/your/path/to/train/data
VAL_DATA_PATH=/your/path/to/val/data
MODEL_NAME=bert-base-cased
OUTPUT_DIR=/your/path/to/output
python -m torch.distributed.launch \
--nproc_per_node 2 run_glue.py \
--model_name_or_path ${MODEL_NAME} \
--train_file ${TRAIN_DATA_PATH} \
--validation_file ${VAL_DATA_PATH} \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 64 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir ${OUTPUT_DIR} \
--save_total_limit 5
- Obtain the IDs of sentences that contain the concepts:
DATA_PATH=/your/path/to/data
CORPUS_PATH=/your/path/to/corpus
REF_PATH=/your/path/to/reference
OUTPUT_PATH=/your/path/to/output
python retrieve_all_ids.py \
--data_path $DATA_PATH \
--corpus_path $CORPUS_PATH \
--reference_path $REF_PATH \
--output_path $OUTPUT_PATH
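A common way to implement this step is an inverted index from token to sentence IDs; any sentence containing at least one query concept becomes a candidate. A sketch (illustrative, not the repo's retrieve_all_ids.py):

```python
from collections import defaultdict

def build_inverted_index(corpus):
    """Map each token to the set of corpus sentence ids containing it."""
    index = defaultdict(set)
    for sid, sentence in enumerate(corpus):
        for tok in set(sentence.lower().split()):
            index[tok].add(sid)
    return index

def candidate_ids(concepts, index):
    """Ids of sentences containing at least one of the query concepts."""
    ids = set()
    for c in concepts:
        ids |= index.get(c, set())
    return sorted(ids)
```

The resulting ID lists are the candidates that the Trainable Retriever ranks in the next step.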
- Use the Trainable Retriever to rank the candidates from the previous step and select the top-k prototype sentences:
DATA_PATH=/your/path/to/data
CORPUS_PATH=/your/path/to/corpus
RE_IDS_PATH=/your/path/to/retrieval_ids/from/previous/step
OUTPUT_PATH=/your/path/to/output
MODEL_PATH=/your/path/to/trainable_model
python trainable_retriever.py \
--data_path $DATA_PATH \
--corpus_path $CORPUS_PATH \
--retrieval_ids_path $RE_IDS_PATH \
--output_path $OUTPUT_PATH \
--model_path $MODEL_PATH
You can fine-tune RE-T5 on CommonGen following the example:
export PYTHONPATH="../":"${PYTHONPATH}"
DATA_DIR=/your/path/to/data
OUTPUT_DIR=/your/path/to/output
MODEL_PATH=/your/path/to/model
DATASET_MODE=shuffle # shuffle or original
python finetune.py \
--gpus=4 \
--gradient_accumulation_steps=3 \
--warmup_steps=400 \
--data_dir=$DATA_DIR \
--model_name_or_path=$MODEL_PATH \
--num_train_epochs=20 \
--learning_rate=5e-5 \
--train_batch_size=64 \
--eval_batch_size=64 \
--max_source_length=128 \
--max_target_length=32 \
--val_check_interval=1.0 --n_val=-1 \
--output_dir=$OUTPUT_DIR \
--do_train \
--eval_beam=5 \
--overwrite_output_dir \
--dataset_mode=$DATASET_MODE \
"$@"
Run inference following the example:
MODEL_PATH=/your/path/to/model
TEST_SRC_PATH=/your/path/to/test_source
TEST_TGT_PATH=/your/path/to/test_target
OUTPUT_PATH=/your/path/to/output
SCORE_PATH=/your/path/to/rouge.json
python run_eval.py $MODEL_PATH \
$TEST_SRC_PATH \
$OUTPUT_PATH \
--reference_path $TEST_TGT_PATH \
--score_path $SCORE_PATH \
--task summarization \
--max_source_length 128 \
--max_target_length 32 \
--num_beams 5 \
--length_penalty 0.6 \
--early_stopping true \
--bs 32
Note: We provide our model's inference output in the results folder for reference.
Please follow the CommonGen GitHub repo to set up the evaluation environment and evaluate the results.
Note: Use clean_out.py on your results before evaluation.
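We do not restate clean_out.py here, but the cleanup it performs is typically along these lines: stripping leftover seq2seq special tokens and normalizing whitespace (an illustrative guess only; run the repo's script for actual evaluation):

```python
import re

def clean_line(line):
    """Strip common seq2seq artifacts before evaluation.
    (Illustrative only; not the repo's clean_out.py.)"""
    line = re.sub(r"</?s>|<pad>|<unk>", "", line)  # drop special tokens
    return re.sub(r"\s+", " ", line).strip()       # collapse whitespace
```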
@inproceedings{wang-etal-2021-retrieval-enhanced,
title = "Retrieval Enhanced Model for Commonsense Generation",
author = "Wang, Han and
Liu, Yang and
Zhu, Chenguang and
Shou, Linjun and
Gong, Ming and
Xu, Yichong and
Zeng, Michael",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.269",
doi = "10.18653/v1/2021.findings-acl.269",
pages = "3056--3062",
}