
:coconut: Code & Data for Comparative Opinion Summarization via Collaborative Decoding (Iso et al.; Findings of ACL 2022)

Comparative Opinion Summarization via Collaborative Decoding

This repository contains the dataset, source code, and trained models for Comparative Opinion Summarization via Collaborative Decoding.



    title = {{C}omparative {O}pinion {S}ummarization via {C}ollaborative {D}ecoding},
    author = {Hayate Iso and
              Xiaolan Wang and
              Stefanos Angelidis and
              Yoshihiko Suhara},
    booktitle = {Findings of the Association for Computational Linguistics (ACL)},
    month = {May},
    year = {2022}


Please use the commands below to set up the environment and install the requirements.

conda create -n cocosum python=3.8
conda activate cocosum
pip install -r requirements.txt 

CoCoTrip dataset

The script prep.py automatically downloads the public TripAdvisor dataset by Wang et al. (2010) and builds the CoCoTrip dataset, which includes the self-supervised training sets and the evaluation sets.

python prep.py ./data/
head -n 50000 ./data/train_cont_all.jsonl > ./data/train_cont_50k.jsonl 
head -n 1000 ./data/train_comm_pair.jsonl > ./data/train_comm_pair_1k.jsonl
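
Where `head` is not available, the same subsetting can be sketched in Python. This is a minimal sketch, assuming one JSON record per line; the helper name `head_jsonl` is hypothetical and not part of this repository.

```python
import itertools
from pathlib import Path


def head_jsonl(src, dst, n):
    """Copy the first n lines of src to dst and return how many lines were written."""
    written = 0
    with Path(src).open(encoding="utf-8") as fin, \
            Path(dst).open("w", encoding="utf-8") as fout:
        # islice stops after n lines, so the full training file is never loaded.
        for line in itertools.islice(fin, n):
            fout.write(line)
            written += 1
    return written


# Usage mirroring the head commands above (run prep.py first):
# head_jsonl("./data/train_cont_all.jsonl", "./data/train_cont_50k.jsonl", 50000)
# head_jsonl("./data/train_comm_pair.jsonl", "./data/train_comm_pair_1k.jsonl", 1000)
```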

Available models

All models are hosted on the Hugging Face 🤗 model hub (https://huggingface.co/megagonlabs/).

| Model name | Task | Training setting |
|---|---|---|
| megagonlabs/cocosum-cont-self | Contrastive | Self-supervision |
| megagonlabs/cocosum-cont-few | Contrastive | Few-shot |
| megagonlabs/cocosum-comm-self | Common | Self-supervision |
| megagonlabs/cocosum-comm-few | Common | Few-shot |
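
The four model names follow one pattern, task (`cont` or `comm`) then training setting (`self` or `few`). A small sketch of that naming scheme; the helper `model_name` is hypothetical and only builds the hub identifier, it does not download anything.

```python
TASKS = ("cont", "comm")    # contrastive / common
SETTINGS = ("self", "few")  # self-supervision / few-shot


def model_name(task, setting):
    """Build the hub identifier for a listed model, e.g. megagonlabs/cocosum-cont-few."""
    if task not in TASKS:
        raise ValueError(f"task must be one of {TASKS}, got {task!r}")
    if setting not in SETTINGS:
        raise ValueError(f"setting must be one of {SETTINGS}, got {setting!r}")
    return f"megagonlabs/cocosum-{task}-{setting}"
```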



You can generate contrastive and common opinion summaries by combining two base models, an approach referred to as collaborative decoding (Co-Decoding). After building the CoCoTrip dataset, run the following commands to download our pre-trained models from the Hugging Face model hub and generate the summaries.

# Co-Decoding for Contrastive Opinion Summarization
# The third positional argument is the directory to store generated summaries,
# and the fourth is the target model.
# --counter_model_checkpoint is the counterpart model; --alpha is a hyper-parameter.
python decode.py \
  ./data/ \
  cont \
  gen/cont/codec/ \
  megagonlabs/cocosum-cont-few \
  --counter_model_checkpoint megagonlabs/cocosum-cont-few \
  --alpha 0.2 \
  --top_p 0.9

# Co-Decoding for Common Opinion Summarization
# The fourth positional argument is the target model.
# --counter_model_checkpoint is the counterpart model; the contrastive
# summarization model is used in this case.
# --alpha is a hyper-parameter; --ens_method add combines the outputs by summing them up.
python decode.py \
  ./data/ \
  comm \
  gen/comm/codec \
  megagonlabs/cocosum-comm-few \
  --counter_model_checkpoint megagonlabs/cocosum-cont-few \
  --alpha 0.4 \
  --top_p 0.9 \
  --do_ens_tgt \
  --do_ens_cnt \
  --ens_method add
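
To give a rough intuition for the `--alpha` and `--top_p` flags in the contrastive case, the sketch below contrasts a target next-token distribution with a counterpart one and then applies nucleus (top-p) filtering. It is an illustration only and is not taken from decode.py: the subtractive form `log p_target - alpha * log p_counter` and both function names are assumptions, and the actual combination rule (including the ensembling used for common summaries) may differ.

```python
import math


def combine_log_probs(target_logp, counter_logp, alpha):
    """Score each token as log p_target - alpha * log p_counter, then normalize.

    ASSUMPTION: illustrative combination rule, not the one in decode.py.
    """
    scores = [t - alpha * c for t, c in zip(target_logp, counter_logp)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass  # renormalize over the kept tokens
    return out
```

With `alpha` set to 0 the counterpart has no effect and the target distribution is returned unchanged; larger values penalize tokens that the counterpart model also finds likely.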



After building the CoCoTrip dataset, you can train the base contrastive and common opinion summarization models on the self-supervised dataset by running the following commands.

# Contrastive Summarization Model -- Self-supervision
python train.py \
  ./data/train_cont_50k.jsonl \
  --default_root_dir ./log/cont/self \
  --accumulate_grad_batches 8 \
  --gradient_clip_val 1.0 \
  --max_steps 50000 \
  --warmup 1000 \
  --val_check_interval 5000 \
  --task cont \
  --gpus 1

# Common Opinion Summarization -- Self-supervision
python train.py \
  ./data/train_comm_pair_1k.jsonl \
  --default_root_dir ./log/comm/self \
  --accumulate_grad_batches 8 \
  --gradient_clip_val 1.0 \
  --max_steps 5000 \
  --warmup 100 \
  --val_check_interval 500 \
  --task comm \
  --use_pair \
  --gpus 1

To further train a model on top of the self-supervised opinion summarization model, run the following commands:

# Contrastive Summarization Model -- Few-Shot
python train.py \
  ./data/few_cont.jsonl \
  --default_root_dir ./log/cont/few \
  --accumulate_grad_batches 8 \
  --gradient_clip_val 1.0 \
  --max_steps 1000 \
  --warmup 100 \
  --val_check_interval 100 \
  --task cont \
  --ckpt log/cont/self/lightning_logs/version_0/checkpoints/ \
  --gpus 1

# Common Opinion Summarization -- Few-Shot
python train.py \
 ./data/few_comm_pair.jsonl \
 --default_root_dir ./log/comm/few \
 --accumulate_grad_batches 8 \
 --gradient_clip_val 1.0 \
 --max_steps 1000 \
 --warmup 100 \
 --val_check_interval 100 \
 --task comm \
 --use_pair \
 --ckpt log/comm/self/lightning_logs/version_0/checkpoints/ \
 --gpus 1 
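
The `--ckpt` paths above point at `version_0`, which is where the default PyTorch Lightning logger writes the first run. If training is rerun with the same `--default_root_dir`, the logger creates `version_1`, `version_2`, and so on. The sketch below picks the highest-numbered run that has a `checkpoints` directory; the helper name `latest_checkpoint_dir` is hypothetical and not part of this repository.

```python
import re
from pathlib import Path


def latest_checkpoint_dir(root_dir):
    """Return <root_dir>/lightning_logs/version_N/checkpoints for the largest N, or None."""
    logs = Path(root_dir) / "lightning_logs"
    if not logs.is_dir():
        return None
    best_n, best_dir = -1, None
    for child in logs.iterdir():
        match = re.fullmatch(r"version_(\d+)", child.name)
        ckpt_dir = child / "checkpoints"
        # Compare version numbers numerically so version_10 beats version_2.
        if match and ckpt_dir.is_dir() and int(match.group(1)) > best_n:
            best_n, best_dir = int(match.group(1)), ckpt_dir
    return best_dir


# Usage: pass the result as --ckpt for few-shot training.
# print(latest_checkpoint_dir("./log/cont/self"))
```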


Once you have trained the models for both contrastive and common opinion summarization, you can generate summaries using Co-Decoding.

# Co-Decoding for Contrastive Opinion Summarization
# The third positional argument is the directory to store generated summaries,
# and the fourth is the target model.
# --counter_model_checkpoint is the counterpart model; --alpha is a hyper-parameter.
python decode.py \
  ./data/ \
  cont \
  gen/cont/codec/ \
  log/cont/few/lightning_logs/version_0/checkpoints/ \
  --counter_model_checkpoint log/cont/few/lightning_logs/version_0/checkpoints/ \
  --alpha 0.2 \
  --top_p 0.9

# Co-Decoding for Common Opinion Summarization
# The fourth positional argument is the target model.
# --counter_model_checkpoint is the counterpart model; the contrastive
# summarization model is used in this case.
# --alpha is a hyper-parameter; --ens_method add combines the outputs by summing them up.
python decode.py \
  ./data/ \
  comm \
  gen/comm/codec \
  ./log/comm/few/lightning_logs/version_0/checkpoints/ \
  --counter_model_checkpoint log/cont/few/lightning_logs/version_0/checkpoints/ \
  --alpha 0.4 \
  --top_p 0.9 \
  --do_ens_tgt \
  --do_ens_cnt \
  --ens_method add
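
When sweeping over `--alpha` or switching checkpoints, it can be convenient to build the decode.py invocation from Python instead of editing the shell command. The sketch below uses only the arguments and flags shown above; the helper names `build_decode_command` and `run_decode` are hypothetical and not part of this repository.

```python
import subprocess
import sys


def build_decode_command(data_dir, task, out_dir, target_model,
                         counter_model, alpha, top_p, extra_flags=()):
    """Return the argv list for decode.py, in the same order as the commands above."""
    if task not in ("cont", "comm"):
        raise ValueError(f"task must be 'cont' or 'comm', got {task!r}")
    return [
        sys.executable, "decode.py",
        data_dir, task, out_dir, target_model,
        "--counter_model_checkpoint", counter_model,
        "--alpha", str(alpha),
        "--top_p", str(top_p),
        *extra_flags,
    ]


def run_decode(command):
    """Run decode.py from the repository root and raise if it exits with an error."""
    subprocess.run(command, check=True)


# Usage for the common summarization command:
# run_decode(build_decode_command(
#     "./data/", "comm", "gen/comm/codec",
#     "./log/comm/few/lightning_logs/version_0/checkpoints/",
#     "log/cont/few/lightning_logs/version_0/checkpoints/",
#     0.4, 0.9,
#     extra_flags=("--do_ens_tgt", "--do_ens_cnt", "--ens_method", "add"),
# ))
```

Passing the arguments as a list avoids shell quoting and line-continuation pitfalls.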


Finally, you can evaluate the generated summaries with all the evaluation metrics by running the following command:

# The second argument is the path of the generated contrastive summaries,
# and the third is the path of the generated common summaries.
python evaluate.py \
  ./data/ \
  ./gen/cont/codec/outputs.json \
  ./gen/comm/codec/outputs.json
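
Before running the evaluation, it may help to confirm that both output files exist and parse as JSON. This is a minimal sketch: the internal structure of outputs.json is not documented here, so the check only assumes a non-empty JSON list or object, and the helper name `check_outputs` is hypothetical.

```python
import json
from pathlib import Path


def check_outputs(path):
    """Return the number of top-level entries in a JSON file, raising if it is unusable."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"missing generated summaries: {p}")
    with p.open(encoding="utf-8") as f:
        data = json.load(f)
    # ASSUMPTION: the file holds a list or an object with at least one entry.
    if not isinstance(data, (list, dict)) or len(data) == 0:
        raise ValueError(f"no summaries found in {p}")
    return len(data)


# Usage:
# check_outputs("./gen/cont/codec/outputs.json")
# check_outputs("./gen/comm/codec/outputs.json")
```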


This repository is built on naacl2021-longdoc-tutorial.


