Synthetic Data Generation for Implicit Discourse Relation Recognition

This repository contains scripts for Synthetic Data Generation for Implicit Discourse Relation Recognition (SDG4IDRR).

Requirements

  • Python 3.9
    • poetry
      pip install poetry
    • Dependencies: see pyproject.toml
  • Java (required to install hydra)
    # command example
    wget https://download.java.net/java/GA/jdk17.0.2/dfd4a8d0985749f896bed50d7138ee7f/8/GPL/openjdk-17.0.2_linux-x64_bin.tar.gz
    tar -xf openjdk-17.0.2_linux-x64_bin.tar.gz
    mv jdk-17.0.2/ $HOME/.local/
    export PATH="$HOME/.local/jdk-17.0.2/bin:$PATH"

Set up Python Virtual Environment

git clone git@github.com:facebookresearch/hydra.git -b v1.3.2 src/hydra
git clone git@github.com:princeton-nlp/SimCSE.git -b 0.4 src/SimCSE
rsync -av patch/src/ src/
poetry install [--no-dev]

# make a .env file
echo "OPENAI_API-KEY=<OPENAI_API-KEY>" >> .env
echo "OPENAI-ORGANIZATION=<OPENAI-ORGANIZATION>" >> .env

(optional) Set up pre-commit

# pip install pre-commit
pre-commit install

Command Examples

Build Dataset
# obtain Penn Discourse Treebank Version 3.0 (cf. https://catalog.ldc.upenn.edu/LDC2019T05)

# check the help message for what the IN_ROOT argument expects
poetry run python scripts/build_pdtb3_dataset.py -h
# build PDTB3 dataset
poetry run python scripts/build_pdtb3_dataset.py \
  path/to/IN_ROOT/ \
  dataset/ \
  --aid-dir data/article_ids/
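
The build script writes JSON Lines splits under dataset/ (e.g. dataset/ji/train.jsonl, used in the commands below). A minimal sketch for inspecting the output; the schema is not spelled out here, so the sketch simply prints the fields of the first example:

# Sketch only: look at the first example of the built training split to see
# which fields it actually contains.
import json

with open("dataset/ji/train.jsonl") as f:
    first_example = json.loads(next(f))

print(sorted(first_example))  # available fields
print(first_example)
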
Rebuild Synthetic Data
# rebuild synthetic data from PDTB3 dataset and annotations
poetry run python scripts/rebuild_synthetic_data.py \
  dataset/ji/train.jsonl \
  data/annot/ \
  data/synth/filtered/

Since the synthetic data was generated using GPT-4, please refer to OpenAI's terms of use. For instance, you may not use the data to develop models that compete with OpenAI.

Compile
# compile synthetic data based on a confusion matrix
poetry run python scripts/compile.py \
  data/synth/filtered/ \
  results/run_id/dev_pred.jsonl \
  data/synth/compiled/run_id/examples.jsonl \
  [--top-k int]

# reproduce synthetic data for RoBERTa-base/large
./scripts/compile.sh [-h | --help]
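
One plausible reading of the confusion-matrix-based selection (an assumption, not taken from the script): rank the off-diagonal cells of the dev-set confusion matrix and keep the k most frequent (gold, predicted) label pairs. A rough sketch, assuming dev_pred.jsonl stores one JSON object per example with "gold" and "pred" fields; the same dev_pred.jsonl / --top-k pattern recurs in the generation and filtering commands below.

# Sketch only: count dev-set misclassifications per (gold, predicted) label
# pair and return the k most frequent pairs. Field names are assumptions.
import json
from collections import Counter

def top_k_confusions(dev_pred_path, k=3):
    confusions = Counter()
    with open(dev_pred_path) as f:
        for line in f:
            example = json.loads(line)
            if example["gold"] != example["pred"]:
                confusions[(example["gold"], example["pred"])] += 1
    return [pair for pair, _ in confusions.most_common(k)]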

Investigate Few-Shot Performance of ChatGPT
# investigate few-shot performance of ChatGPT on PDTB3 dataset
poetry run python scripts/preliminary/investigate_few-shot_performance_of_chatgpt.py \
  dataset/ji/train.jsonl \
  dataset/ji/test.jsonl \
  results/few-shot/gpt-4-0613.jsonl \
  [--dry-run]
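
At its core this step issues chat-completion requests with in-context demonstrations. A minimal illustration using the openai>=1.0 client; the prompt wording, demonstration, and label format are assumptions, not the repository's actual prompt:

# Sketch only: one few-shot classification request. Prompt and label format
# are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

demonstration = (
    "Arg1: It was raining heavily.\n"
    "Arg2: The game was cancelled.\n"
    "Sense: Contingency.Cause"
)
query = (
    "Arg1: He studied all night.\n"
    "Arg2: He passed the exam.\n"
    "Sense:"
)

response = client.chat.completions.create(
    model="gpt-4-0613",
    messages=[
        {"role": "system", "content": "Classify the implicit discourse relation between Arg1 and Arg2 with a PDTB-3 sense label."},
        {"role": "user", "content": demonstration + "\n\n" + query},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
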
Generate Candidates of Arg2
# generate candidates of Arg2 using GPT-4 based on a confusion matrix
poetry run python scripts/generate_candidates_of_arg2.py \
  dataset/ji/train.jsonl \
  results/run_id/dev_pred.jsonl \
  data/synth/unfiltered/ \
  [--top-k int] \
  [--dry-run]
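
The generation step boils down to prompting GPT-4 for an Arg2 that stands in a target sense relation to a given Arg1. A minimal illustration; the prompt wording, sense label, and sampling settings are assumptions:

# Sketch only: sample several Arg2 candidates for one (Arg1, target sense)
# pair. Prompt and sampling settings are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

arg1 = "He studied all night."
target_sense = "Contingency.Cause.Result"

response = client.chat.completions.create(
    model="gpt-4-0613",
    messages=[{
        "role": "user",
        "content": (
            f"Arg1: {arg1}\n"
            "Write Arg2 so that the implicit discourse relation between "
            f"Arg1 and Arg2 is {target_sense}."
        ),
    }],
    n=3,  # several candidates per (Arg1, sense) pair
    temperature=1.0,
)
candidates = [choice.message.content for choice in response.choices]
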
Filter Synthetic Argument Pairs
# filter synthetic argument pairs using GPT-4 based on a confusion matrix
poetry run python scripts/filter_synthetic_argument_pairs.py \
  data/synth/unfiltered/ \
  dataset/ji/train.jsonl \
  results/run_id/dev_pred.jsonl \
  data/synth/filtered/ \
  [--top-k int] \
  [--dry-run]
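
One plausible reading of the filtering step (again an assumption, not taken from the script): re-judge each synthetic pair with GPT-4 and keep it only if the judged sense matches the sense it was generated for. A sketch of just that decision, with the GPT-4 call abstracted into a hypothetical judge_sense callable:

# Sketch only: keep a synthetic pair iff the judged sense equals the target
# sense. `judge_sense` and the dict keys are hypothetical.
def filter_pairs(pairs, judge_sense):
    return [
        pair for pair in pairs
        if judge_sense(pair["arg1"], pair["arg2"]) == pair["sense"]
    ]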

Reference/Citation

@inproceedings{omura-etal-2024-empirical,
  title = "{A}n {E}mpirical {S}tudy of {S}ynthetic {D}ata {G}eneration for {I}mplicit {D}iscourse {R}elation {R}ecognition",
  author = "Omura, Kazumasa and
    Cheng, Fei and
    Kurohashi, Sadao",
  booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)",
  month = may,
  year = "2024",
  address = "Turin, Italy",
}