This repo contains the code and data for the paper:
Analyzing the Role of Semantic Representations in the Era of Large Language Models (2023)
Zhijing Jin*, Yuen Chen*, Fernando Gonzalez Adauto*, Jiayi Zhang, Jiarui Liu, Julian Michael, Bernhard Schölkopf, Mona Diab (*: Co-first author)
- `code/`: Contains the code for Tasks 0-8 described below.
- `data/`: For the source data, please download the data files from this Google Drive folder (containing the CSVs for all the datasets) to the local `data/` folder. The existing files in the local `data/` folder contain the AMRs of all datasets parsed using AMR3-structbart-L, the text input for prompt generation, and the input for Task 2 and the default Task 6.
We use the transition-amr-parser library to get AMRs from sentences. The script to get the AMRs can be found in `code/predict_amr.py`.
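For reference, a minimal sketch of how the parser can be invoked (following the usage shown in the transition-amr-parser README; the full pipeline we use lives in `code/predict_amr.py`):

```python
from transition_amr_parser.parse import AMRParser

# Download/load the AMR3-structbart-L checkpoint (cached after the first call)
parser = AMRParser.from_pretrained('AMR3-structbart-L')

# Tokenize and parse a single sentence
tokens, positions = parser.tokenize('The boy wants the girl to believe him.')
annotations, machines = parser.parse_sentence(tokens)

# Get the AMR graph in Penman notation
amr = machines.get_amr()
print(amr.to_penman(jamr=False, isi=True))
```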
To use the `efficiency` package, which automatically saves GPT queries into a cache, run the following code:
pip install efficiency
The script `code/general_request_chatbot.py` calls the OpenAI API to obtain the LLMs' inference performance for the selected task.
- Pass the input data file, the AMR file, the dataset, the `--amr_cot` flag, and the model version as arguments to the script. For example:
python code/general_request_chatbot.py --data_file data/classifier_inputs/updated_data_input_classifier_input.csv --amr_file data/corrected_amrs.csv --dataset logic --amr_cot --model_version gpt4
To get the LLMs' responses on the SPIDER dataset, run the following code:
python code/general_request_spider.py --amr_cot --model_version gpt4
- The outputs are stored in a CSV file at `data/outputs/{model_version}/requests_direct_{dataset}.csv` (a quick way to inspect them is sketched after this list).
- To get the results for all the datasets, run the following code:
python code/eval_gpt.py --data_file {file_to_evaluate} --dataset {dataset}
For example:
python code/eval_gpt.py --data_file data/outputs/gpt-4-0613/requests_direct_logic.csv --dataset logic
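To take a quick look at the raw model outputs referenced above, you can load the CSV with pandas; a minimal sketch (the exact columns saved depend on the request script):

```python
import pandas as pd

# Path follows the output convention above; adjust model_version/dataset as needed
df = pd.read_csv('data/outputs/gpt-4-0613/requests_direct_logic.csv')
print(df.columns.tolist())  # see which fields the request script saved
print(df.head())
```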
To train a binary classifier to predict when AMRs help and when LLMs fail,
- Install the required packages:
pip install -r code/BERTBinaryClassification/requirements.txt
- Download this data folder from Google Drive and put it under the `code/BERTBinaryClassification` directory.
- Run `code/BERTBinaryClassification/train.ipynb` (a minimal sketch of the fine-tuning setup is shown below).
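For orientation, here is a minimal sketch of the kind of BERT fine-tuning the notebook performs (illustrative only: the model name, file path, column names, and hyperparameters are assumptions, not the exact contents of `train.ipynb`):

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical input: a CSV with a text column and a binary label
# (1 = the AMR helped, 0 = it did not)
df = pd.read_csv('code/BERTBinaryClassification/data/train.csv')
dataset = Dataset.from_pandas(df)

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=256)

dataset = dataset.map(tokenize, batched=True)
dataset = dataset.rename_column('label', 'labels')
dataset = dataset.train_test_split(test_size=0.1)

args = TrainingArguments(output_dir='bert_amr_helpfulness',
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset['train'], eval_dataset=dataset['test'])
trainer.train()
```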
We generate the features from the Text Characterization Toolkit (Simig et al., 2022; this repo), as well as our own proposed features. (In the current implementation, we assume the text-characterization-toolkit repo sits next to this repo, i.e., at `../text-characterization-toolkit`.)
python code/get_features.py --dataset paws --output_dir ../data/featured
We combine all datasets into one CSV file and compute the correlation between linguistic features (keeping only features that are present for >90% of the data) and AMR helpfulness.
python code/combine_features.py
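A minimal sketch of the correlation step (the real logic lives in `code/combine_features.py`; the combined file name and the `amr_helpful` column are assumptions):

```python
import pandas as pd

# Hypothetical combined file: one row per example, feature columns plus a binary
# 'amr_helpful' column indicating whether adding the AMR improved the prediction
df = pd.read_csv('data/featured/combined.csv')

feature_cols = [c for c in df.columns if c != 'amr_helpful']
# Keep only features that are present (non-missing) for >90% of the examples
kept = [c for c in feature_cols if df[c].notna().mean() > 0.9]

# Pearson correlation between each kept feature and AMR helpfulness
corr = df[kept].corrwith(df['amr_helpful'])
print(corr.sort_values(ascending=False))
```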
We fit traditional machine learning methods, such as logistic regression, decision tree, random forest, XGBoost, and ensemble models, to predict AMR helpfulness using linguistic features:
python code/train_basics.py
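Conceptually, `code/train_basics.py` does something along these lines (a minimal sketch; the input file and target column are assumptions):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Hypothetical combined feature file with a binary 'amr_helpful' target
df = pd.read_csv('data/featured/combined.csv').dropna()
X, y = df.drop(columns=['amr_helpful']), df['amr_helpful']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'logreg': LogisticRegression(max_iter=1000),
    'tree': DecisionTreeClassifier(),
    'forest': RandomForestClassifier(),
    'xgb': XGBClassifier(),
}
# Simple majority-vote ensemble over the individual models
models['ensemble'] = VotingClassifier(estimators=list(models.items()))

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```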
To run the AMR ablation, which cuts the specified column (amr or text) at a given ratio, run the following code:
python amr_cot_ablation.py --dataset entity_recog_gold --cut_col amr --ratio 0.5 --output_dir data/ablation --model_version gpt-4-0613
The output is stored in a CSV file at `{output_dir}/{dataset}_{model_version}_{cut_col}.csv`.
To plot the results, run the following code:
python code/plot_ablation.py --data_file ./data/ablation/entity_recog_gold_gpt-4-0613_text.csv --cut_col amr
The plot is stored in `data/ablation/{dataset}_{model_version}_{cut_col}.png`.
The summary CSV is stored in `data/ablation/{dataset}_{model_version}_{cut_col}_summary.csv`.
As an intermediate step of constructing the GoldAMR-ComposedSlang dataset, we let gpt-3.5-turbo-0613 identify candidate slang usages:
python create_slang.py
We annotate 50 samples from the PAWS dataset and ask human annotators to evaluate the correctness of the LLMs' reasoning over AMRs based on the following criteria:
- The commonalities and differences between the two AMRs are correctly identified.
- Drawing on the commonalities and differences, the LLMs can correctly infer the relationship between the two sentences.
The annotation results can be found here.
For coding and data questions,
- Please first open a GitHub issue.
- If you want a more speedy response, please link your GitHub issue when emailing any of the student authors on this paper: Yuen Chen, Fernando Gonzalez, and Jiarui Liu.
- We will reply to your email and directly answer on the GitHub issue, so more people can benefit if they have similar questions.
For future collaborations or further requests,
- Feel free to email Zhijing Jin and Yuen Chen.