Code and data for the NAACL 2024 paper "This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models through Territorial Disputes".
The full dataset consists of three parts: A) the disputed territories table (a.k.a. BorderLines); B) demographic information for each country; C) the multilingual query sets for each territory.
You can obtain the dataset in one of two ways: option 1, load it from the Hugging Face datasets hub, or option 2, clone this repository.
BorderLines is available on the Hugging Face datasets hub. Load it by running:
```python
import datasets

# load disputed territories
territories = datasets.load_dataset('manestay/borderlines', 'territories')['train']

# load country demographics
countries = datasets.load_dataset('manestay/borderlines', 'countries')['train']

# load queries in 49 languages
queries = datasets.load_dataset('manestay/borderlines', 'queries')
```
Note: the above code is included in the function `load_borderlines_hf` of `run_gpt/lib.py`.
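For a quick sanity check of what was loaded, you can print the schemas (the column names are not spelled out in this README, so we print them rather than assume them):

```python
# quick inspection of the loaded BorderLines datasets
print(len(territories), territories.column_names)  # disputed territories table
print(len(countries), countries.column_names)      # country demographics
print(queries)                                      # DatasetDict of query sets
```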
In this repository, we include the data files for the default version of BorderLines (2023-05-15), which is based on the source article as of that date.
The files are:
- `disputed_territories.csv`: the main BorderLines territorial dispute table
- `countries_info.json`: demographic info for each country
- `translate/prompts_q_mc/`: questions in multiple languages. For example, `prompts.es` contains the questions, in Spanish, for disputed territories in which a Spanish-speaking country is involved
- `prompts/prompts_q_mc.txt`: multiple-choice questions, in English, for each disputed territory. This is the "control" setting, used to calculate the knowledge-base concurrence score (KB CS).
To use the cloned data files with the evaluation scripts, convert them into the datasets format:
```bash
python scripts/borderlines_to_datasets_format.py -o datasets/v1 -p prompts/prompts_q_mc.txt -td translate/terms -tp disputed_territories.csv -ip countries_info.json -pd translate/prompts_mc_q
```
If you want to reproduce the dataset, see `RECREATE.md`. You may want to do this, for example, to generate a version of BorderLines for a different date. Otherwise, skip to section III.
Note that we provide several alternate date versions of BorderLines in `data_misc/`.
NOTE: The commands below run on BorderLines v1, downloaded from the datasets hub. If you are running on a local version (i.e. cloned, or created with section II), add the argument `-dd {YOUR_DATASET_PATH}` to each command.
For GPT-3 models, we use rank classification. That is, given a query and choices A and B, we form the two full prompts {query + A, query + B}, compute the probability of each under the model, and take the more likely one as the model's response.
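A minimal sketch of this decision rule (here `sequence_logprob` is a placeholder for whatever log-probability scorer the model exposes; it is not a function from this repository):

```python
# Rank classification: pick the choice whose full prompt the model finds most likely.
# `sequence_logprob` is a stand-in for a model-specific log-probability function.
def rank_classify(query, choices, sequence_logprob):
    scores = [sequence_logprob(query + " " + choice) for choice in choices]
    return choices[scores.index(max(scores))]
```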
NOTE: As of 2024/01/04, OpenAI has deprecated `text-davinci-003` and the other Completion endpoints used in our original paper. We recommend using `davinci-002`, as shown below.
To run:
```bash
# run English and multilingual prompts
python run_gpt/run_inference_rank.py -o outputs/gpt3_dv2 -m davinci-002 --print --batch_size 50 --sleep 10 -k {OPENAI_API_KEY}
```
Depending on your rate limit for the OpenAI API, you may need to adjust `--batch_size` and `--sleep`.
For local models (BLOOM, T0, etc.), we use rank classification. This is implemented in `rank_outputs/`:
```bash
# run English and multilingual prompts
python rank_outputs/main.py -o outputs/bloomz-560m -m bigscience/bloomz-560m --batch_size 24

# run for 7b1, bloom, etc.
```
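For intuition, the sequence scoring behind rank classification with a local Hugging Face model can be sketched as follows. This plugs a concrete scorer into the decision rule above; it is an illustration of the idea, not the actual implementation in `rank_outputs/`, and the query shown is only an example:

```python
# Sketch: score each full prompt with a causal LM and keep the most likely choice.
# Illustration only -- not the implementation in rank_outputs/.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "bigscience/bloomz-560m"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

@torch.no_grad()
def sequence_logprob(text):
    """Sum of token log-probabilities of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1]      # predict token t+1 from tokens <= t
    logprobs = torch.log_softmax(logits, dim=-1)
    token_lps = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lps.sum().item()

query = "Is Machias Seal Island a territory of A) Canada or B) the United States? Answer:"
choices = [" Canada", " the United States"]
scores = [sequence_logprob(query + c) for c in choices]
print("model picks:", choices[scores.index(max(scores))])
```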
For GPT-4, we use a parsing approach: the model generates a free-form response, and we then parse its selection from the output text. This allows us to perform our prompt modification experiments.
Run on the 4 system prompt configurations:
```bash
# (`echo` prints each command for review; remove it to actually run the commands)
for PROMPT in vanilla nationalist un_peacekeeper input_demo ; do
  echo python run_gpt/run_inference.py -o outputs/gpt4/$PROMPT -m gpt-4 --system run_gpt/system_prompts/$PROMPT.txt --sleep 0
done
```
After running inference, you will have multiple response files (1 per language). Combine them into a response table by running:
```bash
# run for GPT-3
python gen_response_table.py -rd outputs/gpt3_dv2

# run for BLOOMZ 560M
python gen_response_table.py -rd outputs/bloomz-560m

# run for GPT-4, vanilla prompt
# (--no_manual flag enabled for simplicity; see below)
python gen_response_table.py -rd outputs/gpt4-0314/vanilla --no_manual

# modify args for outputs from other models and prompts
```
Note for the direct prompting experiments: for GPT-4 responses, we need to parse the answer choice from the output text. The `gen_response_table.py` script first attempts to parse it automatically. Then:
- If the flag `--no_manual` is ABSENT, the script asks the user to "Make a choice" for each response where automatic parsing fails. Read the 'response' and 'choices' fields, then select a choice {0, 1, ...}.
- If the flag `--no_manual` is PRESENT, the script instead takes whichever choice appears first in the response text, falling back to the 0-th index if none is found.
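As a rough illustration of this "first choice that appears" heuristic (not the exact logic in `gen_response_table.py`):

```python
# Return the index of the answer choice that appears earliest in a free-form
# response; fall back to index 0 if no choice is mentioned at all.
def parse_choice(response, choices):
    positions = [(response.lower().find(c.lower()), i) for i, c in enumerate(choices)]
    found = [(pos, i) for pos, i in positions if pos != -1]
    return min(found)[1] if found else 0
```

For example, `parse_choice("The island is administered by Canada.", ["Canada", "the United States"])` returns 0.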
Calculate the CS scores, as seen in Table 2 of the paper:
```bash
python calculate_CS.py outputs/gpt3_dv2/response_table.csv
python calculate_CS.py outputs/bloomz-560m/response_table.csv
python calculate_CS.py outputs/gpt4-0314/vanilla/response_table.csv

# modify args for outputs from other models and prompts
```
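For intuition, a concurrence score (CS) measures how often two sets of responses agree on the same territories. The schematic sketch below assumes each response set is a mapping from territory to the chosen country; the actual `calculate_CS.py` operates on the generated response table:

```python
# Schematic concurrence score: the fraction of territories on which two response
# sets choose the same country. Illustration only -- not calculate_CS.py itself.
def concurrence_score(responses_a, responses_b):
    shared = set(responses_a) & set(responses_b)
    agree = sum(responses_a[t] == responses_b[t] for t in shared)
    return agree / len(shared) if shared else 0.0
```

For example, comparing a model's responses against the control (knowledge-base) answers in this way corresponds to the KB CS described above.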
If you use BorderLines, please cite the paper:
```bibtex
@article{li2024land,
  title={This Land is \{Your, My\} Land: Evaluating Geopolitical Biases in Language Models through Territorial Disputes},
  author={Bryan Li and Samar Haider and Chris Callison-Burch},
  year={2024},
  journal={2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)}
}
```