- Python 3.10
- CUDA 11.7
- `pip install -r requirements.txt`
- CoNLL evaluation script saved as `conlleval` in the project directory
The disaggregated span annotation data must be stored in a DataFrame with at least 6 columns:

- `Sample_Text`, which is the problem description text;
- `Type`, which must always equal the string `"highlight"` (otherwise the row will be ignored);
- `Area_of_Law`, which is an area of law assigned to the problem description;
- `User_ID`, which is the unique identifier of the expert annotator;
- `Tagged_Text`, which is the annotated span of text in the problem description; and
- `Tagged_Text_Start`, which is the start index (0-based) of the annotated span in the problem description.
An example DataFrame is
```python
pd.DataFrame(
    {
        "Sample_Text": [
            "foo\nbar",
            "foo\nbar",
            "bar",
            "bar",
            "bar",
            "baz",
            "baz",
        ],
        "Type": [
            "highlight",
            "highlight",
            "highlight",
            "highlight",
            "highlight",
            "highlight",
            "highlight",
        ],
        "Area_of_Law": ["A", "B", "B", "A", "A", "C C", "C C"],
        "User_ID": [1, 1, 1, 2, 1, 3, 3],
        "Tagged_Text": ["foo", "bar", "ba", "ar", "r", "b", "az"],
        "Tagged_Text_Start": [0, 4, 0, 1, 2, 0, 1],
    }
)
```
This DataFrame contains the span annotations of 3 problem descriptions. The first description is

```
foo
bar
```

which was assigned to areas of law "A" and "B" by Expert 1, who annotated the words `foo` and `bar` respectively to support the assignments. The second description is `bar`, assigned to areas of law:

- "A" and "B" by Expert 1, who annotated the character `r` and the span of characters `ba` respectively to support their assignments; and
- "A" by Expert 2, who annotated `ar` to support their assignment.

The third description is `baz`, assigned to area of law "C C" by Expert 3, who annotated the spans `b` and `az` to support the assignment.
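The dataset creation script introduced below reads this DataFrame from a pickle file. A minimal sketch of preparing such a file with pandas (the file name is a placeholder, and the DataFrame is abbreviated to the third description only):

```python
import pandas as pd

# Abbreviated example: only the annotations of the third description ("baz").
df = pd.DataFrame(
    {
        "Sample_Text": ["baz", "baz"],
        "Type": ["highlight", "highlight"],
        "Area_of_Law": ["C C", "C C"],
        "User_ID": [3, 3],
        "Tagged_Text": ["b", "az"],
        "Tagged_Text_Start": [0, 1],
    }
)

# Save as a pickle file, the input format expected by create_span_prediction_dataset.py.
df.to_pickle("dataframe.pkl")
```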
To create a 20-fold cross-validated dataset, run

```
./create_span_prediction_dataset.py -k 20 --dev 0.1 --shuffle /path/to/dataframe.pkl /path/to/output/dir
```

which shuffles the data before splitting and also samples 10% of the training portion of each fold to create a development set. The script creates 20 directories named 0, 1, and so on under the output directory, each corresponding to a fold. Each directory contains `{train,dev,test}.jsonl` files with the span-annotated examples.
The gold spans in `{dev,test}.jsonl` are majority-voted. To allow evaluation with the best-matched spans, get all the disaggregated annotations with

```
./create_span_prediction_dataset.py /path/to/dataframe.pkl /path/to/output/dir
```

which saves them in the `/path/to/output/dir/all.jsonl` file.
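The output files are in JSON Lines format, so they can be inspected with standard tooling. A small sketch (the path is a placeholder, and the exact record fields are not assumed here):

```python
import json
from pathlib import Path

# Peek at the first few records of a fold's training file to see which fields they contain.
fold_dir = Path("output/0")  # placeholder: fold 0 under the output directory
with open(fold_dir / "train.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(sorted(record))
        if i >= 2:
            break
```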
To train the BERT-based span predictor model, run

```
./run_span_predictor.py with data_dir=/path/to/fold/dir method=<method>
```

where `/path/to/fold/dir` contains the `{train,dev,test}.jsonl` files, and `<method>` is `Random`, `ReL`, or `MV` for the random baseline, repeated labelling, and majority voting methods respectively.
To train the DeBERTaV3-based model, run

```
./run_span_predictor.py with deberta_v3_base data_dir=/path/to/fold/dir method=<method>
```

By default, the command saves the trained models, predictions, etc. in the `artifacts` directory. Run

```
./run_span_predictor.py print_config
```

to see other configurable settings.
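To train on every fold of the cross-validated dataset, one option is a small driver script along the following lines (a sketch only; the output directory path and the choice of `MV` are placeholders):

```python
import subprocess
from pathlib import Path

output_dir = Path("/path/to/output/dir")  # directory created by create_span_prediction_dataset.py

# Fold directories are named 0, 1, ..., 19; train the BERT-based model on each of them.
fold_dirs = sorted(
    (p for p in output_dir.iterdir() if p.is_dir() and p.name.isdigit()),
    key=lambda p: int(p.name),
)
for fold_dir in fold_dirs:
    subprocess.run(
        ["./run_span_predictor.py", "with", f"data_dir={fold_dir}", "method=MV"],
        check=True,
    )
```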
To perform evaluation with the `conlleval` script, run

```
./run_span_eval.py with pred_path=/path/to/artifacts/dir/test.tsv
```

By default, the script will run span-level evaluation against the majority-voted gold spans. To run word-level evaluation instead, pass `is_raw=True` as an argument:

```
./run_span_eval.py with pred_path=/path/to/artifacts/dir/test.tsv is_raw=True
```

To evaluate against the best-matched gold spans, specify the disaggregated span annotations as `disagg_test_path`:

```
./run_span_eval.py with pred_path=/path/to/artifacts/dir/test.tsv disagg_test_path=/path/to/output/dir/all.jsonl
```
Both the `run_span_predictor.py` and `run_span_eval.py` scripts use Sacred and have its `MongoObserver` activated by default. Set the `SACRED_MONGO_URL` (and optionally `SACRED_DB_NAME`) environment variable(s) to write experiment runs to a MongoDB instance. For example, set `SACRED_MONGO_URL=mongodb://localhost:27017` if the MongoDB instance is listening on port 27017 on the local machine.
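As a sketch, the same environment variables can also be set programmatically when launching a run (the URL, database name, and command arguments below are placeholders):

```python
import os
import subprocess

env = dict(os.environ)
env["SACRED_MONGO_URL"] = "mongodb://localhost:27017"  # local MongoDB instance
env["SACRED_DB_NAME"] = "span_prediction"  # optional; placeholder database name

# Launch a training run whose Sacred MongoObserver writes to the MongoDB instance above.
subprocess.run(
    ["./run_span_predictor.py", "with", "data_dir=/path/to/fold/dir", "method=MV"],
    env=env,
    check=True,
)
```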
Tests can be run with the command `pytest`. By default, slow tests are excluded. To run them, pass `-m slow` as an argument.
If you use our work, please cite:
```bibtex
@inproceedings{kurniawan2024,
    title = {To {{Aggregate}} or {{Not}} to {{Aggregate}}. {{That}} Is the {{Question}}: {{A Case Study}} on {{Annotation Subjectivity}} in {{Span Prediction}}},
    booktitle = {Proceedings of the 14th {{Workshop}} on {{Computational Approaches}} to {{Subjectivity}}, {{Sentiment}}, \& {{Social Media Analysis}}},
    author = {Kurniawan, Kemal and Mistica, Meladel and Baldwin, Timothy and Lau, Jey Han},
    year = {2024},
    pages = {362--368},
}
```