Pento-DIARef: A Diagnostic Dataset for Learning the Incremental Algorithm for Referring Expression Generation from Examples
We present a Diagnostic dataset of IA References in a Pentomino domain (Pento-DIARef) that ties extensional and intensional definitions more closely together, insofar as the latter is the generative process creating the former.
We create a novel synthetic dataset of examples that pairs visual scenes with generated referring expressions; examine two variants of the dataset, representing two different ways to exemplify the underlying task; and evaluate an LSTM-based baseline, a Transformer, and a modified Transformer with region embeddings on both variants.
NLP tasks are typically defined extensionally through datasets containing example instantiations (e.g., pairs of image i and text t), but motivated intensionally through capabilities invoked in verbal descriptions of the task (e.g., "t is a description of i, for which the content of i needs to be recognised and understood"). We present Pento-DIARef, a diagnostic dataset in a visual domain of puzzle pieces where referring expressions are generated by a well-known symbolic algorithm (the "Incremental Algorithm"), which itself is motivated by appeal to a hypothesised capability (eliminating distractors through application of Gricean maxims). Our question then is whether the extensional description (the dataset) is sufficient for a neural model to pick up the underlying regularity and exhibit this capability given the simple task definition of producing expressions from visual inputs. We find that a model supported by a vision detection step and a targeted data generation scheme achieves an almost perfect BLEU@1 score and sentence accuracy, whereas simpler baselines do not.
@inproceedings{sadler-2023-pento-diaref,
title = "Pento-DIARef: A Diagnostic Dataset for Learning the Incremental Algorithm for Referring Expression Generation from Examples",
author = "Sadler, Philipp and Schlangen, David",
booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
month = "may",
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
}
@misc{sadler-2023-pento-diaref-dataset,
title = "Pento-DIARef: A Diagnostic Dataset for Learning the Incremental Algorithm for Referring Expression Generation from Examples",
author = "Sadler, Philipp and Schlangen, David",
year = "2023",
publisher = "Zenodo",
doi = "10.5281/zenodo.7625619",
howpublished= "\url{https://doi.org/10.5281/zenodo.7625619}"
}
This section provides a step-by-step guide on how to use the provided scripts and sources.
Check out the repository.
Install the requirements:
pip install -r requirements.txt
For all commands we assume that you are in the top-level project directory and have executed beforehand:
source prepare_path.sh
Create the data directory at a path of your choice.
mkdir -p /data/pento_diaref/didact
And copy the required files into the directory:
cp resources/* /data/pento_diaref/didact
Then execute the script:
python3 scripts/generate_annos_didactic.py \
--data_dir /data/pento_diaref/didact \
--train_num_sets_per_utterance_type 10 \
--test_num_sets_per_utterance_type 1 \
--gid_start 0 \
--seed 42
This will create 148,400/10,000/10,000 in-distribution samples for training/validation/testing and 756/840/840 out-of-distribution samples (holdout target piece symbols) for the color, position and utterance type generalization tests.
The script additionally filters out training samples where the extra target selection accidentally produced a sample whose utterance type is reserved for the uts-holdout. Hence the remaining number of in-distribution training samples is likely between 120k and 130k.
Note: During training, we only use the in-distribution validation samples for model selection.
Create the data directory at a path of your choice.
mkdir -p /data/pento_diaref/naive
And copy the required files into the directory:
cp resources/* /data/pento_diaref/naive
Then execute the script:
python3 scripts/generate_annos_naive.py \
--data_dir /data/pento_diaref/naive \
--with_ho \
--gid_start 1_000_000 \
--seed 42
This will create 148,400/10,000/10,000 in-distribution samples for training/validation/testing using the same target piece symbols as above. For generalization testing we use the holdout splits generated above.
Note: The holdout computation is deterministic and depends only on the order of the color, shape and position listings, because we use itertools.product(colors, shapes, positions). Thus, the target piece symbols seen during training are the same as for DIDACT.
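For intuition, here is a minimal sketch of that determinism (the value lists are shortened placeholders, not the actual resource listings): reordering any of the input lists changes which combinations come first, and hence which ones land in the holdout.
from itertools import product

# product() enumerates combinations in a fixed, input-order-dependent way.
colors = ["red", "orange"]                  # shortened placeholder listings
shapes = ["P", "V"]
positions = ["top center", "right center"]

symbols = list(product(colors, shapes, positions))
print(symbols[0])   # ('red', 'P', 'top center')
print(symbols[-1])  # ('orange', 'V', 'right center')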
We briefly check the number of target piece symbols contained in the in-distribution samples. These might be a bit lower for the DIDACT training data, because we removed unintended samples for the uts-holdout. Overall, the numbers should not vary much between DIDACT and NAIVE (ideally, the difference is zero).
python3 scripts/generate_annos_check.py \
--didact_dir /data/pento_diaref/didact \
--naive_dir /data/pento_diaref/naive
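For illustration, a minimal sketch of such a check, assuming the annotations are stored as JSON lists of records shaped like the sample annotation shown below (the data_train.json file name is an assumption):
import json
from collections import Counter

def target_symbols(annos_path):
    # Count the (color, shape, position) symbols of the target pieces;
    # each record stores its pieces and the index of the target piece.
    with open(annos_path) as f:
        annos = json.load(f)
    counts = Counter()
    for anno in annos:
        color, shape, position, _rotation = anno["pieces"][anno["target"]]
        counts[(color, shape, position)] += 1
    return counts

didact = target_symbols("/data/pento_diaref/didact/data_train.json")
naive = target_symbols("/data/pento_diaref/naive/data_train.json")
print(len(didact), len(naive))  # number of distinct target piece symbols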
The generation process takes about an hour (more or less depending on the machine).
python3 scripts/generate_images_didactic.py \
--data_dir /data/pento_diaref/didact \
--image_size 224 224 \
--category_name all \
--seed 42
The generation process takes about an hour (more or less depending on the machine).
python3 scripts/generate_images_naive.py \
--data_dir /data/pento_diaref/naive \
--image_size 224 224 \
--seed 42
{'id': 148294,
'group_id': 37073,
'size': 6,
'pieces': [('orange', 'Y', 'top center', 0),
('orange', 'Z', 'top right', 180),
('orange', 'W', 'bottom center', 90),
('orange', 'V', 'right center', 0),
('orange', 'V', 'bottom right', 0),
('orange', 'P', 'top center', 180)],
'target': 3,
'refs': [{'user': 'ia',
'instr': 'Take the V in the right center',
'type': 5,
'sent_type': 1266,
'props': {'shape': 'V', 'rel_position': 'right center'}}],
'bboxes': [[82, 112, 44, 59],
[164, 186, 37, 59],
[134, 156, 171, 194],
[156, 179, 112, 134],
[194, 216, 179, 201],
[126, 141, 67, 89]],
'global_id': 0,
'split_name': 'data_train'
}
Note: The group_id points to the image in the hdf5 file, and target indexes into pieces and bboxes: here pieces[3] is the orange V in the right center, which matches the generated instruction.
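For illustration, a hypothetical sketch of fetching the image for the sample annotation above; the file path and the "images" dataset name are assumptions, so adapt them to the layout produced by the image generation scripts.
import h5py

# Fetch the image belonging to the sample annotation above,
# assuming one image per group_id index in an "images" dataset.
with h5py.File("/data/pento_diaref/didact/data_train.hdf5", "r") as f:
    image = f["images"][37073]  # indexed by the annotation's group_id
print(image.shape)  # e.g. (224, 224, 3) for the 224x224 images above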
The models will be saved to saved_models in the project folder.
The data mode sequential_generation is assumed (and should not be changed):
python3 scripts/train_classifier_vse.py \
--data_dir /data/pento_diaref/didact \
--logdir /cache/tensorboard-logdir \
--gpu 7 \
--model_name classifier-vse-didact \
--batch_size 24 \
--d_model 512
The data mode sequential_generation is assumed (and should not be changed):
python3 scripts/train_classifier_vse.py \
--data_dir /data/pento_diaref/naive \
--logdir /cache/tensorboard-logdir \
--gpu 6 \
--model_name classifier-vse-naive \
--batch_size 24 \
--d_model 512
We use the data mode sequential_generation:
python3 scripts/train_transformer.py \
--data_dir /data/pento_diaref/didact \
--logdir /cache/tensorboard-logdir \
--gpu 4 \
--model_name transformer-vse-didact \
--data_mode sequential_generation \
--batch_size 24 \
--d_model 512 \
--dim_feedforward 1024 \
--num_encoder_layers 3 \
--num_decoder_layers 3 \
--n_head 4 \
--dropout 0.2
We use the data mode sequential_generation:
python3 scripts/train_transformer.py \
--data_dir /data/pento_diaref/naive \
--logdir /cache/tensorboard-logdir \
--gpu 3 \
--model_name transformer-vse-naive \
--data_mode sequential_generation \
--batch_size 24 \
--d_model 512 \
--dim_feedforward 1024 \
--num_encoder_layers 3 \
--num_decoder_layers 3 \
--n_head 4 \
--dropout 0.2
We use the data mode default_generation:
python3 scripts/train_transformer.py \
--data_dir /data/pento_diaref/didact \
--logdir /cache/tensorboard-logdir \
--gpu 2 \
--model_name transformer-didact \
--data_mode default_generation \
--batch_size 24 \
--d_model 512 \
--dim_feedforward 1024 \
--num_encoder_layers 3 \
--num_decoder_layers 3 \
--n_head 4 \
--dropout 0.2
We use the data mode default_generation:
python3 scripts/train_transformer.py \
--data_dir /data/pento_diaref/naive \
--logdir /cache/tensorboard-logdir \
--gpu 1 \
--model_name transformer-naive \
--data_mode default_generation \
--batch_size 24 \
--d_model 512 \
--dim_feedforward 1024 \
--num_encoder_layers 3 \
--num_decoder_layers 3 \
--n_head 4 \
--dropout 0.2
The data mode default_generation is assumed (and should not be changed):
python3 scripts/train_lstm.py \
--data_dir /data/pento_diaref/didact \
--logdir /cache/tensorboard-logdir \
--gpu 0 \
--gpu_fraction 0.3 \
--model_name lstm-didact \
--batch_size 24 \
--lstm_hidden_size 1024 \
--word_embedding_dim 512 \
--dropout 0.5
The data mode default_generation is assumed (and should not be changed):
python3 scripts/train_lstm.py \
--data_dir /data/pento_diaref/naive \
--logdir /cache/tensorboard-logdir \
--gpu 0 \
--gpu_fraction 0.3 \
--model_name lstm-naive \
--batch_size 24 \
--lstm_hidden_size 1024 \
--word_embedding_dim 512 \
--dropout 0.5
Choose the best model for each case and move it to the saved_models folder.
We used the model with the highest BLEU score and, if scores were equal, the one trained for the most epochs.
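As a minimal sketch of this selection rule (the (epoch, bleu) pairs below are hypothetical placeholders):
# Highest BLEU wins; ties are broken in favor of the checkpoint
# trained for more epochs.
checkpoints = [(18, 0.97), (22, 0.99), (30, 0.99)]
best_epoch, best_bleu = max(checkpoints, key=lambda c: (c[1], c[0]))
print(best_epoch, best_bleu)  # -> 30 0.99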
python3 scripts/evaluate_model.py \
--data_dir /data/pento_diaref/didact \
--results_dir results \
--model_dir saved_models \
--model_name <model_name> \
--stage_name <stage_name> \
--gpu 0
python3 scripts/evaluate_model.py \
--data_dir /data/pento_diaref/didact \
--model_name lstm-naive \
--stage_name test \
--gpu 0 \
--gpu_fraction 0.3
python3 scripts/evaluate_model.py \
--data_dir /data/pento_diaref/didact \
--model_name transformer-naive \
--stage_name test \
--gpu 0
python3 scripts/evaluate_model.py \
--data_dir /data/pento_diaref/didact \
--model_name transformer-vse-naive \
--stage_name test \
--gpu 0
python3 scripts/evaluate_model.py \
--data_dir /data/pento_diaref/didact \
--model_name classifier-vse-naive \
--stage_name test \
--gpu 0
python3 scripts/evaluate_model.py \
--data_dir /data/pento_diaref/didact \
--model_name lstm-didact \
--stage_name test \
--gpu 0 \
--gpu_fraction 0.3
python3 scripts/evaluate_model.py \
--data_dir /data/pento_diaref/didact \
--model_name transformer-didact \
--stage_name test \
--gpu 0
python3 scripts/evaluate_model.py \
--data_dir /data/pento_diaref/didact \
--model_name transformer-vse-didact \
--stage_name test \
--gpu 0
python3 scripts/evaluate_model.py \
--data_dir /data/pento_diaref/didact \
--model_name classifier-vse-didact \
--stage_name test \
--gpu 0
python3 scripts/evaluate_model.py \
--data_dir /data/pento_diaref/didact \
--model_name transformer-vse-didact \
--stage_name test \
--ablation_mode replace_random \
--gpu 0
python3 scripts/evaluate_model.py \
--data_dir /data/pento_diaref/didact \
--model_name transformer-vse-didact \
--stage_name test \
--ablation_mode random_types \
--gpu 0
python3 scripts/evaluate_model.py \
--data_dir /data/pento_diaref/didact \
--model_name transformer-vse-didact \
--stage_name test \
--ablation_mode random_regions \
--gpu 0
We need to reference the data directory to load the annotations and look up the category names.
python3 scripts/evaluate_results.py \
--data_dir /data/pento_diaref/didact \
--results_dir results \
--stage_name test
The original results can be found in the folder original_results.