Steps to run the code
sahulsumra opened this issue · 5 comments
Can you explain to me how to run this code?
Hi @sahulsumra, thanks for your interest in our work! Have you tried following the running instructions in the README?
Hi @abertsch72, my first problem was getting the conda environment set up from "requirements.txt". I'm not sure whether you work within a conda environment, but doing so might help isolate exactly what needs installing.
So, aside from some packages that were absent from your "requirements.txt" file, I managed to get inference_example.py working fine. I'm working with a decent-sized GPU on the cluster, so it's fast, and I'm very happy with that. (Thumbs up.)
But I had a bunch of problems with "src/run.py" when I tried to run:
python src/run.py \
src/configs/training/base_training_args.json \
src/configs/data/gov_report.json \
--output_dir output_train_bart_base_local/ \
--learning_rate 1e-5 \
--model_name_or_path facebook/bart-base \
--max_source_length 1024 \
--eval_max_source_length 1024 --do_eval=True \
--eval_steps 1000 --save_steps 1000 \
--per_device_eval_batch_size 1 --per_device_train_batch_size 2 \
--extra_metrics bertscore
I get:
File "/scratch/project_mnt/S0066/unlimiformer-orig/src/run.py", line 1180, in <module>
main()
File "/scratch/project_mnt/S0066/unlimiformer-orig/src/run.py", line 437, in main
seq2seq_dataset = _get_dataset(data_args, model_args, training_args)
File "/scratch/project_mnt/S0066/unlimiformer-orig/src/run.py", line 943, in _get_dataset
seq2seq_dataset = load_dataset(
File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/load.py", line 1657, in load_dataset
builder_instance = load_dataset_builder(
File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/load.py", line 1515, in load_dataset_builder
builder_instance: DatasetBuilder = builder_cls(
File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/builder.py", line 1022, in __init__
super(GeneratorBasedBuilder, self).__init__(*args, **kwargs)
File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/builder.py", line 259, in __init__
self.config, self.config_id = self._create_builder_config(
File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/builder.py", line 366, in _create_builder_config
raise ValueError(f"BuilderConfig {builder_config} doesn't have a '{key}' key.")
ValueError: BuilderConfig SLEDConfig(name='gov_report', version=1.0.0, data_dir=None, data_files=None, description='\n@inproceedings{huang-etal-2021-efficient,\n title = "Efficient Attentions for Long Document Summarization",\n author = "Huang, Luyang and\n Cao, Shuyang and\n Parulian, Nikolaus and\n Ji, Heng and\n Wang, Lu",\n booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",\n month = jun,\n year = "2021",\n address = "Online",\n publisher = "Association for Computational Linguistics",\n url = "https://aclanthology.org/2021.naacl-main.112",\n doi = "10.18653/v1/2021.naacl-main.112",\n pages = "1419--1436",\n abstract = "The quadratic computational and memory complexities of large Transformers have limited their scalability for long document summarization. In this paper, we propose Hepos, a novel efficient encoder-decoder attention with head-wise positional strides to effectively pinpoint salient information from the source. We further conduct a systematic study of existing efficient self-attentions. Combined with Hepos, we are able to process ten times more tokens than existing models that use full attentions. For evaluation, we present a new dataset, GovReport, with significantly longer documents and summaries. Results show that our models produce significantly higher ROUGE scores than competitive comparisons, including new state-of-the-art results on PubMed. Human evaluation also shows that our models generate more informative summaries with fewer unfaithful errors.",\n}')
doesn't have a 'verification_mode' key.
I found that this verification_mode argument is new to Hugging Face datasets. (The workaround, for now, seems to be to open run.py, comment out "verification_mode", and replace it with the soon-to-be-deprecated "ignore_verifications=True".)
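Roughly, the edited call ends up looking like the sketch below. This is only a sketch: run.py's real load_dataset call passes more arguments, and the dataset/config names here are just the ones from gov_report.json.
from datasets import load_dataset

# Sketch of the keyword swap so that an older datasets release accepts the call.
seq2seq_dataset = load_dataset(
    "tau/sled",                        # data_args.dataset_name
    "gov_report",                      # data_args.dataset_config_name
    # verification_mode="no_checks",   # only recognized by newer datasets releases
    ignore_verifications=True,         # deprecated, but accepted by older releases
)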
Hoping the above is useful to someone.
It's strange because I would expect the following lines from src/run.py
# Preprocessing the datasets.
# We need to tokenize inputs and targets.
if training_args.do_train:
column_names = seq2seq_dataset["train"].column_names
to pick up the column names from the ccdv/govreport-summarization dataset itself. Why don't they?
Anyway, I got it working by rewriting your deduplicate function so that it just assigns an "id" column instead. Most first-time users will be running this with a standard dataset such as ccdv/govreport-summarization, so there is no need for deduplication. I also needed to change the column_names assignment within run.py. It seems a bit of work is needed to make this more streamlined and accessible.
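Roughly, the change I made looks something like this; the names are illustrative rather than the repo's actual helper code.
from datasets import load_dataset

# Sketch only: instead of deduplicating, attach a unique "id" per example
# so downstream code that expects an id column still works.
raw = load_dataset("ccdv/govreport-summarization")
for split in raw:
    raw[split] = raw[split].map(lambda example, idx: {"id": str(idx)}, with_indices=True)

# and take the column names straight from the loaded train split
column_names = raw["train"].column_names   # e.g. ["report", "summary", "id"]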
Ah okay, I now understand. Your original gov_report.json file contains:
"dataset_name": "tau/sled",
"dataset_config_name": "gov_report",
"max_source_length": 16384,
"generation_max_length": 1024,
"max_prefix_length": 0,
"pad_prefix": false,
"num_train_epochs": 10,
"metric_names": ["rouge"],
"metric_for_best_model": "rouge/geometric_mean",
"greater_is_better": true
These settings select the gov_report config within "tau/sled". Okay, that now makes sense. I will leave the above trail for others who go down the same rabbit hole. (I am still confused about what exactly an epoch is here. Why don't I see 17.5k when I run ccdv/govreport-summarization with "num_train_epochs": 1? Instead, I see 1000/8759, 2000/8759, ...)
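For what it's worth, the numbers seem to line up if the progress bar counts optimizer steps rather than examples. A rough sanity check, assuming a single GPU, the per-device train batch size of 2 from the command above, and a train split of roughly 17.5k examples:
import math

# back-of-envelope: steps per epoch = ceil(num_train_examples / effective_batch_size)
num_train_examples = 17_517            # approximate GovReport train-split size
effective_batch_size = 2               # per_device_train_batch_size * 1 GPU
print(math.ceil(num_train_examples / effective_batch_size))   # 8759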
Finally, at the end of training, I got the following error:
Traceback (most recent call last):
File "/scratch/project_mnt/S0066/unlimiformer-orig/src/run.py", line 1181, in <module>
main()
File "/scratch/project_mnt/S0066/unlimiformer-orig/src/run.py", line 802, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
return inner_training_loop(
File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/transformers/trainer.py", line 1984, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/transformers/trainer.py", line 2328, in _maybe_log_save_evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/scratch/project_mnt/S0066/unlimiformer-orig/src/utils/custom_seq2seq_trainer.py", line 300, in evaluate
output.metrics.update(self.compute_metrics(*eval_preds))
File "/scratch/project_mnt/S0066/unlimiformer-orig/src/metrics/metrics.py", line 45, in __call__
return self._compute_metrics(id_to_pred, id_to_labels)
File "/scratch/project_mnt/S0066/unlimiformer-orig/src/metrics/metrics.py", line 60, in _compute_metrics
result = metric(id_to_pred_decoded, id_to_labels_decoded, is_decoded=True)
File "/scratch/project_mnt/S0066/unlimiformer-orig/src/metrics/metrics.py", line 27, in __call__
return self._compute_metrics(id_to_pred, id_to_labels)
File "/scratch/project_mnt/S0066/unlimiformer-orig/src/metrics/metrics.py", line 158, in _compute_metrics
return self._metric.compute(**self.convert_from_map_format(id_to_pred, id_to_labels), **self.kwargs)
File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/metric.py", line 419, in compute
os.remove(file_path)
FileNotFoundError: [Errno 2] No such file or directory: '/home/uqpocall/.cache/huggingface/metrics/bert_score/default/default_experiment-1-0.arrow'
Any suggestions for fixing this last one? Thanks in advance!