abertsch72/unlimiformer

Steps to run the code

sahulsumra opened this issue · 5 comments

Can you explain to me how to run this code?

Hi @sahulsumra, thanks for your interest in our work! Have you tried following the instructions for running in the readme?

Hi @abertsch72, my first problem was getting the conda environment set up from "requirements.txt". I'm not sure whether you work within a conda environment, but doing so might help to isolate exactly what needs installing.

So, aside from some packages that were missing from your "requirements.txt" file, I managed to get inference_example.py working fine. I'm working with a decent-sized GPU on the cluster, so it's fast and I'm very happy with that. (Thumbs up.)

But I had a bunch of problems with "src/run.py". When I try to run:

python src/run.py \
    src/configs/training/base_training_args.json \
    src/configs/data/gov_report.json \
    --output_dir output_train_bart_base_local/ \
    --learning_rate 1e-5 \
    --model_name_or_path facebook/bart-base \
    --max_source_length 1024 \
    --eval_max_source_length 1024 --do_eval=True \
    --eval_steps 1000 --save_steps 1000 \
    --per_device_eval_batch_size 1 --per_device_train_batch_size 2 \
    --extra_metrics bertscore

I get:

  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/run.py", line 1180, in <module>
    main()
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/run.py", line 437, in main
    seq2seq_dataset = _get_dataset(data_args, model_args, training_args)
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/run.py", line 943, in _get_dataset
    seq2seq_dataset = load_dataset(
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/load.py", line 1657, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/load.py", line 1515, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/builder.py", line 1022, in __init__
    super(GeneratorBasedBuilder, self).__init__(*args, **kwargs)
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/builder.py", line 259, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/builder.py", line 366, in _create_builder_config
    raise ValueError(f"BuilderConfig {builder_config} doesn't have a '{key}' key.")
ValueError: BuilderConfig SLEDConfig(name='gov_report', version=1.0.0, data_dir=None, data_files=None, description='\n@inproceedings{huang-etal-2021-efficient,\n    title = "Efficient Attentions for Long Document Summarization",\n    author = "Huang, Luyang  and\n      Cao, Shuyang  and\n      Parulian, Nikolaus  and\n      Ji, Heng  and\n      Wang, Lu",\n    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",\n    month = jun,\n    year = "2021",\n    address = "Online",\n    publisher = "Association for Computational Linguistics",\n    url = "https://aclanthology.org/2021.naacl-main.112",\n    doi = "10.18653/v1/2021.naacl-main.112",\n    pages = "1419--1436",\n    abstract = "The quadratic computational and memory complexities of large Transformers have limited their scalability for long document summarization. In this paper, we propose Hepos, a novel efficient encoder-decoder attention with head-wise positional strides to effectively pinpoint salient information from the source. We further conduct a systematic study of existing efficient self-attentions. Combined with Hepos, we are able to process ten times more tokens than existing models that use full attentions. For evaluation, we present a new dataset, GovReport, with significantly longer documents and summaries. Results show that our models produce significantly higher ROUGE scores than competitive comparisons, including new state-of-the-art results on PubMed. Human evaluation also shows that our models generate more informative summaries with fewer unfaithful errors.",\n}') 
doesn't have a 'verification_mode' key.

I found that this verification_mode argument is new to the Hugging Face datasets library. (The solution, for now, seems to be to go to the run.py file, comment out "verification_mode", and replace it with the soon-to-be-deprecated "ignore_verifications=True".)
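For anyone who would rather not hand-edit run.py for each datasets version, here is a minimal sketch of choosing the keyword at runtime. It assumes that newer datasets releases list verification_mode in the signature of load_dataset and that older ones list only ignore_verifications; the helper name verification_kwargs is made up for illustration and is not part of the repository.

```python
import inspect


def verification_kwargs(load_fn):
    """Return the keyword that disables dataset verification for load_fn.

    Assumption: newer `datasets` releases expose `verification_mode`, older
    ones only `ignore_verifications`; the signature tells us which applies.
    """
    params = inspect.signature(load_fn).parameters
    if "verification_mode" in params:
        return {"verification_mode": "no_checks"}
    if "ignore_verifications" in params:
        return {"ignore_verifications": True}
    return {}  # neither keyword is known: pass nothing extra


# Hypothetical usage inside run.py (not executed here):
#   from datasets import load_dataset
#   seq2seq_dataset = load_dataset("tau/sled", "gov_report",
#                                  **verification_kwargs(load_dataset))
```

Upgrading datasets to a release that understands verification_mode would presumably avoid the problem as well, without touching run.py.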

Hoping the above is useful to some.

It's strange because I would expect the following lines from src/run.py

    # Preprocessing the datasets.
    # We need to tokenize inputs and targets.
    if training_args.do_train:
        column_names = seq2seq_dataset["train"].column_names

to pick up the column names from the ccdv/govreport-summarization dataset itself. Why doesn't it?

Anyway, I got it working by rewriting your deduplicate function and instead having it assign an "id" column. Most first-time users would be using this with a standard dataset such as ccdv/govreport-summarization. So no need for deduping. I also needed to change the column_names assignment within the run.py file. Seems like a bit of work is needed to make this more streamlined and accessible.
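As a rough illustration of the "assign an id column" workaround (not the exact change made to run.py), the sketch below derives a string id from each row's index. The function name add_id is hypothetical, and the "report"/"summary" column names are assumed to match ccdv/govreport-summarization.

```python
def add_id(example, idx):
    """Return the extra column to attach: a string id built from the row index."""
    return {"id": str(idx)}


# Plain-Python demonstration of the effect on two toy rows.
rows = [
    {"report": "first report", "summary": "first summary"},
    {"report": "second report", "summary": "second summary"},
]
rows_with_ids = [{**row, **add_id(row, i)} for i, row in enumerate(rows)]

# Hypothetical usage with a Hugging Face dataset (not executed here):
#   from datasets import load_dataset
#   ds = load_dataset("ccdv/govreport-summarization")
#   ds = ds.map(add_id, with_indices=True)
```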

Ah okay, I now understand. In your original gov_report.json file:

"dataset_name": "tau/sled",
"dataset_config_name": "gov_report",
"max_source_length": 16384,
"generation_max_length": 1024,
"max_prefix_length": 0,
"pad_prefix": false,
"num_train_epochs": 10,
"metric_names": ["rouge"],
"metric_for_best_model": "rouge/geometric_mean",
"greater_is_better": true

This selects the gov_report dataset within "tau/sled". Okay, that now makes sense. I will leave the above trail for others who go down the same rabbit hole. (I am still confused about what exactly an epoch is here. Why don't I see 17.5k when I run ccdv/govreport-summarization with "num_train_epochs": 1? Instead, I see 1000/8759, 2000/8759, ...)
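A likely explanation for the 8759, sketched as arithmetic: the progress bar probably counts optimizer steps (batches), not examples. The sketch assumes the training split has 17,517 examples (the roughly 17.5k mentioned above), a single device, and no gradient accumulation, together with the --per_device_train_batch_size 2 from the command earlier in this thread.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1):
    """Number of batches needed to see every example once (last batch may be partial)."""
    effective_batch = per_device_batch_size * num_devices
    return math.ceil(num_examples / effective_batch)


# Assumed split size of 17,517 examples, batch size 2, one device.
steps = steps_per_epoch(17_517, per_device_batch_size=2)
print(steps)  # 8759
```

Under those assumptions one epoch is 8759 steps of 2 examples each, which still covers all of the roughly 17.5k training examples.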

Finally, at the end of training, I got the following error:

Traceback (most recent call last):
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/run.py", line 1181, in <module>
    main()
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/run.py", line 802, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/transformers/trainer.py", line 1984, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/transformers/trainer.py", line 2328, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/utils/custom_seq2seq_trainer.py", line 300, in evaluate
    output.metrics.update(self.compute_metrics(*eval_preds))
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/metrics/metrics.py", line 45, in __call__
    return self._compute_metrics(id_to_pred, id_to_labels)
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/metrics/metrics.py", line 60, in _compute_metrics
    result = metric(id_to_pred_decoded, id_to_labels_decoded, is_decoded=True)
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/metrics/metrics.py", line 27, in __call__
    return self._compute_metrics(id_to_pred, id_to_labels)
  File "/scratch/project_mnt/S0066/unlimiformer-orig/src/metrics/metrics.py", line 158, in _compute_metrics
    return self._metric.compute(**self.convert_from_map_format(id_to_pred, id_to_labels), **self.kwargs)
  File "/home/uqpocall/.conda/envs/unlimiformer/lib/python3.10/site-packages/datasets/metric.py", line 419, in compute
    os.remove(file_path)
FileNotFoundError: [Errno 2] No such file or directory: '/home/uqpocall/.cache/huggingface/metrics/bert_score/default/default_experiment-1-0.arrow'

Any suggestions for fixing this last one? Thanks in advance!