Why "import sled" was commented out in run.py?
shi-kejian opened this issue · 4 comments
Hi,
Thank you again for this great effort.
As the title says, why does the current commit of run.py have
from sled import SledConfig
and
import sled # *** required so that SledModels will be registered for the AutoClasses ***
commented out?
May I ask whether these imports are no longer needed in the default setting?
File "/unlimiformer/src/unlimiformer.py", line 814, in convert_model
type_to_class[type(model)](model, *args, **kwargs)
KeyError: <class 'sled.modeling_sled.SledForConditionalGeneration'>
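For context, my guess (an assumption about the usual Auto-class registration pattern, not a claim about what the sled package actually does internally) is that importing sled matters because it has a side effect roughly like this:

# Hypothetical sketch of the registration that importing sled is expected to
# trigger, so that the AutoClasses resolve tau/bart-large-sled checkpoints to
# the SLED classes. The exact mechanism inside the sled package may differ.
from transformers import AutoConfig, AutoModelForSeq2SeqLM
from sled import SledConfig
from sled.modeling_sled import SledForConditionalGeneration

AutoConfig.register(SledConfig.model_type, SledConfig)
AutoModelForSeq2SeqLM.register(SledConfig, SledForConditionalGeneration)

The KeyError itself looks like a separate problem: the traceback shows convert_model looking up type(model) in a type_to_class dict, and a SLED-wrapped model apparently has no entry there, so it would only surface when loading a SLED checkpoint.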
Hi @shi-kejian ,
Thank you for your interest in our work!
We based our implementation on SLED's code, so there might be some leftovers from SLED.
What command line were you running when you encountered this error?
Best,
Uri
Oh. Thank you. I am actually following up on #49
I'm trying to use either facebook/bart-base or sled on my local dataset.
python src/run.py
src/configs/training/base_training_args.json
src/configs/data/my_own_data.json
--model_name_or_path tau/bart-large-sled
--use_auth_token false
--overwrite_cache
--output_dir output_train_bart_large_meeting_oracle/
--overwrite_output_dir
--max_source_length 1024
--eval_max_source_length 999999
--generation_max_length 640
--max_target_length 640
--max_prefix_length 96
--do_eval=True
--learning_rate 1e-5
--per_device_eval_batch_size 1
--per_device_train_batch_size 2
--unlimiformer_training=True
--test_unlimiformer
--eval_steps 30 --save_steps 30
--num_train_epochs 10
--metric_names rouge
--extra_metrics bertscore
--metric_for_best_model bertscore
--fp16
My data.json:
{
    "dataset_name": "<my_hf_handle_dir_with_train_dev_test>",
    "dataset_config_name": "default",
    "max_source_length": 16384,
    "generation_max_length": 640,
    "max_prefix_length": 96,
    "pad_prefix": true,
    "num_train_epochs": 10,
    "input_column": "Article",
    "input_prefix_column": "Query",
    "output_column": "Summary",
    "metric_names": ["rouge"],
    "metric_for_best_model": "rouge/geometric_mean",
    "greater_is_better": true
}
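A side note on the invocation above: the positional JSON files and the CLI flags overlap on some keys (for example, max_source_length is 1024 on the command line but 16384 in the data config), so precedence matters. I'm not sure how run.py resolves this; the following is only a hypothetical sketch of the general pattern where later sources (CLI flags) override earlier ones (JSON files), not the actual run.py logic:

# Purely illustrative: merge positional JSON configs with --key value,
# --key=value, and bare --switch flags, with CLI values winning over JSON values.
import json
import sys

def collect_args(argv):
    merged = {}
    rest = []
    for arg in argv:
        if arg.endswith(".json"):                  # base_training_args.json, then the data config
            with open(arg) as f:
                merged.update(json.load(f))
        else:
            rest.append(arg)
    i = 0
    while i < len(rest):
        arg = rest[i]
        if not arg.startswith("--"):               # skip stray positionals
            i += 1
            continue
        key = arg.lstrip("-")
        if "=" in key:                             # --do_eval=True style
            key, value = key.split("=", 1)
        elif i + 1 < len(rest) and not rest[i + 1].startswith("--"):
            value = rest[i + 1]                    # --max_source_length 1024 style
            i += 1
        else:
            value = True                           # bare switch such as --overwrite_cache
        merged[key] = value                        # later (CLI) values override JSON ones
        i += 1
    return merged

# e.g. collect_args(sys.argv[1:]) for the command above (values stay strings here)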
Thank you!
- Do you manage to run with the existing datasets, e.g., GovReport?
- If you copy the gov_report.json to my_data.json and just change the variables at my_data.json to point to your datasets, does it work?
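Something along these lines should do it (a sketch only; the config path and column names are assumptions based on what is shown in this thread):

# Start from the known-good gov_report.json and change only the
# dataset-specific fields; paths assume the layout shown earlier in the thread.
import json

with open("src/configs/data/gov_report.json") as f:
    cfg = json.load(f)

cfg.update({
    "dataset_name": "<my_hf_handle_dir_with_train_dev_test>",
    "dataset_config_name": "default",
    "input_column": "Article",
    "input_prefix_column": "Query",
    "output_column": "Summary",
})

with open("src/configs/data/my_data.json", "w") as f:
    json.dump(cfg, f, indent=4)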
Thank you.
It turns out that adding the --fp16 flag breaks training:
File "/ext3/miniconda3/lib/python3.11/site-packages/accelerate/utils/operations.py", line 569, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
Sticking with fp32 can circumvent the issue.
Do you manage to run with the existing datasets, e.g., GovReport?
Running the quickstart for GovReport actually broke for me.
/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
reproduce.bash: line 11: 2960178 Segmentation fault (core dumped)
If you copy the gov_report.json to my_data.json and just change the variables at my_data.json to point to your datasets, does it work?
For single GPU:
File "unlimiformer/src/run.py", line 802, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2665, in training_step
self.accelerator.backward(loss)
...
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 244, in backward
tensors = ctx.saved_tensors
^^^^^^^^^^^^^^^^^
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
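For what it's worth, that RuntimeError is PyTorch's generic complaint about reusing freed autograd state; a minimal standalone repro (unrelated to unlimiformer, just to show what the message means) is:

# Minimal reproduction of the same RuntimeError: the first backward() frees the
# graph's saved tensors, so a second backward() over the same graph fails
# unless retain_graph=True was passed.
import torch

x = torch.ones(3, requires_grad=True)
loss = (x * 2).sum()
loss.backward()
loss.backward()  # RuntimeError: Trying to backward through the graph a second time

In my traceback it goes through torch/utils/checkpoint.py, so gradient checkpointing is in the mix when the saved tensors get freed.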
For multi-GPU:
It again hits #49
So on a single GPU it does at least start the backward pass.
So for 1. (reproducing standard fine-tuning on GovReport), I got a segmentation fault. Is it just me, or is anyone else hitting this? Do you happen to have encountered it before?
Thank you very much!