Issues that arise when training the code in this repository
kisejin opened this issue · 4 comments
Hello, thanks for sharing your code.
Currently, when I run this repository, I encounter some issues, and I have found solutions that seem to work well. I am sharing them with everyone in the hope that the author will confirm them.
- The silence parameter in read_line_examples_from_file has no default value.
Solve:
def read_line_examples_from_file(data_path, silence=False):
    ...
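For reference, here is a self-contained sketch of what the patched reader might look like. The ####-separated line format and the function body are assumptions based on common ABSA-style data loaders, not the repository's exact code; the essential change is only that silence now has a default value.

```python
def read_line_examples_from_file(data_path, silence=False):
    """Read examples, one per line; assumed format: 'sentence####labels'.

    The line format here is illustrative; the actual fix is simply that
    `silence` now defaults to False instead of being a required argument.
    """
    sents, labels = [], []
    with open(data_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            words, tuples = line.split("####")
            sents.append(words.split())
            labels.append(eval(tuples))  # labels stored as a Python literal
    if not silence:
        print(f"Total examples = {len(sents)}")
    return sents, labels
```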
- Can't assign hparams in the T5FineTuner model with the new lightning >= 2.0.0:
Solve:
self.hparams.update(vars(hparams))
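The reason direct assignment fails: in lightning >= 2.0, hparams is exposed through a read-only property, so the attribute cannot be rebound, but the mapping behind it can still be mutated in place. A stand-alone sketch of that behaviour (MockModule is an illustrative stand-in, not the real LightningModule):

```python
import argparse

class MockModule:
    """Illustrative stand-in for how LightningModule guards hparams."""

    def __init__(self):
        self._hparams = {}

    @property
    def hparams(self):
        # No setter is defined, so `self.hparams = ...` raises
        # AttributeError, while mutating the returned dict works fine.
        return self._hparams

args = argparse.Namespace(learning_rate=3e-4, num_train_epochs=20)
module = MockModule()
module.hparams.update(vars(args))  # the workaround shown above
```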
- Newer lightning versions no longer support training_epoch_end and validation_epoch_end, so these hooks must be renamed with the on_ prefix (note that the training hook becomes on_train_epoch_end, and validation_epoch_end becomes on_validation_epoch_end).
- Lightning has integrated the gradient step into its 2.0 hooks, so I commented out the optimizer_step function because I observed issues with the optimizer closure.
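If you would rather keep the hook than comment it out, one option is to let the optimizer run the closure itself, which is what recent Lightning versions expect. The signature below follows the Lightning 2.x hook shape; it is a sketch under that assumption, not the repository's code:

```python
def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_closure=None):
    # Lightning 2.x expects the forward/backward closure to be executed by
    # the optimizer itself; passing it through keeps features like AMP working.
    optimizer.step(closure=optimizer_closure)
    optimizer.zero_grad()
    self.lr_scheduler.step()
```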
- Because step outputs for train and validation are now handled internally by the model, an outputs parameter can no longer be passed to on_validation_epoch_end. Therefore, I removed it and replaced it with the following code snippet, which collects the outputs manually:
import torch
import lightning as L

class MyLightningModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.validation_step_outputs = []

    def validation_step(self, ...):
        loss = ...
        self.validation_step_outputs.append(loss)
        return loss

    def on_validation_epoch_end(self):
        epoch_average = torch.stack(self.validation_step_outputs).mean()
        self.log("validation_epoch_average", epoch_average)
        self.validation_step_outputs.clear()  # free memory
- The gpus parameter no longer exists in the Lightning Trainer; instead, add the devices parameter with the value 'auto' to automatically detect the available GPUs, and set accelerator='gpu'.
Solve:
train_params = dict(
    default_root_dir=args.output_dir,
    accumulate_grad_batches=args.gradient_accumulation_steps,
    devices='auto',
    gradient_clip_val=1.0,
    max_epochs=args.num_train_epochs,
    callbacks=[LoggingCallback()],
    accelerator='gpu'
)
These are the solutions I have found online. If there are any errors, please excuse them, and feel free to provide additional feedback.
@kisejin That's nicely said.
I would just like to mention one more point:
def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, second_order_closure=None):
    if self.trainer.use_tpu:
        xm.optimizer_step(optimizer)
    else:
        optimizer.step()
    optimizer.zero_grad()
    self.lr_scheduler.step()
The use_tpu check was causing an issue as well, so I commented it out:
def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, second_order_closure=None):
    optimizer.step()
    optimizer.zero_grad()
    self.lr_scheduler.step()
@kisejin Thanks for sharing this with us.
However, I am not able to reproduce the performance reported in the paper.
On the res15 dataset, across 200 runs,
I have seen performance between 0.42XX and 0.46XX.
Is anyone else seeing a significant performance difference?
@ssoyaavv are you testing this on a CPU device?
@LawrenceMoruye I used a GPU for testing.