Issues that arise when training the code in this repository
kisejin opened this issue · 4 comments
Hello, thanks for sharing your code.
Currently, when I run this repository, I encounter some issues, and I have found solutions that seem to work well. I am sharing them with everyone in the hope that the author will confirm them.
- The silence parameter in read_line_examples_from_file has no default value.
Solve:
def read_line_examples_from_file(data_path, silence=False):
    ...
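For reference, here is a self-contained sketch of what the patched reader might look like. The ####-separated line format and the function body are assumptions based on common ABSA-style data loaders, not the repository's exact code; the essential change is only that silence now has a default value.

```python
def read_line_examples_from_file(data_path, silence=False):
    """Read examples, one per line; assumed format: 'sentence####labels'.

    The line format here is illustrative; the actual fix is simply that
    `silence` now defaults to False instead of being a required argument.
    """
    sents, labels = [], []
    with open(data_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            words, tuples = line.split("####")
            sents.append(words.split())
            labels.append(eval(tuples))  # labels stored as a Python literal
    if not silence:
        print(f"Total examples = {len(sents)}")
    return sents, labels
```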
- Can't assign hparams in the T5FineTuner model with the new lightning >= 2.0.0:
Solve:
self.hparams.update(vars(hparams))
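The reason direct assignment fails: in lightning >= 2.0, hparams is exposed through a read-only property, so the attribute cannot be rebound, but the mapping behind it can still be mutated in place. A stand-alone sketch of that behaviour (MockModule is an illustrative stand-in, not the real LightningModule):

```python
import argparse

class MockModule:
    """Illustrative stand-in for how LightningModule guards hparams."""

    def __init__(self):
        self._hparams = {}

    @property
    def hparams(self):
        # No setter is defined, so `self.hparams = ...` raises
        # AttributeError, while mutating the returned dict works fine.
        return self._hparams

args = argparse.Namespace(learning_rate=3e-4, num_train_epochs=20)
module = MockModule()
module.hparams.update(vars(args))  # the workaround shown above
```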
- Newer lightning versions no longer support training_epoch_end and validation_epoch_end, so these hooks must be renamed with the on_ prefix (note that the training hook becomes on_train_epoch_end, and validation_epoch_end becomes on_validation_epoch_end).
- Lightning has integrated the gradient step into its 2.0 hooks, so I commented out the optimizer_step function because I observed issues with the optimizer closure.
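If you would rather keep the hook than comment it out, one option is to let the optimizer run the closure itself, which is what recent Lightning versions expect. The signature below follows the Lightning 2.x hook shape; it is a sketch under that assumption, not the repository's code:

```python
def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_closure=None):
    # Lightning 2.x expects the forward/backward closure to be executed by
    # the optimizer itself; passing it through keeps features like AMP working.
    optimizer.step(closure=optimizer_closure)
    optimizer.zero_grad()
    self.lr_scheduler.step()
```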
- Because step outputs for train and validation are now handled internally by the model, an outputs parameter can no longer be passed to on_validation_epoch_end. Therefore, I removed it and replaced it with the following code snippet, which collects the outputs manually:
import torch
import lightning as L

class MyLightningModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.validation_step_outputs = []

    def validation_step(self, ...):
        loss = ...
        self.validation_step_outputs.append(loss)
        return loss

    def on_validation_epoch_end(self):
        epoch_average = torch.stack(self.validation_step_outputs).mean()
        self.log("validation_epoch_average", epoch_average)
        self.validation_step_outputs.clear()  # free memory
- The gpus parameter no longer exists in the Lightning Trainer; instead, add the devices parameter with the value 'auto' to automatically detect the available GPUs, and set accelerator='gpu'.
Solve:
train_params = dict(
    default_root_dir=args.output_dir,
    accumulate_grad_batches=args.gradient_accumulation_steps,
    devices='auto',
    gradient_clip_val=1.0,
    max_epochs=args.num_train_epochs,
    callbacks=[LoggingCallback()],
    accelerator='gpu'
)
These are the solutions I have found online. If there are any errors, please excuse them, and feel free to provide additional feedback.
@kisejin That's nicely said.
I would just like to mention one more point:
def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, second_order_closure=None):
    if self.trainer.use_tpu:
        xm.optimizer_step(optimizer)
    else:
        optimizer.step()
    optimizer.zero_grad()
    self.lr_scheduler.step()
The use_tpu check was causing an issue as well, so I commented it out:
def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, second_order_closure=None):
    optimizer.step()
    optimizer.zero_grad()
    self.lr_scheduler.step()
@kisejin Thanks for sharing this with us.
However, I am not able to reproduce the performance reported in the paper.
On the res15 dataset, across 200 runs,
I have seen performance between 0.42XX and 0.46XX.
Is anyone else seeing a significant performance difference?
@ssoyaavv are you testing this on a CPU device?
@LawrenceMoruye I used a GPU for testing.