'LengthBatcher' object has no attribute 'sample_order'
rish-16 opened this issue · 2 comments
Hey Jason and Team, thanks for the amazing repo!!
I tried to retrain on SCOPe on my setup (2 RTX3090s) and am running into the issue below, which causes training to crash. I also tried it with a single GPU and it crashed the same way.
To reproduce: `python train_se3_flows.py`
(I reorganised the files a bit to make it cleaner/more manageable)
```
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/rishabh/protein-frame-flow/train_se3_flows.py", line 97, in main
    exp.train()
  File "/home/rishabh/protein-frame-flow/train_se3_flows.py", line 72, in train
    trainer.fit(
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 194, in run
    self.setup_data()
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 250, in setup_data
    length = len(dl) if has_len_all_ranks(dl, trainer.strategy, allow_zero_length) else float("inf")
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py", line 97, in has_len_all_ranks
    local_length = sized_len(dataloader)
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/lightning_fabric/utilities/data.py", line 51, in sized_len
    length = len(dataloader)  # type: ignore [arg-type]
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 483, in __len__
    return len(self._index_sampler)
  File "/home/rishabh/protein-frame-flow/src/data/pdb_dataloader.py", line 250, in __len__
    return len(self.sample_order)
AttributeError: 'LengthBatcher' object has no attribute 'sample_order'
```
I've narrowed down the issue to this line: https://github.com/microsoft/protein-frame-flow/blob/main/data/pdb_dataloader.py#L245

My guess is that the `self._create_batches()` call at L245 isn't actually being reached via the `__iter__(...)` method. I tried printing the `sample_order` variable and nothing was printed, so that line isn't run at all. Do you think it's a PyTorch / Lightning issue?
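For anyone hitting the same thing: the failure mode above is a batch sampler whose `__len__` depends on state that only gets created inside `__iter__`. Below is a minimal, self-contained sketch of that pattern and one defensive workaround (building the batch order lazily in `__len__` as well). The class body here is a hypothetical simplification, not the repo's actual `LengthBatcher`; only the `sample_order` / `_create_batches` names come from the traceback.

```python
import random

class LengthBatcher:
    """Minimal sketch of a length-bucketing batch sampler.

    Bug pattern from the traceback: `sample_order` is only created by
    `_create_batches()` inside `__iter__`, so any `len(sampler)` call
    that happens *before* iteration raises AttributeError.
    """

    def __init__(self, lengths, max_batch_size=4, seed=0):
        self._lengths = lengths            # length of each dataset item
        self._max_batch_size = max_batch_size
        self._rng = random.Random(seed)
        # NOTE: intentionally NOT calling self._create_batches() here,
        # to mirror the original lazy design.

    def _create_batches(self):
        # Group indices by length so each batch holds similarly sized items.
        by_len = {}
        for idx, n in enumerate(self._lengths):
            by_len.setdefault(n, []).append(idx)
        batches = []
        for idxs in by_len.values():
            for i in range(0, len(idxs), self._max_batch_size):
                batches.append(idxs[i:i + self._max_batch_size])
        self._rng.shuffle(batches)
        self.sample_order = batches

    def __iter__(self):
        self._create_batches()  # reshuffle every epoch
        return iter(self.sample_order)

    def __len__(self):
        # Workaround: newer Lightning versions size the dataloader with
        # len() before ever calling iter(), so build the order on demand.
        if not hasattr(self, "sample_order"):
            self._create_batches()
        return len(self.sample_order)
```

With the `hasattr` guard in `__len__`, calling `len(batcher)` before the first epoch no longer crashes, and `__iter__` still reshuffles per epoch.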
I've been trying to find workarounds for a while but nothing has worked yet. Appreciate any leads on this :)
That's odd. Are you using the same version of Lightning as what's pinned in the `fm.yml`? I wonder if Lightning changed something so it calls `len` before it calls `iter`.
Hi, closing this for now but please reopen if you are still running into problems.