microsoft/protein-frame-flow

'LengthBatcher' object has no attribute 'sample_order'

rish-16 opened this issue · 2 comments

Hey Jason and Team, thanks for the amazing repo!!

I tried to retrain on SCOPe on my setup (2 RTX 3090s) and hit the error below, which crashes training. The same crash happens with 1 GPU.

To reproduce: python train_se3_flows.py (I reorganised the files a bit to make it cleaner/more manageable)

Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/rishabh/protein-frame-flow/train_se3_flows.py", line 97, in main
    exp.train()
  File "/home/rishabh/protein-frame-flow/train_se3_flows.py", line 72, in train
    trainer.fit(
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 194, in run
    self.setup_data()
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 250, in setup_data
    length = len(dl) if has_len_all_ranks(dl, trainer.strategy, allow_zero_length) else float("inf")
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py", line 97, in has_len_all_ranks
    local_length = sized_len(dataloader)
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/lightning_fabric/utilities/data.py", line 51, in sized_len
    length = len(dataloader)  # type: ignore [arg-type]
  File "/home/rishabh/miniconda3/envs/fyp/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 483, in __len__
    return len(self._index_sampler)
  File "/home/rishabh/protein-frame-flow/src/data/pdb_dataloader.py", line 250, in __len__
    return len(self.sample_order)
AttributeError: 'LengthBatcher' object has no attribute 'sample_order'

I've narrowed down the issue to this line: https://github.com/microsoft/protein-frame-flow/blob/main/data/pdb_dataloader.py#L245

My guess is that the self._create_batches() call on L245 isn't actually being reached in the __iter__(...) method. I tried printing the sample_order variable there and nothing was printed, so that line doesn't seem to run at all. Do you think it's a PyTorch / Lightning issue?

I've been trying to find workarounds for a while but nothing has worked yet. Appreciate any leads on this :)

That's odd. Are you using the same version of lightning as what's in the fm.yml? I wonder if lightning changed something so that it calls len before it calls iter.

Hi, closing this for now, but please reopen if you are still running into problems.
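For anyone reopening, it may help to include the installed package versions so they can be compared against fm.yml. A small sketch using only the standard library; the distribution names below are assumptions based on the packages that appear in the traceback and may differ in your environment:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Assumed distribution names for the modules seen in the traceback.
for name in ("pytorch-lightning", "lightning-fabric", "torch"):
    print(f"{name}: {installed_version(name)}")
```

Run this inside the same conda environment used for training and paste the output alongside the versions pinned in fm.yml.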