Lightning-AI/pytorch-lightning

ModelCheckpoint tries to remove already removed checkpoint in DDP mode

Closed this issue · 11 comments

๐Ÿ› Bug

When training in DDP mode with the ModelCheckpoint callback, the training process fails when ModelCheckpoint tries to remove the previous checkpoint. I assume it was already deleted by another process.

To Reproduce

Steps to reproduce the behavior:

Run training with the "ddp" backend and a ModelCheckpoint callback with save_top_k={some_number}.
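A minimal sketch of such a setup (hypothetical; MyModel stands for any LightningModule with a validation loop that reports val_loss):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# keep the 3 best checkpoints; any save_top_k >= 1 triggers the cleanup path
checkpoint_callback = ModelCheckpoint(filepath='checkpoints', save_top_k=3)

trainer = Trainer(
    gpus=2,
    distributed_backend='ddp',
    checkpoint_callback=checkpoint_callback,
)
trainer.fit(MyModel())

The run then fails with the following traceback: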

  File "/home/myuser/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap                                                                                                 
    fn(i, *args)                                                                                
  File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 342, in ddp_train
    self.run_pretrain_routine(model)                                                                                                        
  File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 830, in run_pretrain_routine
    self.train()                                                                                                                                                                                                   
  File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 343, in train                                                      
    self.run_training_epoch()                            
  File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 452, in run_training_epoch                                                                       
    self.call_checkpoint_callback()                                                                                                                                                   
  File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 737, in call_checkpoint_callback
    self.checkpoint_callback.on_validation_end(self, self.get_model())                                                                                                                                             
  File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end                                      
    self._do_check_save(filepath, current, epoch)    
  File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 221, in _do_check_save                                                                      
    self._del_model(delpath)                                                                                                                                                           
  File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 121, in _del_model
    os.remove(filepath)                                                                                                                                                                                            
FileNotFoundError: [Errno 2] No such file or directory: {PREVIOUS_CHECKPOINT_NAME}

Expected behavior

I expect that ModelCheckpoint callbacks from different DDP processes will not compete with each other when saving/deleting files.

I fixed it by overriding the _del_model method of the ModelCheckpoint callback:

import os

from pytorch_lightning.callbacks import ModelCheckpoint


class DDPModelCheckpoint(ModelCheckpoint):
    def _del_model(self, filepath):
        # ignore the error if another process already removed the file
        try:
            os.remove(filepath)
        except Exception:
            pass

Environment

  • PyTorch Version: 1.4
  • OS: Ubuntu 18.04
  • How you installed PyTorch: conda
  • Python version: 3.7
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: 2x 2080 Ti
  • pytorch-lightning version: 0.7.1

Hi! Thanks for your contribution, great first issue!

Your suggestion to pass on an Exception is not the best; at the very least you should catch the specific error, i.e., FileNotFoundError. But in this case, I suggest simply doing:

if os.path.isfile(filepath):
    os.remove(filepath)

Is there also an issue with saving? Does it save/overwrite the file in multiple processes?

I encountered that one too. From my perspective, model updates should happen within the main (master) worker only. However, I guess each worker Lightning created is trying to delete its own checkpoints, even though the slave workers never created one (and they shouldn't). I solved the problem in a similar way to @belskikh's workaround, but it did not feel right, so I downgraded to 0.6.0.

@awaelchli of course it is not the best solution (it may be the worst, actually).
It is just a quick workaround while waiting for a solid fix.

I agree with the logic that only one ModelCheckpoint callback should save/delete weights, because all weights are the same on all nodes at the end of the training step.
It could be done something like this:

class ModelCheckpoint(...):
    def __init__(self, ..., main_worker_rank: int = 0):
        ...
        self.main_worker_rank = main_worker_rank

    def _del_model(self, filepath):
        # delete only from the designated worker
        if self.main_worker_rank == dist.get_rank():
            os.remove(filepath)

And the same for the saving code.
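For illustration, a hedged sketch of the matching guard on the save path, assuming _save_model delegates to self.save_function as it does in 0.7.x:

    def _save_model(self, filepath):
        # mirror of the delete guard: only the designated worker writes the file
        if self.main_worker_rank == dist.get_rank():
            self.save_function(filepath)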

@williamFalcon should Lightning do checkpoints only on rank 0? It could be a problem if writing to a shared filesystem between nodes. AFAIK the loggers already do that by only logging in process 0.

I do not think it should do it only on a specific rank; I think the user should have the ability to specify the rank (node) where checkpoints will be saved.

Borda commented

Your suggestion to pass on an Exception is not the best; at the very least you should catch the specific error, i.e., FileNotFoundError. But in this case, I suggest simply doing:

if os.path.isfile(filepath):
    os.remove(filepath)

This does not work when the deletes happen asynchronously: I have observed many times that the if check passes in both processes, but then when one of them actually tries to delete the file, it is already missing...
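For reference, a race-tolerant variant of the workaround above that skips the existence check entirely and suppresses only the specific error (a sketch, not the eventual fix):

import os
from contextlib import suppress

def _del_model(self, filepath):
    # os.path.isfile followed by os.remove can race across processes (TOCTOU);
    # suppressing only FileNotFoundError tolerates the process that loses the race
    with suppress(FileNotFoundError):
        os.remove(filepath)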

Should checkpointing then be done in only one process, like the loggers?

Yup, checkpointing should only happen from world_rank = 0, GPU 0.
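A minimal sketch of how such a gate could look at the trainer level, assuming the trainer exposes its process rank as proc_rank as it did around 0.7.x (illustrative only, not the actual fix):

def call_checkpoint_callback(self):
    # run the checkpoint callback only on the rank-0 process
    if self.proc_rank == 0 and self.checkpoint_callback is not None:
        self.checkpoint_callback.on_validation_end(self, self.get_model())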

Borda commented

Well, ModelCheckpoint doesn't have a rank...
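One hedged possibility, given the suggestion above to use dist.get_rank(): the callback could read the rank directly from torch.distributed, falling back to 0 when no process group is initialized (a hypothetical helper, not an existing API):

import torch.distributed as dist

def _current_rank() -> int:
    # rank from torch.distributed when a process group exists, else 0
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0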