Lightning-AI/pytorch-lightning

ModelCheckpoint tries to remove already removed checkpoint in DDP mode

Closed this issue · 11 comments

๐Ÿ› Bug

When training in DDP mode with the ModelCheckpoint callback, the training process fails when ModelCheckpoint tries to remove the previous checkpoint. I assume it was already deleted by another process.

To Reproduce

Steps to reproduce the behavior:

Run training with the "ddp" backend and a ModelCheckpoint callback with save_top_k={some_number}.
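A minimal sketch of such a setup (hypothetical; MyModel stands for any LightningModule with a validation loop that reports val_loss):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# keep the 3 best checkpoints; any save_top_k >= 1 triggers the cleanup path
checkpoint_callback = ModelCheckpoint(filepath='checkpoints', save_top_k=3)

trainer = Trainer(
    gpus=2,
    distributed_backend='ddp',
    checkpoint_callback=checkpoint_callback,
)
trainer.fit(MyModel())

The run then fails with the following traceback: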

  File "/home/myuser/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap                                                                                                 
    fn(i, *args)                                                                                
  File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 342, in ddp_train
    self.run_pretrain_routine(model)                                                                                                        
  File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 830, in run_pretrain_routine
    self.train()                                                                                                                                                                                                   
  File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 343, in train                                                      
    self.run_training_epoch()                            
  File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 452, in run_training_epoch                                                                       
    self.call_checkpoint_callback()                                                                                                                                                   
  File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 737, in call_checkpoint_callback
    self.checkpoint_callback.on_validation_end(self, self.get_model())                                                                                                                                             
  File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end                                      
    self._do_check_save(filepath, current, epoch)    
  File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 221, in _do_check_save                                                                      
    self._del_model(delpath)                                                                                                                                                           
  File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 121, in _del_model
    os.remove(filepath)                                                                                                                                                                                            
FileNotFoundError: [Errno 2] No such file or directory: {PREVIOUS_CHECKPOINT_NAME}

Expected behavior

I expect that ModelCheckpoint callbacks from different DDP processes will not compete with each other when saving/deleting files.

I fixed it by overriding the _del_model method of the ModelCheckpoint callback:

import os

from pytorch_lightning.callbacks import ModelCheckpoint


class DDPModelCheckpoint(ModelCheckpoint):
    def _del_model(self, filepath):
        # ignore the error if another process already removed the file
        try:
            os.remove(filepath)
        except Exception:
            pass

Environment

  • PyTorch Version: 1.4
  • OS: Ubuntu 18.04
  • How you installed PyTorch: conda
  • Python version: 3.7
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: 2x 2080 Ti
  • pytorch-lightning version: 0.7.1

Hi! Thanks for your contribution, great first issue!

Your suggestion to pass on an Exception is not the best; at the very least you should catch the specific error, i.e., FileNotFoundError. But in this case, I suggest simply doing:

if os.path.isfile(filepath):
    os.remove(filepath)

Is there also an issue with saving? Does it save/overwrite the file in multiple processes?

I encountered that one too. From my perspective, model updates should happen within the main (master) worker only. However, I guess each worker Lightning created is trying to delete its own checkpoints, even though the slave workers never created one (and they shouldn't). I solved the problem in a similar way to @belskikh's workaround, but it did not feel right, so I downgraded to 0.6.0.

@awaelchli of course it is not the best solution (it may be the worst, actually).
It is just a quick workaround while waiting for a solid fix.

I agree with the logic that only one ModelCheckpoint callback should save/delete weights, because all weights are the same on all nodes at the end of the training step.
It could be done something like this:

class ModelCheckpoint(...):
    def __init__(self, ..., main_worker_rank: int = 0):
        ...
        self.main_worker_rank = main_worker_rank

    def _del_model(self, filepath):
        # delete only from the designated worker
        if self.main_worker_rank == dist.get_rank():
            os.remove(filepath)

And the same for the saving code.
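For illustration, a hedged sketch of the matching guard on the save path, assuming _save_model delegates to self.save_function as it does in 0.7.x:

    def _save_model(self, filepath):
        # mirror of the delete guard: only the designated worker writes the file
        if self.main_worker_rank == dist.get_rank():
            self.save_function(filepath)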

@williamFalcon should Lightning do checkpoints only on rank 0? It could be a problem if writing to a shared filesystem between nodes. AFAIK the loggers already do that by only logging in process 0.

I do not think it should do it only on a specific rank; I think the user should have the ability to specify the rank (node) where checkpoints will be saved.

Borda commented

Your suggestion to pass on an Exception is not the best; at the very least you should catch the specific error, i.e., FileNotFoundError. But in this case, I suggest simply doing:

if os.path.isfile(filepath):
    os.remove(filepath)

This does not work when the deletes happen asynchronously: I have observed many times that the if check passes in both processes, but then when one of them actually tries to delete the file, it is already missing...
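For reference, a race-tolerant variant of the workaround above that skips the existence check entirely and suppresses only the specific error (a sketch, not the eventual fix):

import os
from contextlib import suppress

def _del_model(self, filepath):
    # os.path.isfile followed by os.remove can race across processes (TOCTOU);
    # suppressing only FileNotFoundError tolerates the process that loses the race
    with suppress(FileNotFoundError):
        os.remove(filepath)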

Should checkpointing then be done in only one process, like the loggers?

Yup, checkpointing should only happen from world_rank = 0, GPU 0.
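A minimal sketch of how such a gate could look at the trainer level, assuming the trainer exposes its process rank as proc_rank as it did around 0.7.x (illustrative only, not the actual fix):

def call_checkpoint_callback(self):
    # run the checkpoint callback only on the rank-0 process
    if self.proc_rank == 0 and self.checkpoint_callback is not None:
        self.checkpoint_callback.on_validation_end(self, self.get_model())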

Borda commented

Well, ModelCheckpoint doesn't have a rank...
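One hedged possibility, given the suggestion above to use dist.get_rank(): the callback could read the rank directly from torch.distributed, falling back to 0 when no process group is initialized (a hypothetical helper, not an existing API):

import torch.distributed as dist

def _current_rank() -> int:
    # rank from torch.distributed when a process group exists, else 0
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0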