ModelCheckpoint tries to remove already removed checkpoint in DDP mode
Closed this issue · 11 comments
🐛 Bug
When training in DDP mode with the ModelCheckpoint callback, the training process fails when ModelCheckpoint tries to remove the previous checkpoint. I assume it was already deleted by another process.
To Reproduce
Steps to reproduce the behavior:
Run training with the "ddp" backend and a ModelCheckpoint callback with save_top_k={some_number}
File "/home/myuser/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 342, in ddp_train
self.run_pretrain_routine(model)
File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 830, in run_pretrain_routine
self.train()
File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 343, in train
self.run_training_epoch()
File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 452, in run_training_epoch
self.call_checkpoint_callback()
File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 737, in call_checkpoint_callback
self.checkpoint_callback.on_validation_end(self, self.get_model())
File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
self._do_check_save(filepath, current, epoch)
File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 221, in _do_check_save
self._del_model(delpath)
File "/home/myuser/miniconda3/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 121, in _del_model
os.remove(filepath)
FileNotFoundError: [Errno 2] No such file or directory: {PREVIOUS_CHECKPOINT_NAME}
Expected behavior
I expect that ModelCheckpoint callbacks from different DDP processes will not compete with each other when saving/deleting files.
I fixed it by overriding the _del_model method of the ModelCheckpoint callback:
class DDPModelCheckpoint(ModelCheckpoint):
    def _del_model(self, filepath):
        try:
            os.remove(filepath)
        except Exception:
            pass
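A slightly tighter version of the workaround (a sketch with a hypothetical helper name, not Lightning API) swallows only the expected "file already gone" error instead of every exception:

```python
import os
from contextlib import suppress

def safe_remove(filepath):
    # Hypothetical helper: ignore only FileNotFoundError, so that real
    # failures (e.g. PermissionError) still surface instead of being
    # silently swallowed by a broad `except Exception: pass`.
    with suppress(FileNotFoundError):
        os.remove(filepath)
```

Calling it twice on the same path is safe: the second call is simply a no-op.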
Environment
- PyTorch Version: 1.4
- OS: Ubuntu 18.04
- How you installed PyTorch: conda
- Python version: 3.7
- CUDA/cuDNN version: 10.2
- GPU models and configuration: 2x2080Ti
- pytorch-lightning version: 0.7.1
Hi! Thanks for your contribution, great first issue!
Your suggestion to pass on an Exception is not the best; at the very least you should catch the specific error, i.e., FileNotFoundError. But in this case, I suggest simply doing
if os.path.isfile(filepath):
    os.remove(filepath)
Is there also an issue with saving? Does it save/overwrite the file in multiple processes?
I encountered that one too. From my perspective, model updates should happen within the main (master) worker only. However, I guess each worker Lightning created is trying to delete its own checkpoints, while the non-master workers never created one (and they shouldn't). I solved the problem in a similar way to @belskikh's workaround, but it did not feel right, so I downgraded to 0.6.0.
@awaelchli of course it is not the best (it may actually be the worst) solution. It is just a quick workaround while waiting for a solid fix.
I agree with the logic that only one ModelCheckpoint callback should save/delete weights, because the weights are the same on all nodes at the end of the training step.
It could be done something like this:
class ModelCheckpoint(..., main_worker_rank: int = 0):
    ....
    def _del_model(self, filepath):
        if self.main_worker_rank == dist.get_rank():
            # do delete
And the same for the saving code.
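A self-contained sketch of that idea (the class and method names here are hypothetical; in real DDP the rank would come from torch.distributed.get_rank() rather than being passed in):

```python
import os

class RankAwareCheckpoint:
    """Sketch: only the designated main worker touches checkpoint files."""

    def __init__(self, main_worker_rank: int = 0):
        self.main_worker_rank = main_worker_rank

    def _del_model(self, filepath, current_rank):
        # In actual DDP code, current_rank would be torch.distributed.get_rank().
        if current_rank != self.main_worker_rank:
            return  # non-main workers never delete checkpoints
        if os.path.isfile(filepath):
            os.remove(filepath)
```

Because only one rank ever deletes, the cross-process race disappears by construction; the isfile check then only guards against the file never having been written.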
@williamFalcon should Lightning do checkpoints only on rank 0? It could be a problem if writing to a shared filesystem between nodes. AFAIK the loggers already do that by only logging in process 0.
I do not think it should do it only on a specific rank; I think the user should have the ability to specify the rank (node) where checkpoints will be saved.
Your suggestion to pass on an Exception is not the best, at least you should make it the specific error, i.e., FileNotFoundError. But in this case, I suggest to do simply
if os.path.isfile(filepath):
    os.remove(filepath)
this does not work with asynchronous processes. I have observed many times that the if check passes for both processes, but when one of them then actually tries to delete the file, it is already missing...
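The failure mode described above is a classic check-then-act race. It can be simulated in a single process by interleaving the two workers' steps by hand: both checks pass before either deletion happens, so the second deletion raises FileNotFoundError:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
open(path, "w").close()

# Both "workers" run their isfile check before either deletes.
worker_a_sees_file = os.path.isfile(path)  # True
worker_b_sees_file = os.path.isfile(path)  # True, both checks pass

os.remove(path)  # worker A deletes first
raced = False
try:
    if worker_b_sees_file:
        os.remove(path)  # worker B: the file is already gone
except FileNotFoundError:
    raced = True  # the guarded delete still failed
```

This is why catching FileNotFoundError (or deleting from only one rank) is more robust than checking for the file's existence first.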
should checkpointing be done in only one process then, like loggers?
yup. checkpointing should only happen from world_rank = 0, gpu 0
well, ModelCheckpoint doesn't have a rank...
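One pragmatic way a rank-less callback could discover its rank is via environment variables, assuming the DDP launcher exports them (this helper and the fallback behavior are an illustration, not Lightning's actual mechanism):

```python
import os

def global_rank(default: int = 0) -> int:
    # Hypothetical helper: DDP launchers commonly export RANK (and/or
    # LOCAL_RANK); fall back to `default` when running without a launcher,
    # so single-process runs behave like rank 0.
    for var in ("RANK", "LOCAL_RANK"):
        if var in os.environ:
            return int(os.environ[var])
    return default
```

A callback could then guard its save/delete paths with `if global_rank() != 0: return`.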