Performance bug for multi-field configuration in train_continue and num_loader_workers >= 2
Closed this issue · 0 comments
iluise commented
Opening a performance bug for the following error, which occurs when training the multiformer on a large number of nodes (in this case 32) with num_loader_workers >= 2:
19: warnings.warn(_create_warning_msg(
19: Traceback (most recent call last):
19: File "/p/project/atmo-rep/ilaria/atmorep/atmorep_github/atmorep/atmorep/core/train_multi12h.py", line 239, in <module>
19: train_continue( model_id, model_epoch, Trainer, model_epoch_continue)
19: File "/p/project/atmo-rep/ilaria/atmorep/atmorep_github/atmorep/atmorep/core/train_multi12h.py", line 68, in train_continue
19: trainer.run( model_epoch_continue)
19: File "/p/project/atmo-rep/ilaria/atmorep/atmorep_github/atmorep/atmorep/core/trainer.py", line 206, in run
19: self.train( epoch)
19: File "/p/project/atmo-rep/ilaria/atmorep/atmorep_github/atmorep/atmorep/core/trainer.py", line 234, in train
19: model.mode( NetMode.train)
19: File "/p/project/atmo-rep/ilaria/atmorep/atmorep_github/atmorep/atmorep/core/atmorep_model.py", line 167, in mode
19: self.data_loader_iter = iter(self.data_loader_train)
19: File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 438, in __iter__
19: return self._get_iterator()
19: File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 386, in _get_iterator
19: return _MultiProcessingDataLoaderIter(self)
19: File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1039, in __init__
19: w.start()
19: File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
19: self._popen = self._Popen(self)
19: File "/usr/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
19: return _default_context.get_context().Process._Popen(process_obj)
19: File "/usr/lib/python3.10/multiprocessing/context.py", line 281, in _Popen
19: return Popen(process_obj)
19: File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
19: self._launch(process_obj)
19: File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 66, in _launch
19: self.pid = os.fork()
19: OSError: [Errno 12] Cannot allocate memory
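
The traceback shows that the failure happens in `os.fork()` while the DataLoader starts its worker processes, so the error depends on how many workers each rank tries to launch. As a possible stopgap (not the fix adopted in this repository), the sketch below retries iterator creation with fewer workers whenever startup fails with `ENOMEM`. The names `start_with_fallback` and `make_iter` are hypothetical; `make_iter(n)` is assumed to wrap something like `iter(DataLoader(dataset, num_workers=n))`.

```python
import errno


def start_with_fallback(make_iter, num_workers, min_workers=0):
    """Call make_iter(n), halving n each time startup fails with ENOMEM.

    make_iter is a hypothetical callable that builds the data loader
    iterator for a given worker count. Returns (iterator, workers_used).
    """
    n = num_workers
    while True:
        try:
            return make_iter(n), n
        except OSError as exc:
            # Re-raise anything that is not "Cannot allocate memory",
            # or give up once the lower bound has already been tried.
            if exc.errno != errno.ENOMEM or n <= min_workers:
                raise
            n = max(min_workers, n // 2)
```

With `num_workers=0` the DataLoader loads data in the main process and does not fork, so the fallback ends there at the cost of slower loading. This only works around the symptom; the memory footprint of the parent process at fork time would still need to be investigated.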