ZD766: I/O operation on closed file.
mrembalski opened this issue · 9 comments
I'm using PyTorch Lightning with the Neptune logger. When Neptune (in async mode) crashes, the process stops responding and SLURM then stops the entire training job. Because of that, 5 out of 30 runs were stopped in our case, which makes the setup really unreliable.
The crash:
Original exception was:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 41, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 91, in launch
return function(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 568, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 973, in _run
results = self._run_stage()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1016, in _run_stage
self.fit_loop.run()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 201, in run
self.advance()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 354, in advance
self.epoch_loop.run(self._data_fetcher)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 133, in run
self.advance(data_fetcher)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 239, in advance
trainer._logger_connector.update_train_step_metrics()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 154, in update_train_step_metrics
self.log_metrics(self.metrics["log"])
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 108, in log_metrics
logger.log_metrics(metrics=scalar_metrics, step=step)
File "/usr/local/lib/python3.10/site-packages/lightning_utilities/core/rank_zero.py", line 32, in wrapped_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loggers/neptune.py", line 443, in log_metrics
self.run[key].append(val)
File "/usr/local/lib/python3.10/site-packages/neptune/handler.py", line 86, in inner_fun
return fun(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/neptune/handler.py", line 392, in append
self.extend(value, steps=step, timestamps=timestamp, wait=wait, **kwargs)
File "/usr/local/lib/python3.10/site-packages/neptune/handler.py", line 86, in inner_fun
return fun(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/neptune/handler.py", line 447, in extend
attr.extend(values, steps=steps, timestamps=timestamps, wait=wait, **kwargs)
File "/usr/local/lib/python3.10/site-packages/neptune/attributes/series/series.py", line 162, in extend
self._enqueue_operation(op, wait=wait)
File "/usr/local/lib/python3.10/site-packages/neptune/attributes/attribute.py", line 45, in _enqueue_operation
self._container._op_processor.enqueue_operation(operation, wait=wait)
File "/usr/local/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 93, in enqueue_operation
self._last_version = self._queue.put(op)
File "/usr/local/lib/python3.10/site-packages/neptune/internal/disk_queue.py", line 97, in put
self._writer = open(self._get_log_file(version), "a")
FileNotFoundError: [Errno 2] No such file or directory: '/experiment/.neptune/async/run__9a037a06-0890-47d1-9b5b-6c942a9411f8/exec-1693397227.992207-2023-08-30_14.07.07.992207-1544853/data-407164.log'
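The bottom frame shows `disk_queue.py` reopening its on-disk log segment in append mode and failing because the file's parent directory no longer exists. A minimal sketch of that failure mode (a hypothetical reconstruction, not Neptune's actual code: the directory name and cleanup step are assumptions standing in for whatever removed the `.neptune/async` run directory):

```python
import os
import shutil
import tempfile

# A writer appends to a log segment inside a queue directory, the way the
# async operation processor appends serialized operations to data-*.log.
queue_dir = tempfile.mkdtemp()
log_path = os.path.join(queue_dir, "data-0.log")

with open(log_path, "a") as writer:  # the first segment opens fine
    writer.write("operation-1\n")

# A concurrent cleanup (the assumed race) removes the whole run directory.
shutil.rmtree(queue_dir)

# Reopening the segment in append mode now fails exactly as in the traceback:
# open(path, "a") creates a missing *file*, but not a missing *directory*.
try:
    open(log_path, "a")
except FileNotFoundError as exc:
    print(f"reproduced: {exc.strerror}")
```

This is why the error surfaces as `FileNotFoundError` on an `open(..., "a")` call: append mode will create the file itself, but raises `[Errno 2]` if any directory on the path has been deleted underneath the writer.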
Environment
neptune 1.4.1
pytorch-lightning 2.0.6
torch 2.1.0.dev20230805+cu121
The output of python --version:
Singularity> python --version
Python 3.10.10
Hello @mrembalski 👋
I see that you are using neptune==1.4.1. In 1.5.0 we made some improvements to how we handle the cleanup of async artifacts, which could cause the FileNotFoundError in previous versions.
Could you please update your neptune to the latest version and let me know if you still face this issue? 🙏
pip install -U neptune
I am also facing the same behavior, wherein only some of the DDP ranks get killed. @SiddhantSadangi I would like to better understand what kind of fix was made in 1.5.0. Did it fix an incorrect cleanup of a file that is actually required for Neptune async logging?
Hi @SiddhantSadangi, is there any update on this issue? It is continuously killing our training jobs at least once every 2 days. As a paying customer we expect a much better response than this. Please help at the earliest.
I am attaching the relevant log files, as well as images showing that I have indeed upgraded to version 1.5.0 as recommended, yet we still face the same issue repeatedly.
Hello @harishankar-gopalan,
I am so sorry I missed your query. The fixes in 1.5.0 were related to how we handle forking. However, since you are still facing the issue on 1.5.0, I will escalate this to engineering right away.
Could you please send the details below to support@neptune.ai?
1. Reproduction
In as much detail as possible, please provide steps to reproduce the issue, or a code snippet.
2. Traceback
The complete error traceback and log output/screenshots to help explain your problem.
3. Environment info
- The output of pip list:
- The operating system you're using:
- The output of python --version:
4. Additional context
Add any other context about the problem here.
Hi @SiddhantSadangi, I have now sent the details from my corporate email ID.
Thanks for sharing the details, and apologies for the delay here 🙏
I've already escalated this to the engineering team.
Hey @mrembalski , @harishankar-gopalan ,
We have prepared a pre-release version, which addresses a race condition that could have led to this issue.
Could you please update your client to 1.7.0rc1 and let me know if this works?
pip install neptune==1.7.0rc1
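For anyone hit by this before upgrading, one possible stopgap (not suggested in the thread, so treat it as an assumption) is to bypass the async disk queue entirely by running Neptune in synchronous mode. NeptuneLogger forwards extra keyword arguments to neptune.init_run(), whose mode parameter accepts "sync"; the credentials below are placeholders:

```python
from lightning.pytorch.loggers import NeptuneLogger

# Sketch only: "sync" sends each operation inline instead of buffering it in
# the on-disk async queue that the traceback shows failing. This trades
# logging throughput for robustness, so it is a workaround, not a fix.
neptune_logger = NeptuneLogger(
    api_key="<YOUR_API_TOKEN>",   # placeholder
    project="workspace/project",  # placeholder
    mode="sync",                  # forwarded to neptune.init_run()
)
```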
Hi @SiddhantSadangi,
I have not encountered the issue since updating to 1.6.1. If I encounter it again, I'll post it here.
Thanks for letting us know @mrembalski .
Since I already have an email thread open with @harishankar-gopalan , I'll close this issue thread ✅