ZD766: I/O operation on closed file.
mrembalski opened this issue · 9 comments
I'm using PyTorch Lightning with the Neptune logger. When Neptune (in async mode) crashes, the process stops responding and SLURM then stops the entire training job. Because of that, 5 out of 30 runs were stopped in our case, which makes the setup really unreliable.
The crash:
Original exception was:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 41, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 91, in launch
return function(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 568, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 973, in _run
results = self._run_stage()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1016, in _run_stage
self.fit_loop.run()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 201, in run
self.advance()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 354, in advance
self.epoch_loop.run(self._data_fetcher)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 133, in run
self.advance(data_fetcher)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 239, in advance
trainer._logger_connector.update_train_step_metrics()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 154, in update_train_step_metrics
self.log_metrics(self.metrics["log"])
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 108, in log_metrics
logger.log_metrics(metrics=scalar_metrics, step=step)
File "/usr/local/lib/python3.10/site-packages/lightning_utilities/core/rank_zero.py", line 32, in wrapped_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loggers/neptune.py", line 443, in log_metrics
self.run[key].append(val)
File "/usr/local/lib/python3.10/site-packages/neptune/handler.py", line 86, in inner_fun
return fun(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/neptune/handler.py", line 392, in append
self.extend(value, steps=step, timestamps=timestamp, wait=wait, **kwargs)
File "/usr/local/lib/python3.10/site-packages/neptune/handler.py", line 86, in inner_fun
return fun(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/neptune/handler.py", line 447, in extend
attr.extend(values, steps=steps, timestamps=timestamps, wait=wait, **kwargs)
File "/usr/local/lib/python3.10/site-packages/neptune/attributes/series/series.py", line 162, in extend
self._enqueue_operation(op, wait=wait)
File "/usr/local/lib/python3.10/site-packages/neptune/attributes/attribute.py", line 45, in _enqueue_operation
self._container._op_processor.enqueue_operation(operation, wait=wait)
File "/usr/local/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 93, in enqueue_operation
self._last_version = self._queue.put(op)
File "/usr/local/lib/python3.10/site-packages/neptune/internal/disk_queue.py", line 97, in put
self._writer = open(self._get_log_file(version), "a")
FileNotFoundError: [Errno 2] No such file or directory: '/experiment/.neptune/async/run__9a037a06-0890-47d1-9b5b-6c942a9411f8/exec-1693397227.992207-2023-08-30_14.07.07.992207-1544853/data-407164.log'
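The bottom frame shows `disk_queue.py` reopening its on-disk log segment in append mode and failing because the file's parent directory no longer exists. A minimal sketch of that failure mode (a hypothetical reconstruction, not Neptune's actual code: the directory name and cleanup step are assumptions standing in for whatever removed the `.neptune/async` run directory):

```python
import os
import shutil
import tempfile

# A writer appends to a log segment inside a queue directory, the way the
# async operation processor appends serialized operations to data-*.log.
queue_dir = tempfile.mkdtemp()
log_path = os.path.join(queue_dir, "data-0.log")

with open(log_path, "a") as writer:  # the first segment opens fine
    writer.write("operation-1\n")

# A concurrent cleanup (the assumed race) removes the whole run directory.
shutil.rmtree(queue_dir)

# Reopening the segment in append mode now fails exactly as in the traceback:
# open(path, "a") creates a missing *file*, but not a missing *directory*.
try:
    open(log_path, "a")
except FileNotFoundError as exc:
    print(f"reproduced: {exc.strerror}")
```

This is why the error surfaces as `FileNotFoundError` on an `open(..., "a")` call: append mode will create the file itself, but raises `[Errno 2]` if any directory on the path has been deleted underneath the writer.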
Environment
neptune 1.4.1
pytorch-lightning 2.0.6
torch 2.1.0.dev20230805+cu121
The output of python --version:
Singularity> python --version
Python 3.10.10
Hello @mrembalski 👋
I see that you are using neptune==1.4.1. In 1.5.0 we made some improvements to how we handle the cleanup of async artifacts, which could cause the FileNotFoundError in previous versions.
Could you please update your neptune to the latest version and let me know if you still face this issue? 🙏
pip install -U neptune
I am also facing the same behavior, wherein only some of the DDP ranks get killed. @SiddhantSadangi I would like to better understand what kind of fix was made in 1.5.0. Did it fix an incorrect cleanup of a file that is actually required for Neptune async logging?
Hi @SiddhantSadangi, is there any update on this issue? It is continuously killing our training jobs at least once every 2 days. As a paying customer we expect a much better response than this. Please help at the earliest.
I am attaching the relevant log files, as well as images showing that I have indeed upgraded to version 1.5.0 as recommended, yet we still face the same issue repeatedly.
Hello @harishankar-gopalan,
I am so sorry I missed your query. The fixes in 1.5.0 were related to how we handle forking. However, since you are still facing the issue on 1.5.0, I will escalate this to engineering right away.
Could you please send the details below to support@neptune.ai?
1. Reproduction
In as much detail as possible, please provide steps to reproduce the issue, or a code snippet.
2. Traceback
The complete error traceback and log output/screenshots to help explain your problem.
3. Environment info
- The output of pip list:
- The operating system you're using:
- The output of python --version:
4. Additional context
Add any other context about the problem here.
Hi @SiddhantSadangi, I have now sent the details from my corporate email ID.
Thanks for sharing the details, and apologies for the delay here 🙏
I've already escalated this to the engineering team.
Hey @mrembalski , @harishankar-gopalan ,
We have prepared a pre-release version, which addresses a race condition that could have led to this issue.
Could you please update your client to 1.7.0rc1 and let me know if this works?
pip install neptune==1.7.0rc1
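For anyone hit by this before upgrading, one possible stopgap (not suggested in the thread, so treat it as an assumption) is to bypass the async disk queue entirely by running Neptune in synchronous mode. NeptuneLogger forwards extra keyword arguments to neptune.init_run(), whose mode parameter accepts "sync"; the credentials below are placeholders:

```python
from lightning.pytorch.loggers import NeptuneLogger

# Sketch only: "sync" sends each operation inline instead of buffering it in
# the on-disk async queue that the traceback shows failing. This trades
# logging throughput for robustness, so it is a workaround, not a fix.
neptune_logger = NeptuneLogger(
    api_key="<YOUR_API_TOKEN>",   # placeholder
    project="workspace/project",  # placeholder
    mode="sync",                  # forwarded to neptune.init_run()
)
```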
Hi @SiddhantSadangi,
I have not encountered the issue since updating to 1.6.1. If I encounter it again, I'll post it here.
Thanks for letting us know @mrembalski .
Since I already have an email thread open with @harishankar-gopalan , I'll close this issue thread ✅