neptune-ai/neptune-client

PPY-66: `MetadataInconsistency` throws a fatal error in `sync` mode | Neptune client crashes when uploading checkpoints if they rotate faster than uploads

Opened this issue · 5 comments

Describe the bug

I'm fairly reliably seeing the following type of exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 297, in run
    super().run()
  File "/opt/conda/lib/python3.10/site-packages/neptune/internal/threading/daemon.py", line 96, in run
    self.work()
  File "/opt/conda/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 315, in work
    self.process_batch([element.obj for element in batch], batch[-1].ver, batch[-1].at)
  File "/opt/conda/lib/python3.10/site-packages/neptune/internal/threading/daemon.py", line 121, in wrapper
    result = func(self_, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 339, in process_batch
    processed_count, errors = self._processor._backend.execute_operations(
  File "/opt/conda/lib/python3.10/site-packages/neptune/internal/backends/hosted_neptune_backend.py", line 498, in execute_operations
    self._execute_upload_operations_with_400_retry(
  File "/opt/conda/lib/python3.10/site-packages/neptune/internal/backends/hosted_neptune_backend.py", line 605, in _execute_upload_operations_with_400_retry
    return self._execute_upload_operations(
  File "/opt/conda/lib/python3.10/site-packages/neptune/internal/backends/hosted_neptune_backend.py", line 559, in _execute_upload_operations
    upload_errors = upload_file_attribute(
  File "/opt/conda/lib/python3.10/site-packages/neptune/internal/backends/hosted_file_operations.py", line 119, in upload_file_attribute
    _multichunk_upload_with_retry(
  File "/opt/conda/lib/python3.10/site-packages/neptune/internal/backends/hosted_file_operations.py", line 290, in _multichunk_upload_with_retry
    return _multichunk_upload(upload_entry, swagger_client, query_params, multipart_config, urlset)
  File "/opt/conda/lib/python3.10/site-packages/neptune/internal/backends/hosted_file_operations.py", line 335, in _multichunk_upload
    for idx, chunk in enumerate(chunker.generate()):
  File "/opt/conda/lib/python3.10/site-packages/neptune/common/storage/datastream.py", line 72, in generate
    if last_change and last_change < os.stat(self._filename).st_mtime:
FileNotFoundError: [Errno 2] No such file or directory: '/app/.neptune/fiducial-training/FIDS-650/checkpoints/epoch=9-step=100.ckpt'

My training loop takes about 15s per epoch and is set to run validation every 10 epochs; validation also saves a model checkpoint whenever the validation metric improves. At the start of training every validation step improves, so I churn through a lot of checkpoints. Each checkpoint is about 500MB, and I have it set to save the top 3.

I've also noticed that checkpoints are initially written as .bin files and then seem to get renamed to .ckpt files - perhaps this renaming happens during the upload and is causing the issue?

I'm using PyTorch Lightning with its built-in ModelCheckpoint callback and the built-in NeptuneLogger integration.
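
For reference, my setup is roughly equivalent to the sketch below (the monitored metric, filename pattern, and Neptune project name are placeholders, not my real values):

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import ModelCheckpoint
    from pytorch_lightning.loggers import NeptuneLogger

    # Keep only the 3 best checkpoints; older ones are deleted as better ones arrive.
    checkpoint_callback = ModelCheckpoint(
        monitor="val_loss",          # placeholder metric name
        mode="min",
        save_top_k=3,
        filename="{epoch}-{step}",
    )

    neptune_logger = NeptuneLogger(project="my-workspace/my-project")  # placeholder

    trainer = Trainer(
        logger=neptune_logger,
        callbacks=[checkpoint_callback],
        check_val_every_n_epoch=10,  # validation (and checkpointing) every 10 epochs
    )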

Reproduction

Unfortunately I don't have a shareable reproduction. I can fairly reliably reproduce this in production, though it seems network-dependent: yesterday the same code didn't have this problem, but I'm constantly hitting it this morning.

It's fairly severe, as it means I completely lose any real-time metrics.

Expected behavior

All checkpoints would upload and rotate as expected.

Traceback

See above.

Environment

I can't get this right now; please shout if you think you need it!

Hey @ocharles 👋
Thanks for reporting this.

While we investigate, I can suggest some workarounds:

  1. Since this seems to be happening because the models are deleted by the ModelCheckpoint() callback before Neptune has had a chance to upload them, you can:
    a. Upload all checkpoints instead of using save_top_k, OR
    b. Pass mode="sync" when creating NeptuneLogger() (or directly to the init_run() method if logging to an existing run). This will log to Neptune in synchronous mode, guaranteeing that the checkpoints have been uploaded to Neptune before proceeding. However, this can significantly increase runtime.

  2. If you don't want to upload checkpoints to Neptune, you can turn this off by passing log_model_checkpoints=False when creating NeptuneLogger(). (Both options are sketched below.)
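
Here is a rough sketch of both options (the project name is a placeholder; extra keyword arguments passed to NeptuneLogger() are forwarded to neptune.init_run(), which is how mode="sync" gets through):

    from pytorch_lightning import Trainer
    from pytorch_lightning.loggers import NeptuneLogger

    # Option 1b: synchronous logging. Checkpoints are uploaded before training
    # proceeds, at the cost of extra runtime whenever something is logged.
    neptune_logger = NeptuneLogger(
        project="my-workspace/my-project",  # placeholder
        mode="sync",                        # forwarded to neptune.init_run()
    )

    # Option 2: keep asynchronous logging but skip checkpoint uploads entirely.
    # neptune_logger = NeptuneLogger(
    #     project="my-workspace/my-project",
    #     log_model_checkpoints=False,
    # )

    trainer = Trainer(logger=neptune_logger)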

Would any of these be an acceptable workaround for now?

Oh, mode=sync sounds good, I missed that! I would like to keep checkpoints uploaded as Neptune makes it so convenient and it's really nice to have all the information! I guess the ideal behaviour would be that only checkpoints are uploaded synchronously, but I can probably live with a minor hit at the end of validation or whenever I log training steps.

Thanks, I'll try this when I'm back at my computer next week. I really appreciate the (incredibly) speedy response!

I guess the ideal behaviour would be that only checkpoints are uploaded synchronously

Most of the sync lag will come from uploading checkpoints, so there shouldn't be a noticeable difference between syncing the entire state of the run and syncing just the checkpoints.

Please let me know if using mode="sync" works for you.

I'll be at a company offsite next week and then at NeurIPS, but will look into this once back.

I think the switch to sync mode might now be turning what was previously only a warning into a fatal exception:

neptune.exceptions.MetadataInconsistency: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 49.0

I know this is known (#733), but I've never seen this cause a crash. Here's the stack trace:

Traceback (most recent call last):
  File "/app/run_pipeline.py", line 79, in <module>
    main()
  File "/app/run_pipeline.py", line 59, in main
    checkpoint_path = train_model(
  File "/app/train.py", line 243, in train_model
    trainer.fit(model, data_module)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 47, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 206, in run
    self.on_advance_end()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 392, in on_advance_end
    trainer._logger_connector.update_train_epoch_metrics()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 168, in update_train_epoch_metrics
    self.log_metrics(self.metrics["log"])
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 117, in log_metrics
    logger.log_metrics(metrics=scalar_metrics, step=step)
  File "/opt/conda/lib/python3.10/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loggers/neptune.py", line 60, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loggers/neptune.py", line 460, in log_metrics
    self.run[key].append(val, step=step)
  File "/opt/conda/lib/python3.10/site-packages/neptune/handler.py", line 88, in inner_fun
    return fun(self, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/neptune/handler.py", line 404, in append
    self.extend(value, steps=step, timestamps=timestamp, wait=wait, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/neptune/handler.py", line 88, in inner_fun
    return fun(self, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/neptune/handler.py", line 459, in extend
    attr.extend(values, steps=steps, timestamps=timestamps, wait=wait, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/neptune/attributes/series/series.py", line 162, in extend
    self._enqueue_operation(op, wait=wait)
  File "/opt/conda/lib/python3.10/site-packages/neptune/attributes/attribute.py", line 45, in _enqueue_operation
    self._container._op_processor.enqueue_operation(operation, wait=wait)
  File "/opt/conda/lib/python3.10/site-packages/neptune/internal/utils/disk_utilization.py", line 171, in wrapper
    error_handler(max_disk_utilization, func, *args, **kwargs).run()
  File "/opt/conda/lib/python3.10/site-packages/neptune/internal/utils/disk_utilization.py", line 102, in run
    return self.handle_limit_not_set()
  File "/opt/conda/lib/python3.10/site-packages/neptune/internal/utils/disk_utilization.py", line 146, in handle_limit_not_set
    return self.func(*self.args, **self.kwargs)
  File "/opt/conda/lib/python3.10/site-packages/neptune/internal/operation_processors/sync_operation_processor.py", line 82, in enqueue_operation
    raise errors[0]
neptune.exceptions.MetadataInconsistency: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 49.0

Thanks for bringing this to our attention. It should be a warning, not a fatal error. I was able to reproduce it and have created a ticket for the team to look into it.

Meanwhile, would you be open to temporarily patching the Lightning source code to handle this error?
In the lightning > pytorch > loggers > neptune.py file, update the log_metrics() function as follows:

    # Add this import near the other imports at the top of neptune.py:
    from neptune.exceptions import MetadataInconsistency

    # In log_metrics(), wrap the per-metric append() call so that
    # MetadataInconsistency is downgraded to a warning (`log` is the
    # module-level logger already defined in neptune.py):
    for key, val in metrics.items():
        try:
            self.run[key].append(val, step=step)
        except MetadataInconsistency as e:
            log.warning(f"Failed to log metric {key} with value {val} at step {step} due to: {e}")

We are just wrapping the append() call within a try...except block. This should handle the exception thrown.

Would this work as a band-aid?