PPY-66: `MetadataInconsistency` throws a fatal error in `sync` mode | Neptune client crashes when uploading checkpoints if they rotate faster than uploads
Opened this issue · 5 comments
Describe the bug
I'm fairly reliably seeing the following type of exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 297, in run
super().run()
File "/opt/conda/lib/python3.10/site-packages/neptune/internal/threading/daemon.py", line 96, in run
self.work()
File "/opt/conda/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 315, in work
self.process_batch([element.obj for element in batch], batch[-1].ver, batch[-1].at)
File "/opt/conda/lib/python3.10/site-packages/neptune/internal/threading/daemon.py", line 121, in wrapper
result = func(self_, *args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 339, in process_batch
processed_count, errors = self._processor._backend.execute_operations(
File "/opt/conda/lib/python3.10/site-packages/neptune/internal/backends/hosted_neptune_backend.py", line 498, in execute_operations
self._execute_upload_operations_with_400_retry(
File "/opt/conda/lib/python3.10/site-packages/neptune/internal/backends/hosted_neptune_backend.py", line 605, in _execute_upload_operations_with_400_retry
return self._execute_upload_operations(
File "/opt/conda/lib/python3.10/site-packages/neptune/internal/backends/hosted_neptune_backend.py", line 559, in _execute_upload_operations
upload_errors = upload_file_attribute(
File "/opt/conda/lib/python3.10/site-packages/neptune/internal/backends/hosted_file_operations.py", line 119, in upload_file_attribute
_multichunk_upload_with_retry(
File "/opt/conda/lib/python3.10/site-packages/neptune/internal/backends/hosted_file_operations.py", line 290, in _multichunk_upload_with_retry
return _multichunk_upload(upload_entry, swagger_client, query_params, multipart_config, urlset)
File "/opt/conda/lib/python3.10/site-packages/neptune/internal/backends/hosted_file_operations.py", line 335, in _multichunk_upload
for idx, chunk in enumerate(chunker.generate()):
File "/opt/conda/lib/python3.10/site-packages/neptune/common/storage/datastream.py", line 72, in generate
if last_change and last_change < os.stat(self._filename).st_mtime:
FileNotFoundError: [Errno 2] No such file or directory: '/app/.neptune/fiducial-training/FIDS-650/checkpoints/epoch=9-step=100.ckpt'
My training loop takes about 15 s per epoch and is set to run validation every 10 epochs; validation also performs model checkpointing whenever the validation metric improves. At the start of training every validation step improves, so I churn through a lot of checkpoints. Each checkpoint is about 500 MB, and I have it set to save the top 3.
I've also noticed that checkpoints are initially written as .bin files and then seem to get renamed to .ckpt files; perhaps this renaming is happening during upload and causing issues?
I'm using PyTorch Lightning with the built-in ModelCheckpoint callback and the built-in NeptuneLogger integration.
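For context, here's a rough sketch of the setup; the metric name, project name, and values are approximations rather than my exact code:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import NeptuneLogger

# Keep only the 3 best checkpoints (~500 MB each); older ones get deleted as better ones appear.
checkpoint_callback = ModelCheckpoint(
    monitor="val/loss",  # illustrative metric name
    mode="min",
    save_top_k=3,
)

neptune_logger = NeptuneLogger(
    project="my-workspace/fiducial-training",  # placeholder project name
)

trainer = pl.Trainer(
    check_val_every_n_epoch=10,  # validation (and therefore checkpointing) every 10 epochs
    logger=neptune_logger,
    callbacks=[checkpoint_callback],
)
```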
Reproduction
Unfortunately I don't have a shareable reproduction. I can fairly reliably reproduce this in production, though it seems network dependent: yesterday the same code didn't have this problem, but I'm constantly hitting it this morning.
It's fairly severe, as it means I completely lose any real-time metrics.
Expected behavior
All checkpoints would upload and rotate as expected.
Traceback
See above.
Environment
I can't get this right now; please shout if you think you need it!
Hey @ocharles!
Thanks for reporting this.
While we investigate, I can suggest some workarounds:
- Since this seems to be happening because the models are deleted by the ModelCheckpoint() callback before Neptune has had a chance to upload them, you can:
  a. Upload all checkpoints instead of using save_top_k, OR
  b. Pass mode="sync" when creating NeptuneLogger() (or directly to the init_run() method if logging to an existing run). This will log to Neptune in synchronous mode, guaranteeing that the checkpoints have been uploaded to Neptune before proceeding. However, this can significantly increase runtime.
- If you don't want to upload checkpoints to Neptune, you can turn this off by passing log_model_checkpoints=False when creating NeptuneLogger(). (A short sketch of both options follows below.)
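For reference, a minimal sketch of how these options are passed; the project name is a placeholder, and mode is forwarded to neptune.init_run() through the logger's keyword arguments:

```python
from pytorch_lightning.loggers import NeptuneLogger

# Option b: synchronous logging, so each checkpoint upload completes before training proceeds.
sync_logger = NeptuneLogger(
    project="my-workspace/fiducial-training",  # placeholder project name
    mode="sync",  # forwarded to neptune.init_run()
)

# Alternative: keep asynchronous logging but don't upload checkpoints at all.
no_checkpoint_logger = NeptuneLogger(
    project="my-workspace/fiducial-training",  # placeholder project name
    log_model_checkpoints=False,
)
```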
Would any of these be an acceptable workaround for now?
Oh, mode=sync sounds good, I missed that! I would like to keep checkpoints uploaded as Neptune makes it so convenient and it's really nice to have all the information! I guess the ideal behaviour would be that only checkpoints are uploaded synchronously, but I can probably live with a minor hit at the end of validation or whenever I log training steps.
Thanks, I'll try this when I'm back at my computer next week. I really appreciate the (incredibly) speedy response!
> I guess the ideal behaviour would be that only checkpoints are uploaded synchronously
Most of the sync lag comes from uploading checkpoints, so there shouldn't be a noticeable difference between syncing the entire state of the run and syncing just the checkpoints.
Please let me know if using mode="sync" works for you.
I'll be at a company offsite next week and then at NeurIPS, but will look into this once back!
I think this might now be turning what was previously a warning into a fatal exception:
neptune.exceptions.MetadataInconsistency: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 49.0
I know this is known (#733), but I've never seen this cause a crash. Here's the stack trace:
Traceback (most recent call last):
File "/app/run_pipeline.py", line 79, in <module>
main()
File "/app/run_pipeline.py", line 59, in main
checkpoint_path = train_model(
File "/app/train.py", line 243, in train_model
trainer.fit(model, data_module)
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 538, in fit
call._call_and_handle_interrupt(
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 47, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 574, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
results = self._run_stage()
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1025, in _run_stage
self.fit_loop.run()
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 206, in run
self.on_advance_end()
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 392, in on_advance_end
trainer._logger_connector.update_train_epoch_metrics()
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 168, in update_train_epoch_metrics
self.log_metrics(self.metrics["log"])
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 117, in log_metrics
logger.log_metrics(metrics=scalar_metrics, step=step)
File "/opt/conda/lib/python3.10/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loggers/neptune.py", line 60, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loggers/neptune.py", line 460, in log_metrics
self.run[key].append(val, step=step)
File "/opt/conda/lib/python3.10/site-packages/neptune/handler.py", line 88, in inner_fun
return fun(self, *args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/neptune/handler.py", line 404, in append
self.extend(value, steps=step, timestamps=timestamp, wait=wait, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/neptune/handler.py", line 88, in inner_fun
return fun(self, *args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/neptune/handler.py", line 459, in extend
attr.extend(values, steps=steps, timestamps=timestamps, wait=wait, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/neptune/attributes/series/series.py", line 162, in extend
self._enqueue_operation(op, wait=wait)
File "/opt/conda/lib/python3.10/site-packages/neptune/attributes/attribute.py", line 45, in _enqueue_operation
self._container._op_processor.enqueue_operation(operation, wait=wait)
File "/opt/conda/lib/python3.10/site-packages/neptune/internal/utils/disk_utilization.py", line 171, in wrapper
error_handler(max_disk_utilization, func, *args, **kwargs).run()
File "/opt/conda/lib/python3.10/site-packages/neptune/internal/utils/disk_utilization.py", line 102, in run
return self.handle_limit_not_set()
File "/opt/conda/lib/python3.10/site-packages/neptune/internal/utils/disk_utilization.py", line 146, in handle_limit_not_set
return self.func(*self.args, **self.kwargs)
File "/opt/conda/lib/python3.10/site-packages/neptune/internal/operation_processors/sync_operation_processor.py", line 82, in enqueue_operation
raise errors[0]
neptune.exceptions.MetadataInconsistency: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 49.0
Thanks for bringing this issue to our notice. It should be a warning and not a fatal error. I was able to reproduce it and have created a ticket for the team to look into it.
Meanwhile, would you be open to slightly changing the Lightning source code to handle this error temporarily?
In the lightning > pytorch > loggers > neptune.py file, update the log_metrics() function as follows:
from neptune.exceptions import MetadataInconsistency

for key, val in metrics.items():
    try:
        self.run[key].append(val, step=step)
    except MetadataInconsistency as e:
        log.warning(f"Failed to log metric {key} with value {val} at step {step} due to: {e}")
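(Here log is assumed to be the module-level logger that Lightning's neptune.py already defines via logging.getLogger(__name__).)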
We are just wrapping the append() call within a try...except block. This should handle the exception thrown.
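If editing the installed package is inconvenient, a similar band-aid (just a sketch, not an official API; the class name is made up) is to subclass NeptuneLogger and catch the error around the whole log_metrics() call. Note that this skips the entire metrics dict for that step if any single series fails, unlike the per-key patch above:

```python
import logging

from neptune.exceptions import MetadataInconsistency
from pytorch_lightning.loggers import NeptuneLogger

log = logging.getLogger(__name__)


class TolerantNeptuneLogger(NeptuneLogger):
    """NeptuneLogger that downgrades MetadataInconsistency errors to warnings."""

    def log_metrics(self, metrics, step=None):
        try:
            super().log_metrics(metrics, step=step)
        except MetadataInconsistency as e:
            log.warning(f"Skipped logging metrics {sorted(metrics)} at step {step}: {e}")
```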
Would this work as a band-aid?