ZD745: Neptune synchronization throws Unauthorized error
mrembalski opened this issue · 14 comments
In this case, after 17 hours of training, NeptuneAsyncOpProcessor received an HTTPUnauthorized: 401
error, which stopped the process (and, in my case, SLURM then killed the entire training).
Exception in thread NeptuneAsyncOpProcessor:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/neptune/internal/backends/swagger_client_wrapper.py", line 97, in __call__
return FinishedApiResponseFuture(future.response()) # wait synchronously
File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 200, in response
swagger_result = self._get_swagger_result(incoming_response)
File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 124, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 300, in _get_swagger_result
unmarshal_response(
File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 353, in unmarshal_response
raise_on_expected(incoming_response)
File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 420, in raise_on_expected
raise make_http_exception(
bravado.exception.HTTPUnauthorized: 401
Environment
neptune 1.4.1
pytorch-lightning 2.0.6
torch 2.1.0.dev20230805+cu121
Singularity> python --version
Python 3.10.10
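For anyone hitting the same failure mode, here is a minimal sketch of one way to keep a stopped synchronization thread from taking the whole training job down. It is not a fix for the 401 itself. The helper name `wait_or_warn` and the `FakeRun` / `FakeStoppedError` stand-ins are hypothetical and exist only so the snippet runs without Neptune installed; in a real callback you would presumably pass the run object and `neptune.exceptions.NeptuneSynchronizationAlreadyStoppedException` (the class named in the traceback).

```python
import logging

logger = logging.getLogger(__name__)


def wait_or_warn(run, stopped_exc_types):
    """Call run.wait() and log, rather than re-raise, a stopped sync thread.

    `stopped_exc_types` is the exception class (or tuple of classes) to treat
    as non-fatal. Returns True if the wait completed, False otherwise.
    """
    try:
        run.wait()
        return True
    except stopped_exc_types as exc:
        # Data logged so far stays on disk; it can be uploaded later.
        logger.warning("Sync thread stopped (%s); continuing training.", type(exc).__name__)
        return False


# Hypothetical stand-ins, only so the sketch is runnable without Neptune.
class FakeStoppedError(Exception):
    pass


class FakeRun:
    def __init__(self, fail):
        self.fail = fail

    def wait(self):
        if self.fail:
            raise FakeStoppedError()


if __name__ == "__main__":
    print(wait_or_warn(FakeRun(fail=False), FakeStoppedError))
    print(wait_or_warn(FakeRun(fail=True), FakeStoppedError))
```

Swallowing the exception trades completeness of the remote log for a training run that survives; the locally stored data would still need a manual `neptune sync` afterwards.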
Hello @mrembalski
Could you share the time (with timezone) when you got this exception?
I will check with engineering and let you know if they need further details.
Hi @SiddhantSadangi,
Probably around 2023/08/30 10:33:28 (the time shown on Neptune, which is probably in my timezone, GMT+2).
Is this the full traceback? It doesn't look like it: swagger_client_wrapper
should never be the top-level caller.
The whole traceback:
slurmstepd: error: _handle_stat_jobacct: Took usec=225274836, which is more than MessageTimeout (200s). The result won't be delivered
Exception in thread NeptuneAsyncOpProcessor:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/neptune/internal/backends/swagger_client_wrapper.py", line 97, in __call__
return FinishedApiResponseFuture(future.response()) # wait synchronously
File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 200, in response
swagger_result = self._get_swagger_result(incoming_response)
File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 124, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 300, in _get_swagger_result
unmarshal_response(
File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 353, in unmarshal_response
raise_on_expected(incoming_response)
File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 420, in raise_on_expected
raise make_http_exception(
bravado.exception.HTTPUnauthorized: 401
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 228, in run
super().run()
File "/usr/local/lib/python3.10/site-packages/neptune/internal/threading/daemon.py", line 95, in run
self.work()
File "/usr/local/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 244, in work
self.process_batch([element.obj for element in batch], batch[-1].ver)
File "/usr/local/lib/python3.10/site-packages/neptune/internal/threading/daemon.py", line 120, in wrapper
result = func(self_, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 257, in process_batch
processed_count, errors = self._processor._backend.execute_operations(
File "/usr/local/lib/python3.10/site-packages/neptune/internal/backends/hosted_neptune_backend.py", line 484, in execute_operations
self._execute_operations(
File "/usr/local/lib/python3.10/site-packages/neptune/common/backends/utils.py", line 71, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/neptune/internal/backends/hosted_neptune_backend.py", line 642, in _execute_operations
result = self.leaderboard_client.api.executeOperations(**kwargs).response().result
File "/usr/local/lib/python3.10/site-packages/neptune/internal/backends/swagger_client_wrapper.py", line 99, in __call__
self.handle_neptune_http_errors(e.response, exception=e)
File "/usr/local/lib/python3.10/site-packages/neptune/internal/backends/swagger_client_wrapper.py", line 88, in handle_neptune_http_errors
handle_json_errors(
File "/usr/local/lib/python3.10/site-packages/neptune/api/exceptions_utils.py", line 36, in handle_json_errors
raise error_processor(content) from source_exception
File "/usr/local/lib/python3.10/site-packages/neptune/internal/backends/swagger_client_wrapper.py", line 57, in <lambda>
"AUTHORIZATION_TOKEN_EXPIRED": lambda _: NeptuneAuthTokenExpired(),
File "/usr/local/lib/python3.10/site-packages/neptune/common/exceptions.py", line 219, in __init__
super().__init__("Authorization token expired")
TypeError: Unauthorized.__init__() takes 1 positional argument but 2 were given
----NeptuneSynchronizationAlreadyStopped---------------------------------------------------
The synchronization thread had stopped before Neptune could finish uploading the logged metadata.
Your data is stored locally, but you'll need to finish the synchronization manually.
To synchronize with the Neptune servers, enter the following on your command line:
neptune sync
For details, see https://docs.neptune.ai/api/neptune_sync/
If the synchronization fails, you may want to check your connection and ensure that you're
within limits by going to your Neptune project settings -> Usage.
If the issue persists, our support is happy to help.
Need help?-> https://docs.neptune.ai/getting_help
Traceback (most recent call last):
File "/experiment/universal_learning_pipeline/ulp/pipeline_utils.py", line 107, in train_and_test
trainer.fit(model, data_module)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 529, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 41, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 91, in launch
return function(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 568, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 973, in _run
results = self._run_stage()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1016, in _run_stage
self.fit_loop.run()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 201, in run
self.advance()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 354, in advance
self.epoch_loop.run(self._data_fetcher)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 134, in run
self.on_advance_end()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 248, in on_advance_end
self.val_loop.run()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 177, in _decorator
return loop_run(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 122, in run
return self.on_run_end()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 258, in on_run_end
self._on_evaluation_end()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 303, in _on_evaluation_end
call._call_callback_hooks(trainer, hook_name, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 193, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 311, in on_validation_end
self._save_topk_checkpoint(trainer, monitor_candidates)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 362, in _save_topk_checkpoint
self._save_none_monitor_checkpoint(trainer, monitor_candidates)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 671, in _save_none_monitor_checkpoint
self._save_checkpoint(trainer, filepath)
File "/experiment/universal_learning_pipeline/ulp/callbacks/s3_checkpoint.py", line 109, in _save_checkpoint
self.save_s3_checkpoint_to_neptune_logger(trainer, filepath)
File "/usr/local/lib/python3.10/site-packages/lightning_utilities/core/rank_zero.py", line 32, in wrapped_fn
return fn(*args, **kwargs)
File "/experiment/universal_learning_pipeline/ulp/callbacks/s3_checkpoint.py", line 78, in save_s3_checkpoint_to_neptune_logger
trainer.logger.run.wait() # type: ignore
File "/usr/local/lib/python3.10/site-packages/neptune/metadata_containers/metadata_container.py", line 503, in wait
self._op_processor.wait()
File "/usr/local/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 110, in wait
raise NeptuneSynchronizationAlreadyStoppedException()
neptune.exceptions.NeptuneSynchronizationAlreadyStoppedException:
----NeptuneSynchronizationAlreadyStopped---------------------------------------------------
The synchronization thread had stopped before Neptune could finish uploading the logged metadata.
Your data is stored locally, but you'll need to finish the synchronization manually.
To synchronize with the Neptune servers, enter the following on your command line:
neptune sync
For details, see https://docs.neptune.ai/api/neptune_sync/
If the synchronization fails, you may want to check your connection and ensure that you're
within limits by going to your Neptune project settings -> Usage.
If the issue persists, our support is happy to help.
Need help?-> https://docs.neptune.ai/getting_help
Error executing job with overrides: ['model.model_config.criterion.weights=[0.7,0.3]', 'model.model_config.lr=0.001', 'model.model_config.optimizer.weight_decay=0.0', '+commit_hash=3040431d9b', '+task_name=aquatic_disability_30404/3', '+timeout=48']
Traceback (most recent call last):
File "/experiment/universal_learning_pipeline/ulp/run.py", line 80, in <module>
main()
File "/usr/local/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/usr/local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/usr/local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/usr/local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/usr/local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/usr/local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/usr/local/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/usr/local/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/usr/local/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/experiment/universal_learning_pipeline/ulp/run.py", line 75, in main
run(config)
File "/experiment/universal_learning_pipeline/ulp/run.py", line 69, in run
pipeline_utils.train_and_test(experiment_config, logger, ckpt_path, resume_from_ckpt=False)
File "/experiment/universal_learning_pipeline/ulp/pipeline_utils.py", line 128, in train_and_test
raise exc
File "/experiment/universal_learning_pipeline/ulp/pipeline_utils.py", line 107, in train_and_test
trainer.fit(model, data_module)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 529, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 41, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 91, in launch
return function(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 568, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 973, in _run
results = self._run_stage()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1016, in _run_stage
self.fit_loop.run()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 201, in run
self.advance()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 354, in advance
self.epoch_loop.run(self._data_fetcher)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 134, in run
self.on_advance_end()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 248, in on_advance_end
self.val_loop.run()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 177, in _decorator
return loop_run(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 122, in run
return self.on_run_end()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 258, in on_run_end
self._on_evaluation_end()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 303, in _on_evaluation_end
call._call_callback_hooks(trainer, hook_name, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 193, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 311, in on_validation_end
self._save_topk_checkpoint(trainer, monitor_candidates)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 362, in _save_topk_checkpoint
self._save_none_monitor_checkpoint(trainer, monitor_candidates)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 671, in _save_none_monitor_checkpoint
self._save_checkpoint(trainer, filepath)
File "/experiment/universal_learning_pipeline/ulp/callbacks/s3_checkpoint.py", line 109, in _save_checkpoint
self.save_s3_checkpoint_to_neptune_logger(trainer, filepath)
File "/usr/local/lib/python3.10/site-packages/lightning_utilities/core/rank_zero.py", line 32, in wrapped_fn
return fn(*args, **kwargs)
File "/experiment/universal_learning_pipeline/ulp/callbacks/s3_checkpoint.py", line 78, in save_s3_checkpoint_to_neptune_logger
trainer.logger.run.wait() # type: ignore
File "/usr/local/lib/python3.10/site-packages/neptune/metadata_containers/metadata_container.py", line 503, in wait
self._op_processor.wait()
File "/usr/local/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 110, in wait
raise NeptuneSynchronizationAlreadyStoppedException()
neptune.exceptions.NeptuneSynchronizationAlreadyStoppedException:
----NeptuneSynchronizationAlreadyStopped---------------------------------------------------
The synchronization thread had stopped before Neptune could finish uploading the logged metadata.
Your data is stored locally, but you'll need to finish the synchronization manually.
To synchronize with the Neptune servers, enter the following on your command line:
neptune sync
For details, see https://docs.neptune.ai/api/neptune_sync/
If the synchronization fails, you may want to check your connection and ensure that you're
within limits by going to your Neptune project settings -> Usage.
If the issue persists, our support is happy to help.
Need help?-> https://docs.neptune.ai/getting_help
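The `TypeError` at the end of the first traceback looks like a constructor mismatch between an exception class and its parent: the child passes a message to a parent `__init__` that accepts none. A minimal sketch that reproduces the same kind of error in isolation is below. The two classes are simplified stand-ins written for illustration, not the actual Neptune source, and `try_build` is a hypothetical helper.

```python
class Unauthorized(Exception):
    # Stand-in parent: __init__ takes no message argument.
    def __init__(self):
        super().__init__("Unauthorized")


class AuthTokenExpired(Unauthorized):
    # Mirrors the call shown in the traceback: a message is passed
    # to a parent __init__ that only accepts `self`.
    def __init__(self):
        super().__init__("Authorization token expired")


def try_build():
    """Return the TypeError message raised while constructing the exception."""
    try:
        AuthTokenExpired()
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(try_build())
```

If this reading is right, the expired-token error would never reach whatever handles it, because constructing the exception object itself fails first, and the sync thread dies on the `TypeError` instead.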
Hey @mrembalski. What is your neptune username and the corresponding project name?
@mrembalski - just to add, if you are not comfortable sharing any details here, you can always send them over through chat or support@neptune.ai :)
@SiddhantSadangi I sent an email:)
Thanks... we'll take this forward over email
@SiddhantSadangi
I have run into the issue again, this time with a newer version of the package. I also sent an email listing the affected runs.
Hey @mrembalski ,
Thanks for sharing the details. We have already started working on it, and will keep you updated via email :)
Hey @mrembalski ,
Sorry for the delay here
We've released Neptune 1.6.3 with a potential fix.
Could you upgrade your version of the client and let us know if this works for you?
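A quick way to confirm that the environment a job actually runs in (for example, inside the container) has picked up the upgrade is to check the installed version at startup. This is a rough sketch: the helper names are hypothetical, the version parsing is deliberately simplistic (numeric components only), and it assumes the client is installed under the distribution name `neptune`, as in the environment listing above.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '1.6.3' into (1, 6, 3); stops at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_fix(installed, minimum="1.6.3"):
    """True if `installed` is at least the release mentioned in this thread."""
    return parse_version(installed) >= parse_version(minimum)


def installed_neptune_version():
    """Return the installed version string, or None if the package is absent."""
    try:
        return version("neptune")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_neptune_version()
    if found is None:
        print("neptune is not installed in this environment")
    else:
        print(found, "includes the potential fix:", has_fix(found))
```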
Hi @SiddhantSadangi,
I'm updating, I'll try to give an update later today/tomorrow.
Thanks!
Hey @mrembalski ,
Just checking if you were able to give this a shot, and if it works fine for you now?
Hey @mrembalski ,
I am closing this issue for now, but please feel free to reopen it if it is not resolved for you.