neptune-ai/neptune-client

ZD745: Neptune synchronization throws Unauthorized error

mrembalski opened this issue · 14 comments

After 17 hours of training, the NeptuneAsyncOpProcessor thread received an HTTPUnauthorized: 401 error, which stopped the process (and, in my case, led SLURM to kill the entire training job).

Exception in thread NeptuneAsyncOpProcessor:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/neptune/internal/backends/swagger_client_wrapper.py", line 97, in __call__
    return FinishedApiResponseFuture(future.response())  # wait synchronously
  File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 200, in response
    swagger_result = self._get_swagger_result(incoming_response)
  File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 124, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 300, in _get_swagger_result
    unmarshal_response(
  File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 353, in unmarshal_response
    raise_on_expected(incoming_response)
  File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 420, in raise_on_expected
    raise make_http_exception(
bravado.exception.HTTPUnauthorized: 401

# Environment

neptune            1.4.1
pytorch-lightning  2.0.6
torch              2.1.0.dev20230805+cu121

Singularity> python --version
Python 3.10.10

Hello @mrembalski 👋
Could you share the time (with timezone) when you got this exception?

I will check with engineering and let you know if they need further details.

Hi @SiddhantSadangi,
Probably at about 2023/08/30 10:33:28 (the time shown in Neptune, which is probably in my timezone, GMT+2).

Is this the full traceback? It doesn't look like it: swagger_client_wrapper should never be the top-level caller.

The whole traceback:


slurmstepd: error: _handle_stat_jobacct: Took usec=225274836, which is more than MessageTimeout (200s). The result won't be delivered
Exception in thread NeptuneAsyncOpProcessor:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/neptune/internal/backends/swagger_client_wrapper.py", line 97, in __call__
    return FinishedApiResponseFuture(future.response())  # wait synchronously
  File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 200, in response
    swagger_result = self._get_swagger_result(incoming_response)
  File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 124, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 300, in _get_swagger_result
    unmarshal_response(
  File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 353, in unmarshal_response
    raise_on_expected(incoming_response)
  File "/usr/local/lib/python3.10/site-packages/bravado/http_future.py", line 420, in raise_on_expected
    raise make_http_exception(
bravado.exception.HTTPUnauthorized: 401

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 228, in run
    super().run()
  File "/usr/local/lib/python3.10/site-packages/neptune/internal/threading/daemon.py", line 95, in run
    self.work()
  File "/usr/local/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 244, in work
    self.process_batch([element.obj for element in batch], batch[-1].ver)
  File "/usr/local/lib/python3.10/site-packages/neptune/internal/threading/daemon.py", line 120, in wrapper
    result = func(self_, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 257, in process_batch
    processed_count, errors = self._processor._backend.execute_operations(
  File "/usr/local/lib/python3.10/site-packages/neptune/internal/backends/hosted_neptune_backend.py", line 484, in execute_operations
    self._execute_operations(
  File "/usr/local/lib/python3.10/site-packages/neptune/common/backends/utils.py", line 71, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/neptune/internal/backends/hosted_neptune_backend.py", line 642, in _execute_operations
    result = self.leaderboard_client.api.executeOperations(**kwargs).response().result
  File "/usr/local/lib/python3.10/site-packages/neptune/internal/backends/swagger_client_wrapper.py", line 99, in __call__
    self.handle_neptune_http_errors(e.response, exception=e)
  File "/usr/local/lib/python3.10/site-packages/neptune/internal/backends/swagger_client_wrapper.py", line 88, in handle_neptune_http_errors
    handle_json_errors(
  File "/usr/local/lib/python3.10/site-packages/neptune/api/exceptions_utils.py", line 36, in handle_json_errors
    raise error_processor(content) from source_exception
  File "/usr/local/lib/python3.10/site-packages/neptune/internal/backends/swagger_client_wrapper.py", line 57, in <lambda>
    "AUTHORIZATION_TOKEN_EXPIRED": lambda _: NeptuneAuthTokenExpired(),
  File "/usr/local/lib/python3.10/site-packages/neptune/common/exceptions.py", line 219, in __init__
    super().__init__("Authorization token expired")
TypeError: Unauthorized.__init__() takes 1 positional argument but 2 were given


----NeptuneSynchronizationAlreadyStopped---------------------------------------------------

The synchronization thread had stopped before Neptune could finish uploading the logged metadata.
Your data is stored locally, but you'll need to finish the synchronization manually.
To synchronize with the Neptune servers, enter the following on your command line:

    neptune sync

For details, see https://docs.neptune.ai/api/neptune_sync/

If the synchronization fails, you may want to check your connection and ensure that you're
within limits by going to your Neptune project settings -> Usage.
If the issue persists, our support is happy to help.

Need help?-> https://docs.neptune.ai/getting_help

Traceback (most recent call last):
  File "/experiment/universal_learning_pipeline/ulp/pipeline_utils.py", line 107, in train_and_test
    trainer.fit(model, data_module)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 529, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 41, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 91, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 568, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 973, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1016, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 201, in run
    self.advance()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 354, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 134, in run
    self.on_advance_end()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 248, in on_advance_end
    self.val_loop.run()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 177, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 122, in run
    return self.on_run_end()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 258, in on_run_end
    self._on_evaluation_end()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 303, in _on_evaluation_end
    call._call_callback_hooks(trainer, hook_name, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 193, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 311, in on_validation_end
    self._save_topk_checkpoint(trainer, monitor_candidates)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 362, in _save_topk_checkpoint
    self._save_none_monitor_checkpoint(trainer, monitor_candidates)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 671, in _save_none_monitor_checkpoint
    self._save_checkpoint(trainer, filepath)
  File "/experiment/universal_learning_pipeline/ulp/callbacks/s3_checkpoint.py", line 109, in _save_checkpoint
    self.save_s3_checkpoint_to_neptune_logger(trainer, filepath)
  File "/usr/local/lib/python3.10/site-packages/lightning_utilities/core/rank_zero.py", line 32, in wrapped_fn
    return fn(*args, **kwargs)
  File "/experiment/universal_learning_pipeline/ulp/callbacks/s3_checkpoint.py", line 78, in save_s3_checkpoint_to_neptune_logger
    trainer.logger.run.wait()    # type: ignore
  File "/usr/local/lib/python3.10/site-packages/neptune/metadata_containers/metadata_container.py", line 503, in wait
    self._op_processor.wait()
  File "/usr/local/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 110, in wait
    raise NeptuneSynchronizationAlreadyStoppedException()
neptune.exceptions.NeptuneSynchronizationAlreadyStoppedException:

----NeptuneSynchronizationAlreadyStopped---------------------------------------------------

The synchronization thread had stopped before Neptune could finish uploading the logged metadata.
Your data is stored locally, but you'll need to finish the synchronization manually.
To synchronize with the Neptune servers, enter the following on your command line:

    neptune sync

For details, see https://docs.neptune.ai/api/neptune_sync/

If the synchronization fails, you may want to check your connection and ensure that you're
within limits by going to your Neptune project settings -> Usage.
If the issue persists, our support is happy to help.

Need help?-> https://docs.neptune.ai/getting_help

Error executing job with overrides: ['model.model_config.criterion.weights=[0.7,0.3]', 'model.model_config.lr=0.001', 'model.model_config.optimizer.weight_decay=0.0', '+commit_hash=3040431d9b', '+task_name=aquatic_disability_30404/3', '+timeout=48']
Traceback (most recent call last):
  File "/experiment/universal_learning_pipeline/ulp/run.py", line 80, in <module>
    main()
  File "/usr/local/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/experiment/universal_learning_pipeline/ulp/run.py", line 75, in main
    run(config)
  File "/experiment/universal_learning_pipeline/ulp/run.py", line 69, in run
    pipeline_utils.train_and_test(experiment_config, logger, ckpt_path, resume_from_ckpt=False)
  File "/experiment/universal_learning_pipeline/ulp/pipeline_utils.py", line 128, in train_and_test
    raise exc
  File "/experiment/universal_learning_pipeline/ulp/pipeline_utils.py", line 107, in train_and_test
    trainer.fit(model, data_module)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 529, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 41, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 91, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 568, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 973, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1016, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 201, in run
    self.advance()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 354, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 134, in run
    self.on_advance_end()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 248, in on_advance_end
    self.val_loop.run()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 177, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 122, in run
    return self.on_run_end()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 258, in on_run_end
    self._on_evaluation_end()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 303, in _on_evaluation_end
    call._call_callback_hooks(trainer, hook_name, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 193, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 311, in on_validation_end
    self._save_topk_checkpoint(trainer, monitor_candidates)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 362, in _save_topk_checkpoint
    self._save_none_monitor_checkpoint(trainer, monitor_candidates)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 671, in _save_none_monitor_checkpoint
    self._save_checkpoint(trainer, filepath)
  File "/experiment/universal_learning_pipeline/ulp/callbacks/s3_checkpoint.py", line 109, in _save_checkpoint
    self.save_s3_checkpoint_to_neptune_logger(trainer, filepath)
  File "/usr/local/lib/python3.10/site-packages/lightning_utilities/core/rank_zero.py", line 32, in wrapped_fn
    return fn(*args, **kwargs)
  File "/experiment/universal_learning_pipeline/ulp/callbacks/s3_checkpoint.py", line 78, in save_s3_checkpoint_to_neptune_logger
    trainer.logger.run.wait()    # type: ignore
  File "/usr/local/lib/python3.10/site-packages/neptune/metadata_containers/metadata_container.py", line 503, in wait
    self._op_processor.wait()
  File "/usr/local/lib/python3.10/site-packages/neptune/internal/operation_processors/async_operation_processor.py", line 110, in wait
    raise NeptuneSynchronizationAlreadyStoppedException()
neptune.exceptions.NeptuneSynchronizationAlreadyStoppedException:

----NeptuneSynchronizationAlreadyStopped---------------------------------------------------

The synchronization thread had stopped before Neptune could finish uploading the logged metadata.
Your data is stored locally, but you'll need to finish the synchronization manually.
To synchronize with the Neptune servers, enter the following on your command line:

    neptune sync

For details, see https://docs.neptune.ai/api/neptune_sync/

If the synchronization fails, you may want to check your connection and ensure that you're
within limits by going to your Neptune project settings -> Usage.
If the issue persists, our support is happy to help.

Need help?-> https://docs.neptune.ai/getting_help
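
A note on the `TypeError` near the top of this log: it looks like the pattern where an exception subclass passes a message to a parent `__init__` that accepts no arguments, so the intended auth error is never raised and a `TypeError` surfaces instead. Below is a minimal sketch of that pattern. The class names `Unauthorized` and `AuthTokenExpired` and the helper functions are hypothetical stand-ins written for illustration; they are not the Neptune source, which is not shown in this thread.

```python
class Unauthorized(Exception):
    """Hypothetical base exception whose __init__ accepts no message argument."""

    def __init__(self):
        super().__init__("You have no permission to access this resource.")


class AuthTokenExpired(Unauthorized):
    """Hypothetical subclass that (incorrectly) forwards a message to its parent."""

    def __init__(self):
        # The parent __init__ accepts only `self`, so this call raises TypeError
        # before the exception object is ever constructed.
        super().__init__("Authorization token expired")


def raise_mapped_error():
    """Mimic an error handler that builds a specific exception and raises it."""
    raise AuthTokenExpired()


def demo():
    """Return the name and message of whatever exception actually surfaces."""
    try:
        raise_mapped_error()
    except Unauthorized as exc:
        # Never reached in this sketch: construction fails first.
        return type(exc).__name__, str(exc)
    except TypeError as exc:
        # The TypeError replaces the intended auth error.
        return type(exc).__name__, str(exc)


if __name__ == "__main__":
    print(demo())
```

In this sketch, any caller that expects to catch the auth error (for example, to react to an expired token) never sees it, because the `TypeError` is raised while the exception is being constructed. That would be consistent with the sync thread dying in the log above, though the actual cause in the client may differ.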

Hey @mrembalski. What is your neptune username and the corresponding project name?

@mrembalski - just to add, if you are not comfortable sharing any details here, you can always send them over through chat or support@neptune.ai :)

@SiddhantSadangi I sent an email :)

Thanks... we'll take this forward over email ✅

@SiddhantSadangi
I have encountered the issue again, this time with a newer version of the package. I also sent an email listing the runs it happened to.

Hey @mrembalski,
Thanks for sharing the details. We have already started working on it and will keep you updated via email :)

Hey @mrembalski,

Sorry for the delay here 🙏
We've released Neptune 1.6.3 with a potential fix.

Could you upgrade your version of the client and let us know if this works for you?

Hi @SiddhantSadangi,
I'm upgrading now and will try to report back later today or tomorrow.
Thanks!

Hey @mrembalski,
Just checking whether you were able to give this a shot and whether it works for you now.

Hey @mrembalski,
I am closing this issue for now, but please feel free to reopen it if it is not resolved for you.