globus/globus-compute

Renewing auth token fails in multi-threaded applications

WardLT opened this issue · 6 comments

Describe the bug
Renewing tokens raises a SQLite threading error when the tokens are updated in my multiprocess/multithreaded application (yes, I'm doing some stuff that Python isn't fantastic at). See the stacktrace below.

To Reproduce
TBD. I'm not sure how to trigger a renewal process besides waiting a few days.

Expected behavior
Renewal happens without problems.

Environment

  • OS: CentOS @ client
  • OS & Container technology: CentOS
  • Python version @ 3.8
  • funcx version v1.0.0 @ client
  • funcx-endpoint v1.0.1 @ endpoint

Distributed Environment

  • Where are you running the funcX script from? Login node
  • Where does the endpoint run? ALCF Theta
  • What is your endpoint-uuid? ef78ac4c-413c-4fdc-8cff-fbbec2f352a5

Stacktrace

2022-08-25 22:19:00,474 - FuncX-Poller-Thread (139886623184640) - globus_sdk.authorizers.renewing - INFO - RenewingAuthorizer.access_token updated to token with hash "2fecd6dd88b269ebea1d1e1e2d18350c9aaa728bbc26dbcac3d4dbcdd9fc7d6c"
Exception in thread FuncX-Poller-Thread:
Traceback (most recent call last):
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env/lib/python3.8/site-packages/funcx/sdk/executor.py", line 474, in event_loop_thread
    self.eventloop.run_until_complete(self.web_socket_poller())
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env/lib/python3.8/site-packages/funcx/sdk/executor.py", line 481, in web_socket_poller
    await self.ws_handler.init_ws(start_message_handlers=False)
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env/lib/python3.8/site-packages/funcx/sdk/asynchronous/ws_polling_task.py", line 104, in init_ws
    headers = [self.get_auth_header()]
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env/lib/python3.8/site-packages/funcx/sdk/asynchronous/ws_polling_task.py", line 286, in get_auth_header
    authz_value = self.funcx_client.web_client.authorizer.get_authorization_header()
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env/lib/python3.8/site-packages/globus_sdk/authorizers/renewing.py", line 167, in get_authorization_header
    self.ensure_valid_token()
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env/lib/python3.8/site-packages/globus_sdk/authorizers/renewing.py", line 161, in ensure_valid_token
    self._get_new_access_token()
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env/lib/python3.8/site-packages/globus_sdk/authorizers/renewing.py", line 134, in _get_new_access_token
    self.on_refresh(res)
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env/lib/python3.8/site-packages/globus_sdk/tokenstorage/base.py", line 31, in on_refresh
    self.store(token_response)
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env/lib/python3.8/site-packages/globus_sdk/tokenstorage/sqlite_adapter.py", line 158, in store
    self._connection.executemany(
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 139887974459200 and this is thread id 139886623184640.
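
The final error can be reproduced without funcX or globus-sdk, using only the standard library. The sketch below is a minimal stand-in, not the SDK's storage code: the `tokens` table and `store_from_other_thread` are made-up names for illustration. It opens a connection in the main thread and then writes to it from a second thread, which is the pattern the last frame of the trace shows.

```python
import sqlite3
import threading

# Connection created in the main thread, as a token storage object might be.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (name TEXT, value TEXT)")

errors = []


def store_from_other_thread():
    # Using the main thread's connection here trips sqlite3's ownership check.
    try:
        conn.executemany(
            "INSERT INTO tokens VALUES (?, ?)", [("access_token", "abc")]
        )
    except sqlite3.ProgrammingError as exc:
        errors.append(exc)


worker = threading.Thread(target=store_from_other_thread)
worker.start()
worker.join()

print(errors[0])
```

The check is on by default because `sqlite3.connect` defaults to `check_same_thread=True`.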

globus-sdk clients are not threadsafe because they contain unsafe components. Namely: urllib3 connection pools. As a result, it has never seemed worthwhile to ensure that other SDK components are threadsafe -- therefore, I would not consider it a bug that tokenstorage is not threadsafe.

In the near term, I would expect that things work fine if you create a distinct client per thread, which is recommended by the SDK docs and therefore probably ought to be recommended by funcx docs. They'll share underlying storage, but they'll refresh independently. I have some thoughts about how that might go wrong, but they're mostly theoretical.
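
A rough sketch of the "distinct client per thread" pattern, assuming a client can be built by a zero-argument factory. `DummyClient` and `get_client` are hypothetical names used only for illustration; in real code the factory would construct whatever client the application uses.

```python
import threading


class DummyClient:
    """Hypothetical stand-in for a real client; records its creating thread."""

    def __init__(self):
        self.owner = threading.current_thread().name


_local = threading.local()


def get_client(factory=DummyClient):
    """Return the calling thread's own client, building it on first use."""
    client = getattr(_local, "client", None)
    if client is None:
        client = factory()
        _local.client = client
    return client


clients = []


def work():
    # Each thread gets (and reuses) its own client object.
    clients.append(get_client())


threads = [threading.Thread(target=work, name=f"worker-{i}") for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(c.owner for c in clients))
```

Repeated calls to `get_client()` from the same thread return the same object, so each thread pays the construction cost once.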

You can manually trigger refreshes by reaching into a client's authorizer object, but I'd put that down as "dis-recommended". I can explain more if you want to do it in order to test behaviors; just let me know.

Good to know. I'll revise my tooling to create a new client for the new process.

Shall I leave this issue open as a reminder to update the documentation?

Since I'm no longer actively involved in the project it's not really mine to say, but I would leave it open as a doc issue.

The globus-sdk docs cover this in a tiny note in an easy-to-miss place, but that at least means the information is recorded somewhere. I'd suggest the same basic attitude here: funcx maintainers ought not to worry too much about making sure the relevant documentation is easily discoverable, but it should still exist.

I'm actually a little concerned the "multiple threads" thing is not my doing. I use multiprocessing and only operate on the FuncX client on a single thread from each process. Does that still sound problematic?

Also, looking at the stack trace, the FuncXExecutor is multithreaded. Could the problem be originating from FuncX?

> Could the problem be originating from FuncX?

It could be -- and I think it is (more detail below). It most likely means that the behavior has been unsafe for some time, but the sqlite library is the first thing to explicitly notice it and force a failure.

Multiprocess usage should be generally safe -- once you've forked, the whole memory space is effectively copied into new objects. That eliminates concerns about threadsafety.

Looking at the trace more carefully, it appears that this is coming from the background threading done in the executor to handle async websockets client code. These lines in particular stand out as "model breaking" for the globus-sdk:

  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env/lib/python3.8/site-packages/funcx/sdk/asynchronous/ws_polling_task.py", line 104, in init_ws
    headers = [self.get_auth_header()]
  File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env/lib/python3.8/site-packages/funcx/sdk/asynchronous/ws_polling_task.py", line 286, in get_auth_header
    authz_value = self.funcx_client.web_client.authorizer.get_authorization_header()

The authorizer object can be used in this way, but it's well outside of the SDK's target/supported use-cases.
In this particular case, it seems that a client and authorizer from another thread are being used, which runs into the failure.

I was surprised at first that this isn't happening more often. My guess, which I haven't investigated yet, is that this codepath only fails if the refresh is actually triggered from the background thread. If the thread where the client was created has already refreshed the tokens on that authorizer, the background thread never touches storage and it probably works fine.
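
To make that guess concrete, here is a toy model. `ToyAuthorizer` is invented for this sketch and is not the globus-sdk `RenewingAuthorizer`; it only assumes that storage is touched on the refresh path and nowhere else.

```python
import sqlite3
import threading
import time


class ToyAuthorizer:
    """Toy model only: writes to storage when, and only when, it refreshes."""

    def __init__(self, conn, lifetime):
        self.conn = conn
        self.lifetime = lifetime
        self.expires_at = 0.0  # no valid token yet, so first use refreshes

    def get_header(self):
        if time.monotonic() >= self.expires_at:
            # Refresh path: the only place the SQLite connection is used.
            self.conn.execute("INSERT INTO tokens VALUES ('new-token')")
            self.expires_at = time.monotonic() + self.lifetime
        return "Bearer new-token"


def call_in_thread(authorizer):
    """Call get_header() from a background thread and report what happened."""
    outcome = []

    def target():
        try:
            authorizer.get_header()
            outcome.append("ok")
        except sqlite3.ProgrammingError:
            outcome.append("ProgrammingError")

    t = threading.Thread(target=target)
    t.start()
    t.join()
    return outcome[0]


conn = sqlite3.connect(":memory:")  # owned by the main thread
conn.execute("CREATE TABLE tokens (value TEXT)")

fresh = ToyAuthorizer(conn, lifetime=3600)
fresh.get_header()                    # refresh happens in the owning thread
fresh_result = call_in_thread(fresh)  # token still valid, storage untouched

stale = ToyAuthorizer(conn, lifetime=3600)
stale_result = call_in_thread(stale)  # refresh happens in the background thread

print(fresh_result, stale_result)
```

Under these assumptions the background thread only fails when it is the one that performs the refresh, which would explain why the error shows up rarely.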

Could we solve this by putting locks over the dangerous funcx_client interactions? Or, should we assume any interaction with the Globus SDK is not threadsafe and make a thread-local clone?
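
For what it's worth, a small experiment with plain `sqlite3` suggests a lock alone would not help with this particular error: the check is about which thread created the connection, not about concurrent use. The sketch below is not funcX or globus-sdk code, and the `tokens` table and function names are made up. It compares a lock around a shared connection with a per-thread connection to the same file.

```python
import os
import sqlite3
import tempfile
import threading

db_path = os.path.join(tempfile.mkdtemp(), "tokens.db")

# Connection owned by the main thread.
shared = sqlite3.connect(db_path)
shared.execute("CREATE TABLE tokens (value TEXT)")
shared.commit()

lock = threading.Lock()
results = {}


def use_shared_with_lock():
    # The lock serializes access, but the connection is still foreign here.
    with lock:
        try:
            shared.execute("INSERT INTO tokens VALUES ('from-lock')")
            results["lock_only"] = "ok"
        except sqlite3.ProgrammingError:
            results["lock_only"] = "ProgrammingError"


def use_own_connection():
    # "Thread-local clone": open a separate connection to the same file.
    own = sqlite3.connect(db_path)
    try:
        own.execute("INSERT INTO tokens VALUES ('from-clone')")
        own.commit()
        results["own_connection"] = "ok"
    finally:
        own.close()


for target in (use_shared_with_lock, use_own_connection):
    t = threading.Thread(target=target)
    t.start()
    t.join()

rows = shared.execute("SELECT value FROM tokens").fetchall()
print(results, rows)
```

This only speaks to the SQLite layer. Whether other parts of the client are safe behind a lock is a separate question that this sketch does not answer.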