GoogleCloudPlatform/cloud-sql-python-connector

Race condition in Connector destructor

jackwotherspoon opened this issue · 1 comments

It seems that #985 may have introduced a race condition during garbage collection. (No release containing this bug has been published yet.)

If Connector.close is never called explicitly during program execution and cleanup is left to garbage collection, the following error sometimes occurs when the program exits and the Connector's __del__ destructor runs.

Exception ignored in: <function Connector.__del__ at 0x7fa28e575c60>
Traceback (most recent call last):
  File "/usr/local/cloud-sql-python-connector/google/cloud/sql/connector/connector.py", line 353, in __del__
    self.close()
  File "/usr/local/cloud-sql-python-connector/google/cloud/sql/connector/connector.py", line 333, in close
    close_future.result(timeout=5)
  File "/usr/local/.pyenv/versions/3.12.1/lib/python3.12/concurrent/futures/_base.py", line 458, in result
    raise TimeoutError()
TimeoutError: 

I haven't dug too deep, but what I believe may be happening is a race between the garbage collection of Instance() and Connector(). The Instance class currently holds the aiohttp.ClientSession attribute that must be closed. Connector.close calls Instance.close for every instance and attempts to close the client sessions. However, if an Instance is garbage collected first, the async nature of the client session leaves Connector.close() unable to close it, so the call hangs and then times out, as seen in the error above.
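To make the failure mode concrete, here is a minimal sketch of the call pattern visible in the traceback: a synchronous close() that schedules async cleanup on a background event loop and waits on the future with a timeout. It is not the connector's actual implementation and does not reproduce the garbage collection ordering. `SketchConnector`, `try_close`, and the `hang` flag are hypothetical names used only to simulate a cleanup step that never completes.

```python
import asyncio
import concurrent.futures
import threading


class SketchConnector:
    """Hypothetical stand-in that owns a background event loop thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        # Daemon thread so a still-running loop does not block interpreter exit.
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()
        self.closed = False

    async def _close_async(self, hang: bool) -> None:
        if hang:
            # Assumption: models a session close that can never finish.
            await asyncio.sleep(3600)
        self.closed = True

    def close(self, timeout: float = 0.2, hang: bool = False) -> None:
        # Schedule the async cleanup on the background loop and block on it.
        close_future = asyncio.run_coroutine_threadsafe(
            self._close_async(hang), self._loop
        )
        try:
            close_future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            # Ask the loop to cancel the stuck cleanup before re-raising.
            close_future.cancel()
            raise


def try_close(connector: SketchConnector, hang: bool) -> str:
    """Report whether close() finished or hit its timeout."""
    try:
        connector.close(hang=hang)
        return "closed"
    except concurrent.futures.TimeoutError:
        return "timed out"


if __name__ == "__main__":
    print(try_close(SketchConnector(), hang=False))
    print(try_close(SketchConnector(), hang=True))
```

If the hypothesis holds, a close() driven from __del__ would resemble the `hang=True` path, while an explicit close() made while every resource is still alive would resemble the `hang=False` path.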

This error does not occur in the AlloyDB Python Connector, which is why the hypothesis above may hold. Hopefully, moving the client out of the Instance and up to the Connector level as part of #873 will resolve this.

@enocom I consider this a p1 that blocks the next release, as it will likely cause certain environments and programs to fail on exit.