Azure/azure-cosmos-python

No retries on non-HTTP network issues

Closed this issue · 3 comments

When an HTTP error occurs, e.g. 429, then everything is handled as expected, triggering retries etc. and using the retry options of the connection policy of pydocumentdb.

However, if there is a network error such that a server is not reachable, then this results in an immediate exception without retries. This is because of two things:

  1. pydocumentdb's retry_utility code only handles errors.HTTPFailure errors, which are HTTP errors corresponding to certain HTTP status codes, e.g. 429: https://github.com/Azure/azure-documentdb-python/blob/07e2f3f93ad5abeb114c2d2f83577c25d18f0bb4/pydocumentdb/retry_utility.py#L66

  2. pydocumentdb uses requests to do the actual network requests, however it sets up the requests session with the defaults only which doesn't enable retrying: https://github.com/Azure/azure-documentdb-python/blob/07e2f3f93ad5abeb114c2d2f83577c25d18f0bb4/pydocumentdb/document_client.py#L134 This then in turn leads to the underlying urllib3 not to retry such requests: https://github.com/urllib3/urllib3/blob/1.19.1/urllib3/util/retry.py#L331-L336 (read would be false with default options).

An approach as described in https://www.peterbe.com/plog/best-practice-with-retries-with-requests is typically used to enable retries with requests:

import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry


def requests_retry_session(
    retries=3,
    backoff_factor=0.3,
    status_forcelist=(500, 502, 504),
    session=None,
):
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

# Usage example...

response = requests_retry_session().get('https://www.peterbe.com/')
print(response.status_code)

s = requests.Session()
s.auth = ('user', 'pass')
s.headers.update({'x-test': 'true'})

response = requests_retry_session(session=s).get(
    'https://www.peterbe.com'
)

The following is an exception trace resulting from trying to create a document in CosmosDB when the server is unreachable:

Traceback (most recent call last):
  File "C:\Python36\lib\site-packages\urllib3\connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "C:\Python36\lib\site-packages\urllib3\connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Python36\lib\http\client.py", line 1331, in getresponse
    response.begin()
  File "C:\Python36\lib\http\client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "C:\Python36\lib\http\client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "C:\Python36\lib\socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "C:\Python36\lib\ssl.py", line 1009, in recv_into
    return self.read(nbytes, buffer)
  File "C:\Python36\lib\ssl.py", line 871, in read
    return self._sslobj.read(len, buffer)
  File "C:\Python36\lib\ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python36\lib\site-packages\requests\adapters.py", line 445, in send
    timeout=timeout
  File "C:\Python36\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "C:\Python36\lib\site-packages\urllib3\util\retry.py", line 367, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "C:\Python36\lib\site-packages\urllib3\packages\six.py", line 686, in reraise
    raise value
  File "C:\Python36\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "C:\Python36\lib\site-packages\urllib3\connectionpool.py", line 386, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "C:\Python36\lib\site-packages\urllib3\connectionpool.py", line 306, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='....documents.azure.com', port=443): Read timed out. (read timeout=60.0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
...
  File "C:\Python36\lib\site-packages\pydocumentdb\document_client.py", line 947, in CreateDocument
    options)
  File "C:\Python36\lib\site-packages\pydocumentdb\document_client.py", line 2365, in Create
    headers)
  File "C:\Python36\lib\site-packages\pydocumentdb\document_client.py", line 2571, in __Post
    headers=headers)
  File "C:\Python36\lib\site-packages\pydocumentdb\synchronized_request.py", line 212, in SynchronizedRequest
    return retry_utility._Execute(client, global_endpoint_manager, _Request, connection_policy, requests_session, resource_url, request_options, request_body)
  File "C:\Python36\lib\site-packages\pydocumentdb\retry_utility.py", line 56, in _Execute
    result = _ExecuteFunction(function, *args, **kwargs)
  File "C:\Python36\lib\site-packages\pydocumentdb\retry_utility.py", line 92, in _ExecuteFunction
    return function(*args, **kwargs)
  File "C:\Python36\lib\site-packages\pydocumentdb\synchronized_request.py", line 127, in _Request
    verify = is_ssl_enabled)
  File "C:\Python36\lib\site-packages\requests\sessions.py", line 512, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python36\lib\site-packages\requests\sessions.py", line 622, in send
    r = adapter.send(request, **kwargs)
  File "C:\Python36\lib\site-packages\requests\adapters.py", line 526, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='....documents.azure.com', port=443): Read timed out. (read timeout=60.0)

We are also encountering this issue. The default requests is currently set to 60 and there is no way to change this or pass in retry logic.

Hi @srinathnarayanan can you take a look at this issue? It is currently affecting our application in production. I can provide more info... I looked at the next version at https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py and I don't see that this has been resolved there either. So probably this issue needs to be taken into the next version.

The Fix is present in V 3.1.2 and V4.0.0b4 of azure-cosmos