googleapis/python-bigquery

Retry requests.exceptions.ConnectionError

nitishxp opened this issue · 6 comments

Hi Team,

Could you add a retry for this exception? We run this code on Cloud Functions and GKE infrastructure, and from time to time we get these errors:

BigQuery SDK: google-cloud-bigquery==3.18.0

Error Type: <class 'requests.exceptions.ConnectionError'> error: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')) 
Traceback (most recent call last):
  File "/workspace/visionCommon/common.py", line 117, in wrapper
    return func(*args, **kwargs)
  File "/workspace/main.py", line 576, in controller
    current_step = check_eligibility()
  File "/workspace/main.py", line 314, in check_eligibility
    total_rows = service.execute_query(query).result().total_rows
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/cloud/bigquery/job/query.py", line 1595, in result
    do_get_result()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/api_core/retry/retry_unary.py", line 293, in retry_wrapped_func
    return retry_target(
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/api_core/retry/retry_unary.py", line 153, in retry_target
    _retry_error_helper(
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/api_core/retry/retry_base.py", line 212, in _retry_error_helper
    raise final_exc from source_exc
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/api_core/retry/retry_unary.py", line 144, in retry_target
    result = target()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/cloud/bigquery/job/query.py", line 1584, in do_get_result
    super(QueryJob, self).result(retry=retry, timeout=timeout)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/cloud/bigquery/job/base.py", line 971, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/api_core/future/polling.py", line 256, in result
    self._blocking_poll(timeout=timeout, retry=retry, polling=polling)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/cloud/bigquery/job/query.py", line 1326, in _blocking_poll
    super(QueryJob, self)._blocking_poll(timeout=timeout, **kwargs)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/api_core/future/polling.py", line 137, in _blocking_poll
    polling(self._done_or_raise)(retry=retry)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/api_core/retry/retry_unary.py", line 293, in retry_wrapped_func
    return retry_target(
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/api_core/retry/retry_unary.py", line 153, in retry_target
    _retry_error_helper(
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/api_core/retry/retry_base.py", line 212, in _retry_error_helper
    raise final_exc from source_exc
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/api_core/retry/retry_unary.py", line 144, in retry_target
    result = target()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/cloud/bigquery/job/query.py", line 1448, in _done_or_raise
    self._reload_query_results(retry=retry, timeout=transport_timeout)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/cloud/bigquery/job/query.py", line 1429, in _reload_query_results
    self._query_results = self._client._get_query_results(
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/cloud/bigquery/client.py", line 1936, in _get_query_results
    resource = self._call_api(
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/cloud/bigquery/client.py", line 827, in _call_api
    return call()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/cloud/_http/__init__.py", line 482, in api_request
    response = self._make_request(
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/cloud/_http/__init__.py", line 341, in _make_request
    return self._do_request(
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/cloud/_http/__init__.py", line 379, in _do_request
    return self.http.request(
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/google/auth/transport/requests.py", line 541, in request
    response = super(AuthorizedSession, self).request(
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

Looks like we already retry this at the API request level here: https://github.com/googleapis/python-bigquery/blob/main/google/cloud/bigquery/retry.py#L32

It looks like the failure you're seeing is at the "retry the query" layer, which we need to be much more careful about. If your query is not idempotent (e.g. some DML & DDL queries), we don't want to retry without knowing the job has failed.

I would encourage updating to the latest version, as I made sure more of the API requests in the "wait for the query to finish" code path use our API-level retries in #1900.

Hi @tswast, the query is idempotent. How do I force it to retry?
I have started to see this issue more frequently; on the 20th there was a BigQuery incident in the us region.

I would recommend passing in a custom value for the job_retry parameter.

See the definition of `DEFAULT_JOB_RETRY` (`DEFAULT_JOB_RETRY = retry.Retry(...)` in `google/cloud/bigquery/retry.py`) for what it currently is.

That said, I'm not sure why the call to jobs.getQueryResults isn't being retried here. I'll do some more investigation.

I think I'm able to reproduce this at HEAD with the following test:

# Imports assumed by this snippet; `client` and `make_connection` come from
# the repository's unit-test fixtures and helpers.
import freezegun
import requests.exceptions

import google.cloud.bigquery.retry
from google.cloud.bigquery import _job_helpers


def test_retry_connection_error_with_default_retry_and_job_retry(monkeypatch, client):
    """
    Make sure ConnectionError can be retried at `is_job_done` level, even if
    retries are exhausted by the API-level retry.

    Note: Because restart_query_job is set to True only in the case of a
    confirmed job failure, this should be safe to do even when a job is not
    idempotent.

    Regression test for issue
    https://github.com/googleapis/python-bigquery/issues/1929
    """
    job_counter = 0

    def make_job_id(*args, **kwargs):
        nonlocal job_counter
        job_counter += 1
        return f"{job_counter}"

    monkeypatch.setattr(_job_helpers, "make_job_id", make_job_id)
    conn = client._connection = make_connection()
    project = client.project
    job_reference_1 = {"projectId": project, "jobId": "1", "location": "test-loc"}
    NUM_API_RETRIES = 2

    with freezegun.freeze_time(
        "2024-01-01 00:00:00",
        # Note: because of exponential backoff and a bit of jitter,
        # NUM_API_RETRIES will get less accurate the greater the value.
        # We add 1 because we know there will be at least some additional
        # calls to fetch the time / sleep before the retry deadline is hit.
        auto_tick_seconds=(
            google.cloud.bigquery.retry._DEFAULT_RETRY_DEADLINE / NUM_API_RETRIES
        )
        + 1,
    ):
        conn.api_request.side_effect = [
            # jobs.insert
            {"jobReference": job_reference_1, "status": {"state": "PENDING"}},
            # jobs.get
            {"jobReference": job_reference_1, "status": {"state": "RUNNING"}},
            # jobs.getQueryResults x2
            requests.exceptions.ConnectionError(),
            requests.exceptions.ConnectionError(),
            # jobs.get
            # Job actually succeeded, so we shouldn't be restarting the job,
            # even though we are retrying at the `is_job_done` level.
            {"jobReference": job_reference_1, "status": {"state": "DONE"}},
        ]

        job = client.query("select 1")
        job.result()

It never gets to the final jobs.get call, I think because _job_should_retry returns False when the RetryError.cause is of type ConnectionError. Since we have separate logic for deciding when to restart the query versus when to retry at this layer, it should be safe to retry here. That said, if we reach this point it's because some API request has already hit its retry timeout of 600 seconds, so I'm not sure how much the second layer of retries will help.

Fix awaiting review: #1930

@tswast Thank you again for your quick response and resolution to the issue :)