googleapis/gax-ruby

GaxError Exception occurred in retry method

Closed this issue · 9 comments

Hi,

While using Gax in google-cloud-language development, both my local testing and our Travis CI build sometimes get the following error. Retrying the call usually succeeds.

The backtrace pasted below is from Job #1433, which should contain all details about the environment.

Thank you!

Google::Gax::RetryError: GaxError Exception occurred in retry method that was not classified as transient, caused by 14:{"created":"@1472169812.007530609","description":"Secure read failed","file":"src/core/lib/security/transport/secure_endpoint.c","file_line":157,"grpc_status":14,"referenced_errors":[{"created":"@1472169812.007501620","description":"EOF","file":"src/core/lib/iomgr/tcp_posix.c","file_line":235}]}
    /home/travis/.rvm/gems/ruby-2.2.5/gems/google-gax-0.4.4/lib/google/gax/api_callable.rb:352:in `rescue in block (2 levels) in retryable'
    /home/travis/.rvm/gems/ruby-2.2.5/gems/google-gax-0.4.4/lib/google/gax/api_callable.rb:346:in `block (2 levels) in retryable'
    /home/travis/.rvm/gems/ruby-2.2.5/gems/google-gax-0.4.4/lib/google/gax/api_callable.rb:345:in `loop'
    /home/travis/.rvm/gems/ruby-2.2.5/gems/google-gax-0.4.4/lib/google/gax/api_callable.rb:345:in `block in retryable'
    /home/travis/.rvm/gems/ruby-2.2.5/gems/google-gax-0.4.4/lib/google/gax/api_callable.rb:264:in `call'
    /home/travis/.rvm/gems/ruby-2.2.5/gems/google-gax-0.4.4/lib/google/gax/api_callable.rb:264:in `block in catch_errors'
    /home/travis/.rvm/gems/ruby-2.2.5/gems/google-gax-0.4.4/lib/google/gax/api_callable.rb:226:in `call'
    /home/travis/.rvm/gems/ruby-2.2.5/gems/google-gax-0.4.4/lib/google/gax/api_callable.rb:226:in `block in create_api_call'
    /home/travis/.rvm/gems/ruby-2.2.5/gems/google-gax-0.4.4/lib/google/gax/api_callable.rb:252:in `call'
    /home/travis/.rvm/gems/ruby-2.2.5/gems/google-gax-0.4.4/lib/google/gax/api_callable.rb:252:in `block in create_api_call'
    /home/travis/build/GoogleCloudPlatform/google-cloud-ruby/google-cloud-language/lib/google/cloud/language/v1beta1/language_service_api.rb:198:in `call'
    /home/travis/build/GoogleCloudPlatform/google-cloud-ruby/google-cloud-language/lib/google/cloud/language/v1beta1/language_service_api.rb:198:in `annotate_text'
    /home/travis/build/GoogleCloudPlatform/google-cloud-ruby/google-cloud-language/lib/google/cloud/language/service.rb:80:in `block in annotate'
    /home/travis/build/GoogleCloudPlatform/google-cloud-ruby/google-cloud-language/lib/google/cloud/language/service.rb:105:in `execute'
    /home/travis/build/GoogleCloudPlatform/google-cloud-ruby/google-cloud-language/lib/google/cloud/language/service.rb:80:in `annotate'
    /home/travis/build/GoogleCloudPlatform/google-cloud-ruby/google-cloud-language/lib/google/cloud/language/project.rb:176:in `annotate'
    /home/travis/build/GoogleCloudPlatform/google-cloud-ruby/google-cloud-language/acceptance/language/text_test.rb:60:in `block (3 levels) in <top (required)>'

Are errors like this to be expected? Do clients need to do anything to handle these errors?

jmuk commented

Hmm. It looks like a gRPC error occurred there (and GAX thinks it's not a retryable error, so it raises an exception).
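For callers who want to handle this themselves, here is a minimal sketch of a client-side retry wrapper. It assumes the error exposes the gRPC status code, which is 14 (UNAVAILABLE) in the backtrace above. `StatusError` and `with_retries` are hypothetical names for illustration, not part of GAX or gRPC, and the retryable codes and delays are assumptions to tune per API.

```ruby
# Hypothetical stand-in for an RPC error that carries a gRPC status code.
# It is NOT a GAX or gRPC class; swap in whatever error your client raises.
class StatusError < StandardError
  attr_reader :code

  def initialize(code, message = "rpc failed")
    super(message)
    @code = code
  end
end

UNAVAILABLE = 14 # gRPC status code seen in the backtrace ("grpc_status":14)

# Runs the block, retrying with exponential backoff while the error code is
# in retryable_codes. Any other error, or the final failed attempt, is re-raised.
def with_retries(max_attempts: 3, base_delay: 0.1, retryable_codes: [UNAVAILABLE])
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StatusError => e
    raise unless retryable_codes.include?(e.code) && attempts < max_attempts
    sleep(base_delay * (2**(attempts - 1))) # 0.1s, 0.2s, 0.4s, ...
    retry
  end
end
```

A caller would wrap the flaky call, for example `with_retries { language.annotate(text) }`, after adapting the `rescue` clause to the error class the client actually raises.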

I've never seen this error in my own local testing, so I am not sure why it happens for you or on Travis. "description":"Secure read failed" looks odd -- it suggests that some exceptional network failure occurred.

@quartzmo has had this error locally, but I've never seen it.

jmuk commented

I looked into the message a bit more. The root cause of the failure seems to be "description":"EOF","file":"src/core/lib/iomgr/tcp_posix.c","file_line":235, which appears at the end of the first line of the error. That happens when recvmsg (in C) returns 0.

According to http://pubs.opengroup.org/onlinepubs/009695399/functions/recvmsg.html,

If no messages are available to be received and the peer has performed an orderly shutdown, recvmsg() shall return 0.

I am not sure why the connection was closed by the peer (i.e. the Google service), but I suspect a network problem on the machine.
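The EOF behaviour quoted above can be reproduced locally. This is a small sketch that uses a connected pair of Unix stream sockets as a stand-in for the TCP connection to the service (an assumption made so that it runs without a network). `read_after_peer_close` is a hypothetical helper name.

```ruby
require "socket"

# Opens a connected pair of stream sockets, closes one end (an orderly
# shutdown by the "peer"), and returns what the other end reads afterwards.
def read_after_peer_close
  client, peer = UNIXSocket.pair
  peer.close        # the peer goes away without sending any data
  client.read(1024) # IO#read with a length returns nil at end of file
ensure
  client&.close
end
```

The read does not raise. It simply reports end of file, which is the condition the gRPC core surfaces as "EOF" in the error above.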

Network trouble on the Google service? Or the client?

jmuk commented

I suspect trouble on the client side.
In the case of Travis, I believe the network connection from the Travis instance is managed and supervised by Travis itself, and they could forcibly close the connection for some reason (for example, because it was taking too long).

jmuk commented

Ugh, I've heard that there were some problems on Google's service side very recently, and they might have caused this issue. The code has been fixed, so please double-check whether it still reproduces.

I can no longer reproduce, and I haven't seen it on Travis CI today either. Thanks for your help!

I'm seeing this error in both error_reporting and logging when running locally and on GAE. The error is intermittent, and it doesn't seem to prevent my gRPC requests from going through successfully.