aws/aws-sdk-ruby

aws-sdk-sqs raises NoMethodError when messages not found

t-kinoshita opened this issue · 14 comments

Describe the bug

The error occasionally occurs inside QueuePoller#poll.

The error message indicates messages is nil.

Expected Behavior

No error happens.

Current Behavior

 undefined method `empty?' for nil:NilClass
  /opt/rubies/ruby-2.7.8/lib/ruby/gems/2.7.0/gems/aws-sdk-sqs-1.67.0/lib/aws-sdk-sqs/queue_poller.rb:358:in `block (2 levels) in poll'

Reproduction Steps

Needs more research

Possible Solution

No response

Additional Information/Context

The errors started on Nov 18, so something may have changed on the AWS side around that time.

Gem name ('aws-sdk', 'aws-sdk-resources' or service gems like 'aws-sdk-s3') and its version

aws-sdk-sqs-1.67.0

Environment details (Version of Ruby, OS environment)

ruby-2.7.8 on Amazon Linux 2

We're seeing very similar issues here, but for us it started today at roughly 7:24am UTC. This is the second SQS issue we've had to deal with this weekend. First, on Friday night, Shoryuken started failing on empty receives on this line, because the messages response was intermittently coming back as nil instead of []. We had to monkey patch it to check for nil responses (even though the AWS SDK states that it should return an empty array). These issues are popping up without any changes on our end, i.e. no deployments occurred when the errors started.
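For anyone needing a similar stopgap, a nil guard of the kind described above might look roughly like this. It is only a sketch: `FakeResponse` and `safe_messages` are hypothetical names (not part of Shoryuken or the SDK), and the Struct merely stands in for a receive_message response.

```ruby
# Hypothetical stand-in for an SQS receive_message response object.
FakeResponse = Struct.new(:messages)

# Hypothetical helper: treat a nil `messages` value as an empty list,
# so callers can keep calling `empty?` or `each` on the result.
def safe_messages(response)
  Array(response.messages)
end

puts safe_messages(FakeResponse.new(nil)).empty?   # true
puts safe_messages(FakeResponse.new(['m1'])).length # 1
```

`Array(nil)` returns an empty array and leaves an existing array untouched, which is why it works as a guard here.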

Thanks for reporting this. SQS recently changed its wire protocol from Query to JSON. Are you able to reproduce with http_wire_trace: true as a client option, to see if the service is returning messages at all? Does rolling back the gem version solve the issue?

We believe this is related to the protocol change from Query to AWS JSON: eb6ac8c

The SDK is behaving correctly for the AWS JSON protocol: https://github.com/smithy-lang/smithy/blob/main/smithy-aws-protocol-tests/model/awsJson1_1 but I believe this is a change in behavior from the previous Query protocol.

I'm a bit confused: is the gem not automatically using the JSON protocol in 1.67+? Or is there a flag that needs to be set to enable it?

FYI downgrading to gem version 1.65 appears to have fixed the issue for us.

The service accepts both formats. The older version of the gem initiates requests in the old format (Query), so the service responds that way. The new version of the gem initiates requests in the new format (JSON), and the service responds that way, too. The fix is unclear and a few SDKs are affected; we are currently deliberating the correct approach. In the meantime, please use the older version.

Thank you for the explanation. Can you explain why we saw the issue start at two seemingly random times over the weekend? Was there a slow rollout of responding with the new format on the back end? Otherwise, I would have expected this to pop up right away when we updated to the new gem.

I lost a good chunk of my Saturday attempting to diagnose this issue, eventually monkeypatching the same line in Shoryuken that @mscrivo linked above. Appreciate the info about downgrading to 1.65.

After switching protocols, SQS started sending a body like { "Messages": [] }, which worked in the Ruby SDK but broke other SDKs like Java. Over last weekend, SQS deployed a change that made the value null, "fixing the issue", so the body would come back as {}. The Ruby SDK then parsed this as nil messages.
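To illustrate the difference between the two bodies, here is a small sketch using only Ruby's standard json library. It assumes the response key is `Messages`; the helper `messages_from` is hypothetical and is not how the SDK itself parses responses.

```ruby
require 'json'

# Hypothetical helper: read the message list from a JSON response body,
# falling back to an empty array when the key is absent or null.
def messages_from(body)
  JSON.parse(body)['Messages'] || []
end

before = '{ "Messages": [] }' # body shape described before the service change
after  = '{}'                 # body shape described after the service change

p JSON.parse(after)['Messages'] # nil, because the key is simply missing
p messages_from(before)         # []
p messages_from(after)          # []
```

A plain hash lookup on the second body yields nil, which is what surfaced as the `empty?` NoMethodError; defaulting to an empty array restores the old behavior.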

After discussion within the greater SDK team, we concluded that the Ruby SDK's behavior of defaulting to an empty list was not correct, but we must preserve it. I have a fix out in #2948 that is pending review and protocol tests (this kind of change is considered high risk).

โš ๏ธCOMMENT VISIBILITY WARNINGโš ๏ธ

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

A fix should be shipped in the next couple hours in core version 3.187.1

The gem has been released; please upgrade and let me know if it works. I'm sorry for any trouble this caused (and hopefully my change does not cause trouble too!). The SQS protocol change was very high risk and unfortunately I'm just along for the ride.
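If it helps when verifying an upgrade, a version string can be compared against the release named above with RubyGems' own `Gem::Version`. This is a sketch only: `includes_fix?` and `FIXED_CORE_VERSION` are hypothetical names, and the threshold is simply the core version mentioned in this thread.

```ruby
require 'rubygems'

# Core version named in this thread as carrying the fix.
FIXED_CORE_VERSION = Gem::Version.new('3.187.1')

# Hypothetical helper: true when the given aws-sdk-core version string
# is at or above the release that is expected to contain the fix.
def includes_fix?(version_string)
  Gem::Version.new(version_string) >= FIXED_CORE_VERSION
end

puts includes_fix?('3.187.0') # false
puts includes_fix?('3.187.1') # true
puts includes_fix?('3.190.0') # true
```

In a running application, the version string could presumably be taken from `Gem.loaded_specs['aws-sdk-core']`, assuming the gem is loaded through RubyGems or Bundler.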