grpc/grpc-node

Simultaneous pings may cause a server to hang

ryojikamei opened this issue · 6 comments

Problem description

When two nodes send messages to each other almost simultaneously, one of the two servers may hang.

Reproduction steps

  1. Git clone the example code: https://github.com/ryojikamei/repro1
  2. cd repro1
  3. node dist/run_2nodes.js
    Run it two or three times and the problem should appear.

Environment

  • OS name, version and architecture: Linux Ubuntu 22.04.1 amd64
  • Node version: 18.20.3
  • Node installation method: n
  • Package name and version: @grpc/grpc-js 1.11.1

Additional context

This example uses a duplex stream, but I recall that the same problem can also occur with unary calls.
However, I have not been able to prepare a reproducible example with unary calls; it is somewhat difficult to reproduce there.

I ran your code several times, and the only errors I see are ECONNREFUSED errors when one client tries to connect before the other server starts. Then the client that failed never recovers, but that's just because you're not creating a new call when you try again. Specifically in this line, you only create a new call if one doesn't already exist, but you don't delete the existing one when it fails.

Thank you for your response.

you don't delete the existing one when it fails.

I was under the impression that a channel that failed to communicate would recover automatically. Was I wrong, and do I have to recreate the channel manually?

According to the log above, server-7022 received a second ping but did not attempt to return a pong. I have no idea why it behaves this way, but in any case I will try rewriting the code to always recreate the channel when it fails.

I did not say that you should recreate the channel. The call is a separate object from the channel. The channel (or more accurately the client that owns a channel) is the object you create here and the call is the object you create here. You should persist the client object for as long as possible, and you need to create a new call every time there is an error. A call represents a single request, and an error indicates that the request is finished.
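For readers who land here with the same confusion, the pattern described above might be sketched roughly as follows. This is only an illustration, not code from the repro repository: `CallHolder` and `createCall` are hypothetical names, and a plain `EventEmitter` stands in for a real streaming call so the snippet runs without a server. With @grpc/grpc-js the factory would presumably be something like `() => client.someStreamingMethod()`, where `client` is created once and kept for the lifetime of the process.

```javascript
const { EventEmitter } = require('node:events');

// Hands out the current call and forgets it once it has finished,
// so the next request gets a fresh call on the same long-lived client.
// `createCall` is a hypothetical factory supplied by the caller.
class CallHolder {
  constructor(createCall) {
    this.createCall = createCall;
    this.call = null;
  }

  // Returns the live call, creating a new one if none exists.
  getCall() {
    if (this.call === null) {
      const call = this.createCall();
      const drop = () => {
        // Only clear the slot if it still holds this particular call.
        if (this.call === call) this.call = null;
      };
      call.on('error', drop); // an error means the request is finished
      call.on('end', drop);   // a normal end of the stream also finishes it
      this.call = call;
    }
    return this.call;
  }
}

// Demo with a stand-in call object (an EventEmitter), not a real gRPC stream.
const demoHolder = new CallHolder(() => new EventEmitter());
const first = demoHolder.getCall();
first.emit('error', new Error('connect ECONNREFUSED'));
const second = demoHolder.getCall();
console.log(first === second); // false
```

The important part is the `drop` handler: the "create only if one does not already exist" check is fine on its own, as long as the stored call is cleared when it errors or ends. The client (and the channel it owns) is never recreated here.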

20240801.txt

Thank you very much. Now it works. I had not correctly understood the difference between a call and a channel.
I had been struggling with this issue for two months and, as a result, could not see my own mistake. My apologies. Please close this issue.

P.S. I have searched Google for documentation on the difference between a call and a channel, but so far I have not found a single hit. I assume the official reference documentation is probably the only source of information. In my native language, at least, I found nothing at all.