grpc/grpc-node

dns_resolver: new IPs aren't added or no DNS refresh

cageyv opened this issue · 2 comments

Problem description

  • dns_resolver doesn't refresh the DNS over time, or we don't know how to make this configuration.
  • We are using round_robin lb_policy_name. In the case of TRANSIENT_FAILURE DNS, get refreshed after 1.11.3 bug fix.
  • If we remove 1 server instance and add it back with a new IP, then it will not be used because dns_resolver will not refresh the DNS
  • This blog post: https://arpittech.medium.com/grpc-and-connection-pooling-49a4137095e7 describes that problem with scaling. Only "The problem" part.

Reproduction steps

const grpc = require("@grpc/grpc-js");
const { GrpcTransport } = require("@protobuf-ts/grpc-transport");
tc = require("./proto-gen/demo/demoapp/v1/demoapp.client")

// Define a function to create a gRPC client with round_robin load balancing
function createGrpcClient(url) {
    const rpcTransport = new GrpcTransport({
        host: url,
        channelCredentials: grpc.credentials.createInsecure(),
        clientOptions: {
            'grpc.lb_policy_name': 'round_robin',
            'grpc.service_config': JSON.stringify({ loadBalancingConfig: [{ round_robin: {} }] })
        }
      });

  return new tc.TaxpnlgraphServiceClient(rpcTransport)
}

For the server-side backends, it is DNS-based load balancing.
In our case, this is golang gRPC server. 5 instances, different IP and multivalve DNS.
Locally, we could use docker-compose aliases

version: '3'
services:
  server:
    image: grpc/java-example-hostname:1.68.1
    restart: always
    networks:
      default:
        aliases:
          - grpc-server.local
  server:
    image: grpc/java-example-hostname:1.68.1
    restart: always
    networks:
      default:
        aliases:
          - grpc-server.local

Environment

  • OS name, version and architecture: public.ecr.aws/docker/library/node:v16-alpine3.17 Apline x64
  • Node version: ^16
  • Node installation method: npm ci
  • Package name and versio: "@grpc/grpc-js": "1.11.3" , "@protobuf-ts/grpc-transport": "2.9.4"

Additional context

What we did:

  • First it was 8 instances backend and we ran the script. Script reach all of them.
  • We redeployed the service and deployment, replaced all 8 backends and script able to reconnect (it was fixed in 1.11.3)
  • We stop/remove 1 instance. Script still connected to 7 of them
  • We run/add 1 instance. Script still connected to 7 of them.
  • Here we expect the DNS refresh and +1 connection
  • Wait additional 5 min. No results.
  • More notes. New tasks: 10.0.172.74, 10.0.131.158; -1 task was 10.0.159.49; +1 task was 10.0.185.163

log.txt

Our general recommendation is that servers should drop connections periodically to signal to clients that they should update name resolution information. In grpc-js, you can do this by using the grpc.max_connection_age_ms and grpc.max_connection_age_grace_ms options on the server. The grpc.max_connection_age_ms should be tuned based on how frequently you expect clients to need to get new DNS resolution information. The grpc.max_connection_age_grace_ms controls how long a server will spend processing requests on a connection after telling the client to stop using that connection, so you should set that based on the longest it generally takes the server to process a request.

As a side note, the grpc.lb_policy_name is obsolete, and the grpc.service_config option is its replacement. There is no reason to specify both.

Thanks for the recommendation. Sounds good.
I will try it. As soon as I have any news I will update that issue