dns_resolver: new IPs aren't added or no DNS refresh

Question

dns_resolver: new IPs aren't added or no DNS refresh

cageyv opened this issue 2 months ago · 2 comments

Problem description

dns_resolver doesn't refresh the DNS over time, or we don't know how to make this configuration.
We are using round_robin lb_policy_name. In the case of TRANSIENT_FAILURE DNS, get refreshed after 1.11.3 bug fix.
If we remove 1 server instance and add it back with a new IP, then it will not be used because dns_resolver will not refresh the DNS
This blog post: https://arpittech.medium.com/grpc-and-connection-pooling-49a4137095e7 describes that problem with scaling. Only "The problem" part.

Reproduction steps

const grpc = require("@grpc/grpc-js");
const { GrpcTransport } = require("@protobuf-ts/grpc-transport");
tc = require("./proto-gen/demo/demoapp/v1/demoapp.client")

// Define a function to create a gRPC client with round_robin load balancing
function createGrpcClient(url) {
    const rpcTransport = new GrpcTransport({
        host: url,
        channelCredentials: grpc.credentials.createInsecure(),
        clientOptions: {
            'grpc.lb_policy_name': 'round_robin',
            'grpc.service_config': JSON.stringify({ loadBalancingConfig: [{ round_robin: {} }] })
        }
      });

  return new tc.TaxpnlgraphServiceClient(rpcTransport)
}

For the server-side backends, it is DNS-based load balancing.
In our case, this is golang gRPC server. 5 instances, different IP and multivalve DNS.
Locally, we could use docker-compose aliases

version: '3'
services:
  server:
    image: grpc/java-example-hostname:1.68.1
    restart: always
    networks:
      default:
        aliases:
          - grpc-server.local
  server:
    image: grpc/java-example-hostname:1.68.1
    restart: always
    networks:
      default:
        aliases:
          - grpc-server.local

Environment

OS name, version and architecture: public.ecr.aws/docker/library/node:v16-alpine3.17 Apline x64
Node version: ^16
Node installation method: npm ci
Package name and versio: "@grpc/grpc-js": "1.11.3" , "@protobuf-ts/grpc-transport": "2.9.4"

Additional context

What we did:

First it was 8 instances backend and we ran the script. Script reach all of them.
We redeployed the service and deployment, replaced all 8 backends and script able to reconnect (it was fixed in 1.11.3)
We stop/remove 1 instance. Script still connected to 7 of them
We run/add 1 instance. Script still connected to 7 of them.
Here we expect the DNS refresh and +1 connection
Wait additional 5 min. No results.
More notes. New tasks: 10.0.172.74, 10.0.131.158; -1 task was 10.0.159.49; +1 task was 10.0.185.163

log.txt

Answer 1 · 2024-11-12T19:32:38.000Z

Our general recommendation is that servers should drop connections periodically to signal to clients that they should update name resolution information. In grpc-js, you can do this by using the grpc.max_connection_age_ms and grpc.max_connection_age_grace_ms options on the server. The grpc.max_connection_age_ms should be tuned based on how frequently you expect clients to need to get new DNS resolution information. The grpc.max_connection_age_grace_ms controls how long a server will spend processing requests on a connection after telling the client to stop using that connection, so you should set that based on the longest it generally takes the server to process a request.

As a side note, the grpc.lb_policy_name is obsolete, and the grpc.service_config option is its replacement. There is no reason to specify both.

Answer 2 · 2024-11-12T20:00:12.000Z

Thanks for the recommendation. Sounds good.
I will try it. As soon as I have any news I will update that issue