dns_resolver: new IPs aren't added or no DNS refresh
cageyv opened this issue · 2 comments
Problem description
dns_resolver doesn't refresh DNS over time, or we don't know how to configure it to do so.
- We are using the round_robin lb_policy_name. In the TRANSIENT_FAILURE case, DNS does get refreshed since the 1.11.3 bug fix.
- If we remove 1 server instance and add it back with a new IP, the new instance is not used because dns_resolver does not refresh the DNS.
- This blog post describes the same problem with scaling (only "The problem" part): https://arpittech.medium.com/grpc-and-connection-pooling-49a4137095e7
Reproduction steps
const grpc = require("@grpc/grpc-js");
const { GrpcTransport } = require("@protobuf-ts/grpc-transport");
// Client module generated by protobuf-ts
const tc = require("./proto-gen/demo/demoapp/v1/demoapp.client");

// Define a function to create a gRPC client with round_robin load balancing
function createGrpcClient(url) {
  const rpcTransport = new GrpcTransport({
    host: url,
    channelCredentials: grpc.credentials.createInsecure(),
    clientOptions: {
      'grpc.lb_policy_name': 'round_robin',
      'grpc.service_config': JSON.stringify({ loadBalancingConfig: [{ round_robin: {} }] })
    }
  });
  return new tc.TaxpnlgraphServiceClient(rpcTransport);
}
For the server-side backends, we use DNS-based load balancing.
In our case, the backend is a Go gRPC server: 5 instances, each with a different IP, behind a multivalue DNS record.
Locally, this can be reproduced with docker-compose aliases:
version: '3'
services:
  # Service keys must be unique; both share the same network alias
  server-1:
    image: grpc/java-example-hostname:1.68.1
    restart: always
    networks:
      default:
        aliases:
          - grpc-server.local
  server-2:
    image: grpc/java-example-hostname:1.68.1
    restart: always
    networks:
      default:
        aliases:
          - grpc-server.local
Environment
- OS name, version and architecture: public.ecr.aws/docker/library/node:v16-alpine3.17, Alpine x64
- Node version: ^16
- Node installation method: npm ci
- Package name and version: "@grpc/grpc-js": "1.11.3", "@protobuf-ts/grpc-transport": "2.9.4"
Additional context
What we did:
- At first there were 8 backend instances and we ran the script. The script reached all of them.
- We redeployed the service and deployment, which replaced all 8 backends, and the script was able to reconnect (this was fixed in 1.11.3).
- We stopped and removed 1 instance. The script was still connected to the remaining 7.
- We started and added 1 instance. The script was still connected to only 7.
- Here we expected a DNS refresh and one additional connection.
- We waited an additional 5 minutes. No change.
- More notes. New tasks: 10.0.172.74, 10.0.131.158; -1 task was 10.0.159.49; +1 task was 10.0.185.163
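To separate "DNS has not been updated yet" from "the client has not re-resolved", it can help to watch what the DNS name returns while running the steps above. The following is only a minimal sketch: diffAddresses and watchDns are hypothetical helper names, and it assumes Node's built-in dns.promises.resolve4 sees the same records that the gRPC client would.

```javascript
const dns = require('dns');

// Hypothetical helper: compare two lists of IPs and report what changed.
function diffAddresses(before, after) {
  const beforeSet = new Set(before);
  const afterSet = new Set(after);
  return {
    added: after.filter((ip) => !beforeSet.has(ip)),
    removed: before.filter((ip) => !afterSet.has(ip)),
  };
}

// Hypothetical helper: poll the A records of `hostname` and log every change.
// Runs until the process is stopped.
async function watchDns(hostname, intervalMs) {
  let previous = [];
  for (;;) {
    try {
      const current = (await dns.promises.resolve4(hostname)).sort();
      const { added, removed } = diffAddresses(previous, current);
      if (added.length > 0 || removed.length > 0) {
        console.log(new Date().toISOString(), { added, removed });
      }
      previous = current;
    } catch (err) {
      console.error('DNS lookup failed:', err.code);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage (not started automatically), e.g. with the docker-compose alias:
// watchDns('grpc-server.local', 5000);
```

If the watcher logs the new IP but the client never connects to it, the records are being published and the client simply has not resolved the name again.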
Our general recommendation is that servers should drop connections periodically to signal to clients that they should update name resolution information. In grpc-js, you can do this by using the grpc.max_connection_age_ms and grpc.max_connection_age_grace_ms options on the server. The grpc.max_connection_age_ms should be tuned based on how frequently you expect clients to need to get new DNS resolution information. The grpc.max_connection_age_grace_ms controls how long a server will spend processing requests on a connection after telling the client to stop using that connection, so you should set that based on the longest it generally takes the server to process a request.
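As a rough illustration of the recommendation above, a grpc-js server could be created with both options set. This is a sketch, not a tuned configuration: buildServerOptions and createServer are hypothetical helper names, and the 5 minute and 30 second values are placeholder assumptions that should be replaced with numbers that fit the deployment.

```javascript
// Hypothetical helper: maps the two tuning decisions onto the server options.
// refreshIntervalMs: how often clients should be pushed to re-resolve DNS.
// longestRequestMs: roughly the longest time the server needs to finish a request.
function buildServerOptions(refreshIntervalMs, longestRequestMs) {
  return {
    'grpc.max_connection_age_ms': refreshIntervalMs,
    'grpc.max_connection_age_grace_ms': longestRequestMs,
  };
}

// Hypothetical helper: creates a grpc-js server with those options.
// @grpc/grpc-js is loaded here so the options helper stays dependency-free.
function createServer(refreshIntervalMs, longestRequestMs) {
  const grpc = require('@grpc/grpc-js');
  return new grpc.Server(buildServerOptions(refreshIntervalMs, longestRequestMs));
}

// Placeholder values (assumptions): 5 minutes max age, 30 seconds grace.
const exampleOptions = buildServerOptions(5 * 60 * 1000, 30 * 1000);
console.log(exampleOptions);
```

Servers written with other gRPC implementations need their own equivalent setting; this sketch covers grpc-js only.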
As a side note, the grpc.lb_policy_name is obsolete, and the grpc.service_config option is its replacement. There is no reason to specify both.
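Applied to the client options from the reproduction script, that would mean keeping only grpc.service_config. A minimal sketch, where buildRoundRobinClientOptions is a hypothetical helper name:

```javascript
// Hypothetical helper: client options that select round_robin through the
// service config only, without the obsolete grpc.lb_policy_name option.
function buildRoundRobinClientOptions() {
  return {
    'grpc.service_config': JSON.stringify({
      loadBalancingConfig: [{ round_robin: {} }],
    }),
  };
}

console.log(buildRoundRobinClientOptions());
```

The returned object could then be passed as clientOptions to GrpcTransport in place of the two-option object used in the reproduction.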
Thanks for the recommendation. Sounds good.
I will try it and update this issue as soon as I have any news.