cadence-workflow/cadence

Ringpop discovery churn due to DNS truncation


Is your feature request related to a problem? Please describe.

We run Cadence with DNS-based ringpop discovery. After scaling the frontend to more than 3 instances, we see a large number of log lines such as:

"Add new peers by DNS lookup" and "Remove stale peers by DNS lookup"

Proposed Solution

We believe this is caused by DNS truncation of the now-larger responses. Prior to Go 1.19, the pure Go resolver in the net package accepted DNS replies of at most 512 bytes, so larger responses were truncated. Upgrading to Go 1.19 or later should fix this problem.

From the Go 1.19 release notes:

The pure Go resolver will now use EDNS(0) to include a suggested maximum reply packet length, permitting reply packets to contain up to 1232 bytes (the previous maximum was 512).
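A quick way to check whether a given binary is affected is to look at the Go version it was built with. The following is only a sketch: `resolverUsesEDNS0` is a hypothetical helper, not part of Cadence or ringpop, and it only reasons about the pure Go resolver described in the release note (builds that use the cgo resolver behave according to the system's libc instead).

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"strings"
)

// resolverUsesEDNS0 is a hypothetical helper. It reports whether a Go version
// string such as "go1.19.3" is 1.19 or newer, i.e. whether the pure Go
// resolver advertises EDNS(0) according to the Go 1.19 release notes.
// Unparseable versions (for example development builds) return false.
func resolverUsesEDNS0(goVersion string) bool {
	v := strings.TrimPrefix(goVersion, "go")
	parts := strings.SplitN(v, ".", 3)
	if len(parts) < 2 {
		return false
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	// The minor component may carry a suffix such as "19rc1";
	// keep only the leading digits.
	minorDigits := parts[1]
	for i, r := range minorDigits {
		if r < '0' || r > '9' {
			minorDigits = minorDigits[:i]
			break
		}
	}
	minor, err := strconv.Atoi(minorDigits)
	if err != nil {
		return false
	}
	return major > 1 || (major == 1 && minor >= 19)
}

func main() {
	v := runtime.Version()
	fmt.Printf("built with %s; pure Go resolver uses EDNS(0): %t\n", v, resolverUsesEDNS0(v))
}
```

Running this inside the same image or toolchain used to build the Cadence binaries indicates which side of the 1.19 boundary the build falls on.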

Additional context

Example of DNS truncation

> dig cadence.service.consul +noedns +short
10.0.0.216
10.0.0.145
10.0.0.79
> dig cadence.service.consul +short
10.0.0.145
10.0.0.216
10.0.0.233
10.0.0.79
10.0.0.17
10.0.0.185

Log details

{
  "service": "cadence-frontend",
  "message": "Add new peers by DNS lookup",
  "attributes": {
    "addresses": "[10.0.0.145:7833]",
    "address": "cadence.service.consul",
    "level": "info",
    "service": "cadence-frontend",
    "logging-call-at": "dns_updater.go:80"
  }
}
{
  "service": "cadence-frontend",
  "message": "Remove stale peers by DNS lookup",
  "attributes": {
    "addresses": "[10.0.0.17:7833]",
    "address": "cadence.service.consul",
    "level": "info",
    "service": "cadence-frontend",
    "logging-call-at": "dns_updater.go:80"
  }
}

Thanks!

🤦 I just realized the recent builds are on a newer version of Golang. Closing!