Kong/kong

Upstream continuously flip flops between HEALTHY and UNHEALTHY when DNS TTL = 0

chris-branch opened this issue · 15 comments

Summary

If an upstream target is configured with a domain name where the DNS lookup for that domain name results in a TTL value of 0, Kong will continuously toggle the upstream state between Unhealthy and then Healthy again every second. This results in a large number of error log entries even when using the default log level (NOTICE). Example log output down below.

In our situation, we have several upstreams for internal services that are intentionally configured with TTL = 0 for load balancing/failover purposes that we are not able to modify.

Steps To Reproduce

To reproduce this issue, you'll need a domain name that has TTL = 0. I put together a test case that runs a CoreDNS server in a container in order to satisfy this requirement for testing.

  1. Clone this repo: https://github.com/chris-branch/kong-dnsbug
  2. Run ./docker-up.sh to get everything up and running (CoreDNS, Cassandra database, Kong migrations + bootstrap, Kong)
  3. Execute the following curl commands to add one upstream, and one target for that upstream:
curl --location --request POST 'http://localhost:8001/upstreams' \
--header 'Content-Type: application/json' \
--data-raw '{
    "id": "58c8ccbb-eafb-4566-991f-2ed4f678fa70",
    "name": "example-upstream",
    "algorithm": "round-robin",
    "hash_on": "none",
    "hash_fallback": "none",
    "hash_on_cookie_path": "/",
    "slots": 1000,
    "host_header": "example.com"
}'

curl --location --request POST 'http://localhost:8001/upstreams/58c8ccbb-eafb-4566-991f-2ed4f678fa70/targets' \
--header 'Content-Type: application/json' \
--data-raw '{
    "id": "a3395f66-2af6-4c79-bea2-1b6933764f80",
    "target": "example.app.internal:3000",
    "weight": 100
}'
  1. View Kong's error log (docker logs -f kong) and note the upstream continually flip-flops between Unhealthy and Healthy. Also note that DNS entries are repeatedly added/removed from the cache.

Additional Details & Logs

Example abbreviated log output:

2020/01/21 05:50:02 [debug] 35#0: *20777 [lua] base.lua:1205: cb_expire(): [ringbalancer 1] executing requery for: example.app.internal
2020/01/21 05:50:02 [debug] 35#0: *20777 [lua] base.lua:550: queryDns(): [ringbalancer 1] querying dns for example.app.internal
2020/01/21 05:50:02 [debug] 35#0: *20777 [lua] base.lua:465: f(): [ringbalancer 1] dns record type changed for example.app.internal, 33 -> 1
2020/01/21 05:50:02 [debug] 35#0: *20777 [lua] base.lua:222: disable(): [ringbalancer 1] disabling address: example.app.internal:3000 (host example.app.internal)
2020/01/21 05:50:02 [debug] 35#0: *20777 [lua] base.lua:246: change(): [ringbalancer 1] changing address weight: example.app.internal:3000(host example.app.internal) 100 -> 0
2020/01/21 05:50:02 [debug] 35#0: *20777 [lua] base.lua:330: newAddress(): [ringbalancer 1] new address for host 'example.app.internal' created: 192.168.65.2:3000 (weight 100)
2020/01/21 05:50:02 [debug] 35#0: *20777 [lua] base.lua:529: f(): [ringbalancer 1] updating balancer based on dns changes for example.app.internal
2020/01/21 05:50:02 [debug] 35#0: *20777 [lua] ring.lua:238: redistributeIndices(): [ringbalancer 1] redistributed indices, size=1000, dropped=1000, assigned=1000, left unassigned=0
2020/01/21 05:50:02 [debug] 35#0: *20777 [lua] base.lua:235: delete(): [ringbalancer 1] deleting address: example.app.internal:3000 (host example.app.internal)
2020/01/21 05:50:02 [debug] 35#0: *20777 [lua] base.lua:539: f(): [ringbalancer 1] querying dns and updating for example.app.internal completed
2020/01/21 05:50:02 [warn] 35#0: *20781 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to UNHEALTHY, context: ngx.timer
2020/01/21 05:50:02 [warn] 35#0: *20782 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to HEALTHY, context: ngx.timer
2020/01/21 05:50:02 [debug] 35#0: *20783 [lua] healthcheck.lua:1068: log(): [healthcheck] (example-upstream) adding an existing target: example.app.internal 192.168.65.2:3000 (ignoring)
2020/01/21 05:50:02 [debug] 37#0: *20787 [lua] base.lua:1205: cb_expire(): [ringbalancer 1] executing requery for: example.app.internal
2020/01/21 05:50:02 [debug] 37#0: *20787 [lua] base.lua:550: queryDns(): [ringbalancer 1] querying dns for example.app.internal
2020/01/21 05:50:02 [debug] 37#0: *20787 [lua] base.lua:465: f(): [ringbalancer 1] dns record type changed for example.app.internal, 33 -> 1
2020/01/21 05:50:02 [debug] 37#0: *20787 [lua] base.lua:222: disable(): [ringbalancer 1] disabling address: example.app.internal:3000 (host example.app.internal)
2020/01/21 05:50:02 [debug] 37#0: *20787 [lua] base.lua:246: change(): [ringbalancer 1] changing address weight: example.app.internal:3000(host example.app.internal) 100 -> 0
2020/01/21 05:50:02 [debug] 37#0: *20787 [lua] base.lua:330: newAddress(): [ringbalancer 1] new address for host 'example.app.internal' created: 192.168.65.2:3000 (weight 100)
2020/01/21 05:50:02 [debug] 37#0: *20787 [lua] base.lua:529: f(): [ringbalancer 1] updating balancer based on dns changes for example.app.internal
2020/01/21 05:50:02 [debug] 37#0: *20787 [lua] ring.lua:238: redistributeIndices(): [ringbalancer 1] redistributed indices, size=1000, dropped=1000, assigned=1000, left unassigned=0
2020/01/21 05:50:02 [debug] 37#0: *20787 [lua] base.lua:235: delete(): [ringbalancer 1] deleting address: example.app.internal:3000 (host example.app.internal)
2020/01/21 05:50:02 [debug] 37#0: *20787 [lua] base.lua:539: f(): [ringbalancer 1] querying dns and updating for example.app.internal completed
2020/01/21 05:50:02 [warn] 37#0: *20791 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to UNHEALTHY, context: ngx.timer
2020/01/21 05:50:02 [warn] 37#0: *20792 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to HEALTHY, context: ngx.timer
2020/01/21 05:50:02 [debug] 37#0: *20793 [lua] healthcheck.lua:1068: log(): [healthcheck] (example-upstream) adding an existing target: example.app.internal 192.168.65.2:3000 (ignoring)
2020/01/21 05:50:02 [debug] 34#0: *20801 [lua] base.lua:1205: cb_expire(): [ringbalancer 1] executing requery for: example.app.internal
2020/01/21 05:50:02 [debug] 34#0: *20801 [lua] base.lua:550: queryDns(): [ringbalancer 1] querying dns for example.app.internal
2020/01/21 05:50:02 [debug] 34#0: *20801 [lua] base.lua:465: f(): [ringbalancer 1] dns record type changed for example.app.internal, 33 -> 1
2020/01/21 05:50:02 [debug] 34#0: *20801 [lua] base.lua:222: disable(): [ringbalancer 1] disabling address: example.app.internal:3000 (host example.app.internal)
2020/01/21 05:50:02 [debug] 34#0: *20801 [lua] base.lua:246: change(): [ringbalancer 1] changing address weight: example.app.internal:3000(host example.app.internal) 100 -> 0
2020/01/21 05:50:02 [debug] 34#0: *20801 [lua] base.lua:330: newAddress(): [ringbalancer 1] new address for host 'example.app.internal' created: 192.168.65.2:3000 (weight 100)
2020/01/21 05:50:02 [debug] 34#0: *20801 [lua] base.lua:529: f(): [ringbalancer 1] updating balancer based on dns changes for example.app.internal
2020/01/21 05:50:02 [debug] 34#0: *20801 [lua] ring.lua:238: redistributeIndices(): [ringbalancer 1] redistributed indices, size=1000, dropped=1000, assigned=1000, left unassigned=0
2020/01/21 05:50:02 [debug] 34#0: *20801 [lua] base.lua:235: delete(): [ringbalancer 1] deleting address: example.app.internal:3000 (host example.app.internal)
2020/01/21 05:50:02 [debug] 34#0: *20801 [lua] base.lua:539: f(): [ringbalancer 1] querying dns and updating for example.app.internal completed
2020/01/21 05:50:02 [warn] 34#0: *20805 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to UNHEALTHY, context: ngx.timer
2020/01/21 05:50:02 [warn] 34#0: *20806 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to HEALTHY, context: ngx.timer
2020/01/21 05:50:02 [debug] 34#0: *20807 [lua] healthcheck.lua:1068: log(): [healthcheck] (example-upstream) adding an existing target: example.app.internal 192.168.65.2:3000 (ignoring)
2020/01/21 05:50:02 [debug] 36#0: *20811 [lua] base.lua:1205: cb_expire(): [ringbalancer 1] executing requery for: example.app.internal
2020/01/21 05:50:02 [debug] 36#0: *20811 [lua] base.lua:550: queryDns(): [ringbalancer 1] querying dns for example.app.internal
2020/01/21 05:50:02 [debug] 36#0: *20811 [lua] base.lua:465: f(): [ringbalancer 1] dns record type changed for example.app.internal, 33 -> 1
2020/01/21 05:50:02 [debug] 36#0: *20811 [lua] base.lua:222: disable(): [ringbalancer 1] disabling address: example.app.internal:3000 (host example.app.internal)
2020/01/21 05:50:02 [debug] 36#0: *20811 [lua] base.lua:246: change(): [ringbalancer 1] changing address weight: example.app.internal:3000(host example.app.internal) 100 -> 0
2020/01/21 05:50:02 [debug] 36#0: *20811 [lua] base.lua:330: newAddress(): [ringbalancer 1] new address for host 'example.app.internal' created: 192.168.65.2:3000 (weight 100)
2020/01/21 05:50:02 [debug] 36#0: *20811 [lua] base.lua:529: f(): [ringbalancer 1] updating balancer based on dns changes for example.app.internal
2020/01/21 05:50:02 [debug] 36#0: *20811 [lua] ring.lua:238: redistributeIndices(): [ringbalancer 1] redistributed indices, size=1000, dropped=1000, assigned=1000, left unassigned=0
2020/01/21 05:50:02 [debug] 36#0: *20811 [lua] base.lua:235: delete(): [ringbalancer 1] deleting address: example.app.internal:3000 (host example.app.internal)
2020/01/21 05:50:02 [debug] 36#0: *20811 [lua] base.lua:539: f(): [ringbalancer 1] querying dns and updating for example.app.internal completed
2020/01/21 05:50:02 [warn] 36#0: *20815 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to UNHEALTHY, context: ngx.timer
2020/01/21 05:50:02 [warn] 36#0: *20816 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to HEALTHY, context: ngx.timer
2020/01/21 05:50:02 [debug] 36#0: *20817 [lua] healthcheck.lua:1068: log(): [healthcheck] (example-upstream) adding an existing target: example.app.internal 192.168.65.2:3000 (ignoring)
2020/01/21 05:50:03 [debug] 32#0: *20819 [lua] base.lua:1205: cb_expire(): [ringbalancer 1] executing requery for: example.app.internal
2020/01/21 05:50:03 [debug] 32#0: *20819 [lua] base.lua:550: queryDns(): [ringbalancer 1] querying dns for example.app.internal
2020/01/21 05:50:03 [debug] 32#0: *20819 [lua] base.lua:432: f(): [ringbalancer 1] ttl=0 detected for example.app.internal
2020/01/21 05:50:03 [debug] 32#0: *20819 [lua] base.lua:465: f(): [ringbalancer 1] dns record type changed for example.app.internal, 1 -> 33
2020/01/21 05:50:03 [debug] 32#0: *20819 [lua] base.lua:222: disable(): [ringbalancer 1] disabling address: 192.168.65.2:3000 (host example.app.internal)
2020/01/21 05:50:03 [debug] 32#0: *20819 [lua] base.lua:246: change(): [ringbalancer 1] changing address weight: 192.168.65.2:3000(host example.app.internal) 100 -> 0
2020/01/21 05:50:03 [debug] 32#0: *20819 [lua] base.lua:330: newAddress(): [ringbalancer 1] new address for host 'example.app.internal' created: example.app.internal:3000 (weight 100)
2020/01/21 05:50:03 [debug] 32#0: *20819 [lua] base.lua:529: f(): [ringbalancer 1] updating balancer based on dns changes for example.app.internal
2020/01/21 05:50:03 [debug] 32#0: *20819 [lua] ring.lua:238: redistributeIndices(): [ringbalancer 1] redistributed indices, size=1000, dropped=1000, assigned=1000, left unassigned=0
2020/01/21 05:50:03 [debug] 32#0: *20819 [lua] base.lua:235: delete(): [ringbalancer 1] deleting address: 192.168.65.2:3000 (host example.app.internal)
2020/01/21 05:50:03 [debug] 32#0: *20819 [lua] base.lua:539: f(): [ringbalancer 1] querying dns and updating for example.app.internal completed
2020/01/21 05:50:03 [warn] 32#0: *20825 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to UNHEALTHY, context: ngx.timer, client: 172.20.0.1, server: 0.0.0.0:8001
2020/01/21 05:50:03 [warn] 32#0: *20826 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to HEALTHY, context: ngx.timer, client: 172.20.0.1, server: 0.0.0.0:8001
2020/01/21 05:50:03 [debug] 32#0: *20827 [lua] events.lua:211: do_event_json(): worker-events: handling event; source=lua-resty-healthcheck [example-upstream], event=healthy, pid=32, data=table: 0x7f60ab93dc50
2020/01/21 05:50:03 [debug] 32#0: *20827 [lua] healthcheck.lua:1068: log(): [healthcheck] (example-upstream) event: target added 'example.app.internal(example.app.internal:3000)'
2020/01/21 05:50:03 [debug] 32#0: *20827 [lua] healthcheck.lua:1068: log(): [healthcheck] (example-upstream) event: target status 'example.app.internal(example.app.internal:3000)' from 'false' to 'true'
2020/01/21 05:50:03 [debug] 32#0: *20828 [lua] events.lua:211: do_event_json(): worker-events: handling event; source=lua-resty-healthcheck [example-upstream], event=remove, pid=32, data=table: 0x7f60ab93eb28
2020/01/21 05:50:03 [debug] 32#0: *20828 [lua] healthcheck.lua:1068: log(): [healthcheck] (example-upstream) event: target 'example.app.internal (192.168.65.2:3000' removed

Kong version: 1.4.3 (also see the same issue with 1.4.2)
Operating system: Docker running on Mac OS 10.14.6 (Mojave) and also Pivotal Cloud Foundry

this seems to be a snippet from the log. Can you provide the start of the log?

I would expect to see some (or 1) of these log messages: https://github.com/Kong/lua-resty-dns-client/blob/master/src/resty/dns/balancer/base.lua#L432-L433

The log snippet above does include the "ttl=0 detected" message. See:

2020/01/21 05:50:03 [debug] 32#0: *20819 [lua] base.lua:432: f(): [ringbalancer 1] ttl=0 detected for example.app.internal

I've also attached a full log showing startup to the first 5 minutes. It appears the health check issue is repeating once per minute in this log. In our production environment, it seems to occur more frequently.
errorlog.log

reducing that log file to 33#0 gives me:

2020/01/21 16:30:13 [debug] 33#0: *1986 [lua] base.lua:432: f(): [ringbalancer 1] ttl=0 detected for example.app.internal
2020/01/21 16:31:16 [debug] 33#0: *3674 [lua] base.lua:432: f(): [ringbalancer 1] ttl=0 detected for example.app.internal
2020/01/21 16:32:18 [debug] 33#0: *5331 [lua] base.lua:432: f(): [ringbalancer 1] ttl=0 detected for example.app.internal
2020/01/21 16:33:20 [debug] 33#0: *6985 [lua] base.lua:432: f(): [ringbalancer 1] ttl=0 detected for example.app.internal

Which is once per minute. Which makes sense since TTL_0_RETRY is set to 60 seconds here: https://github.com/Kong/lua-resty-dns-client/blob/master/src/resty/dns/balancer/base.lua#L129

@chris-branch It should be the same in your production environment, can you check that? Please note that this is per nginx worker, so like I did above (selecting only 33#0), you have to reduce it to a single worker.

this is interesting:

2020/01/21 16:30:13 [debug] 33#0: *1986 [lua] base.lua:465: f(): [ringbalancer 1] dns record type changed for example.app.internal, nil -> 33
2020/01/21 16:31:14 [debug] 33#0: *3578 [lua] base.lua:465: f(): [ringbalancer 1] dns record type changed for example.app.internal, 33 -> 1
2020/01/21 16:31:16 [debug] 33#0: *3674 [lua] base.lua:465: f(): [ringbalancer 1] dns record type changed for example.app.internal, 1 -> 33
2020/01/21 16:32:16 [debug] 33#0: *5231 [lua] base.lua:465: f(): [ringbalancer 1] dns record type changed for example.app.internal, 33 -> 1
2020/01/21 16:32:18 [debug] 33#0: *5331 [lua] base.lua:465: f(): [ringbalancer 1] dns record type changed for example.app.internal, 1 -> 33
2020/01/21 16:33:18 [debug] 33#0: *6891 [lua] base.lua:465: f(): [ringbalancer 1] dns record type changed for example.app.internal, 33 -> 1
2020/01/21 16:33:20 [debug] 33#0: *6985 [lua] base.lua:465: f(): [ringbalancer 1] dns record type changed for example.app.internal, 1 -> 33

For ttl=0 we inject a "fake" SRV record (type 33). After the TTL_0_RETRY period (60 seconds) we check whether the record still has ttl=0. But then it flips record type, to an A record (type 1), only to immediately (1 or 2 secs) be flipped back to a "fake" SRV again.

There's two questions here;

  1. why does it flip record types (I have a hunch about that one)
  2. why does flipping the record type cause a health-status failure?

ad2) flipping a record type means we consider it a completely new record. We remove that record and all it's addresses (and also the associated healthchecks). And then add the new record (and its associated healthchecks). But why would anything be unhealthy here?

Looking at the logs again;

2020/01/21 16:30:13 [warn] 33#0: *1991 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to HEALTHY, context: ngx.timer, client: 172.20.0.1, server: 0.0.0.0:8001
2020/01/21 16:31:14 [warn] 33#0: *3581 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to UNHEALTHY, context: ngx.timer, client: 172.20.0.1, server: 0.0.0.0:8001
2020/01/21 16:31:14 [warn] 33#0: *3582 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to HEALTHY, context: ngx.timer, client: 172.20.0.1, server: 0.0.0.0:8001
2020/01/21 16:31:16 [warn] 33#0: *3677 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to UNHEALTHY, context: ngx.timer, client: 172.20.0.1, server: 0.0.0.0:8001
2020/01/21 16:31:16 [warn] 33#0: *3678 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to HEALTHY, context: ngx.timer, client: 172.20.0.1, server: 0.0.0.0:8001
2020/01/21 16:32:16 [warn] 33#0: *5234 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to UNHEALTHY, context: ngx.timer, client: 172.20.0.1, server: 0.0.0.0:8001
2020/01/21 16:32:16 [warn] 33#0: *5235 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to HEALTHY, context: ngx.timer, client: 172.20.0.1, server: 0.0.0.0:8001
2020/01/21 16:32:18 [warn] 33#0: *5334 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to UNHEALTHY, context: ngx.timer, client: 172.20.0.1, server: 0.0.0.0:8001
2020/01/21 16:32:18 [warn] 33#0: *5335 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to HEALTHY, context: ngx.timer, client: 172.20.0.1, server: 0.0.0.0:8001
2020/01/21 16:33:18 [warn] 33#0: *6894 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to UNHEALTHY, context: ngx.timer, client: 172.20.0.1, server: 0.0.0.0:8001
2020/01/21 16:33:18 [warn] 33#0: *6895 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to HEALTHY, context: ngx.timer, client: 172.20.0.1, server: 0.0.0.0:8001
2020/01/21 16:33:20 [warn] 33#0: *6988 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to UNHEALTHY, context: ngx.timer, client: 172.20.0.1, server: 0.0.0.0:8001
2020/01/21 16:33:20 [warn] 33#0: *6989 [lua] balancer.lua:242: callback(): [healthchecks] balancer example-upstream reported health status changed to HEALTHY, context: ngx.timer, client: 172.20.0.1, server: 0.0.0.0:8001

It seems that each flip of record type is followed by an UNHEALTHY, and right after a HEALTHY event. Which leads me to believe the entry that is being removed is marked unhealthy, and the new entry being added is marked as healthy. (due to the underlying problem they just happen to be the same IP+Port combination)

So maybe that part is still ok. Thoughts on this @hishamhm @locao ?

Leaves problem 1 for now.

I think the problem was introduced here: https://github.com/Kong/lua-resty-dns-client/pull/56/files

@chris-branch can you try and change this line:

  if (newQuery[1] or empty).ttl == 0 and ((oldQuery[1] or empty).ttl or 0) == 0 then

(here: https://github.com/Kong/lua-resty-dns-client/blob/master/src/resty/dns/balancer/base.lua#L416)

into

  if (newQuery[1] or empty).ttl == 0 and (
     ((oldQuery[1] or empty).ttl or 0) == 0 or 
       oldQuery[1].__ttl0Flag )then

and see if the problem goes away?

@Tieske - I modified the if statement in base.lua:

  if (newQuery[1] or empty).ttl == 0 and                                                 
     (((oldQuery[1] or empty).ttl or 0) == 0 or oldQuery[1].__ttl0Flag) then   

However, that didn't seem to help. I still see the upstream flip-flopping between Unhealthy/Healthy, and the log looks mostly the same. New log attached here:
errorlog_0122_0143.log

Looking at the code, one thing I wondered about with the prior fix for the Route 53 issue is that it's keeping track of the lastQuery and comparing that against the newQuery. The assumption seems to be that DNS queries will not happen more than 1 time per second per host. But is that a valid assumption? queryDns() is called on a 1-second timer, but it's also called by addressStillValid() which is called from getPeer(). So, it seems like queryDns() can potentially be called much more frequently.

Regarding your question about the frequency in our production environment ... I think what I'm seeing matches up with what you said (1 occurrence per worker, per upstream, every 1 minute). However, I can't get the exact numbers because our production environment runs 8 instances of Kong across 2 datacenters and all of the error logs are aggregated to a common location, so the same PIDs can occur across instances, which makes it difficult to filter logs down to one specific worker on one specific instance. During a 2-minute period, I collected 600 UNHEALTHY/HEALTHY transitions across the entire cluster, which seems about right as a ballpark. It's definitely not 1 per second, per worker, per upstream like I originally thought. It just seems that way due to the number of instances + workers.

@chris-branch Sorry, my bad, I made a mistake, the __ttl0flag is set on the top level record, not on the individual entries. So the [1] should be removed. It should have been:

  if (newQuery[1] or empty).ttl == 0 and                                                 
     (((oldQuery[1] or empty).ttl or 0) == 0 or oldQuery.__ttl0Flag) then

Can you give that a try?

@Tieske - No worries. I applied the new code change, and that does appear to resolve the issue, at least with my small test scenario. I no longer see Unhealthy/Healthy flip flopping. I do see DNS being re-queried every 1 second, per worker, per upstream, but that appears to be intentional. Ideally, it would only do the timer-based query if/when the upstream is idle for 1 second. Since a DNS query already happens on every request when TTL = 0, the timer-based DNS query is sort of redundant if the upstream is serving requests frequently, but I don't think that's a huge issue.

I've attached the latest log in case you'd like to review: errorlog_0123_0245.log

Thanks for your help troubleshooting this!

the queries should run once per minute, not once per second, so that seems to be another separate (?) issue.

When a ttl=0 is encountered we inject a fake SRV record with a TTL of 60 seconds. This explicitly to update that record (for internal balancer purposes) only once per minute. Essentially we're only checking whether the ttl is still 0 or maybe it was changed.

This line sets the expire time of the injected record:
https://github.com/Kong/lua-resty-dns-client/blob/master/src/resty/dns/balancer/base.lua#L444

ttl0Interval is 60 seconds, so that record should only expire after 60 seconds.

The timer (runs once per second) checks for expired records here:

https://github.com/Kong/lua-resty-dns-client/blob/6c1bb6c0056bd6a8344b013faef4bc484d4d53fc/src/resty/dns/balancer/base.lua#L1204

Somehow the timer considers the record expired on each run.

Ok, I think I found it.

The first time the name is resolved for worker 34#0:

2020/01/23 02:45:39 [debug] 34#0: *13 [lua] base.lua:540: f(): [ringbalancer 1] querying dns and updating for example.app.internal completed

The first requery log:

2020/01/23 02:46:40 [debug] 34#0: *1606 [lua] base.lua:1206: cb_expire(): [ringbalancer 1] executing requery for: example.app.internal
2020/01/23 02:46:40 [debug] 34#0: *1606 [lua] base.lua:551: queryDns(): [ringbalancer 1] querying dns for example.app.internal

Looking at the timestamps there is 1 minute in between. So initially it works. The problem is that the expire field of the injected DNS record never gets updated, hence it remains in the past, and on every check it will then initiate a new update query.

@chris-branch can you give this another try?

On top of the previous change, add these lines:

    if oldQuery.__ttl0Flag then
      -- still ttl 0 so nothing changed
      oldQuery.touched = time()                                             --> Added
      oldQuery.expire = time() + self.balancer.ttl0Interval                 --> Added
      ngx_log(ngx_DEBUG, self.log_prefix, "no dns changes detected for ",
              self.hostname, ", still using ttl=0")
      return true
    end

@Tieske - The code in your PR looks good. I applied those changes (including the suggested change from @bungle) and executed my test setup. The log is clean, other than DNS refreshes every minute (as intended). New log file attached for your review: errorlog_0124_0203.log

I'm guessing this will be included in a 1.4.4 and/or 1.5.1 release at some point. Any idea how far in the future that might be?

@chris-branch awesome! thx so much for testing this.