heroku/heroku-kong

Cluster members lose communication (SSL error in keep alive)

mars opened this issue · 4 comments

mars commented

The following error appears in the logs when the app is scaled beyond a single dyno:

2016/02/08 22:46:39 [error] 91#0: [lua] cluster.lua:84: Cassandra error: 10.1.16.105, context: ngx.timer

It is unclear whether the error causes problems at runtime. The /cluster API status appears healthy, and the Kong proxy serves requests as expected.

This issue was originally opened against Kong itself, but the error later turned out to be reproducible only with this app.

mars commented

Update: the Kong cluster loses cohesion after several days. A restart fixes it, but the problem recurs within a few days. Even though all of the Kong instances are still running, each instance's Admin /cluster API lists only a single node (the instance itself). I suspect this Cassandra error comes from the cluster "keep alive" code.
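
For anyone who wants to watch for this condition, here is a rough Python sketch that flags a node whose /cluster listing has shrunk. It assumes the Admin API returns JSON shaped like {"data": [{"address": ..., "status": "alive"}, ...]}; that shape, the helper names, and the Admin URL passed to fetch_cluster are assumptions for illustration, not something confirmed in this thread.

```python
import json
from urllib.request import urlopen


def alive_members(payload):
    """Return addresses of nodes marked alive in a /cluster payload.

    Assumes the payload looks like {"data": [{"address": ..., "status": ...}]}.
    """
    return [
        node.get("address")
        for node in payload.get("data", [])
        if node.get("status") == "alive"
    ]


def has_lost_cohesion(payload, expected_nodes):
    """True when fewer nodes are listed as alive than dynos are running."""
    return len(alive_members(payload)) < expected_nodes


def fetch_cluster(admin_url):
    """Fetch and decode the /cluster listing from one node's Admin API.

    admin_url is a hypothetical base URL such as "http://localhost:8001".
    """
    with urlopen(admin_url.rstrip("/") + "/cluster", timeout=5) as resp:
        return json.load(resp)
```

Calling has_lost_cohesion(fetch_cluster(url), expected_nodes=2) against each instance would report True for any node that only sees itself.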

mars commented

Thanks @thibaultcha for the lead on improved Cassandra error messages. Here's what we see now:

2016/04/25 23:26:22 [error] 109#0: [lua] cluster.lua:80: Cassandra error: Error during SSL handshake with host at 10.1.46.97:9042: 18: self signed certificate, context: ngx.timer

This seems strange, since Cassandra over SSL works fine in other contexts. Only these cluster timers lose the certificate somehow. Any idea why a self-signed cert is assumed here?
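
For reference, OpenSSL verify code 18 means the peer presented a self-signed certificate that is not in the client's trust store. A rough way to reproduce that from outside Kong is the Python sketch below, which attempts a handshake with and without a CA file and returns the verify code. The helper names and the ca_file path are hypothetical; this only illustrates the failure mode and is not how Kong or ngx_lua load certificates.

```python
import socket
import ssl


def make_context(ca_file=None):
    """Build a verifying TLS context, optionally trusting ca_file.

    Hostname checking is disabled because the nodes are addressed by IP;
    certificate chain verification stays on.
    """
    ctx = ssl.create_default_context(cafile=ca_file)
    ctx.check_hostname = False
    return ctx


def handshake_verify_code(host, port, ca_file=None):
    """Return 0 if the handshake verifies, else the OpenSSL verify code.

    A self-signed server certificate that is not trusted yields 18.
    """
    ctx = make_context(ca_file)
    try:
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock):
                return 0
    except ssl.SSLCertVerificationError as exc:
        return exc.verify_code
```

If the call returns 18 without ca_file and 0 with it, the timer context is most likely connecting without the trusted certificate loaded.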

mars commented

Finally found the underlying bug, and fortunately the fix is in ngx_lua master for the next release 0.10.3.

mars commented

No longer an issue with Kong 0.11