canonical/zookeeper-k8s-operator

Upgrade and pod rescheduling failing with TLS

During upgrade tests, we noticed that TLS-related files are not correctly recreated when a pod is rescheduled, which prevents the cluster from recovering.

Steps to reproduce

  1. build from 5d9bb11e61174f4680ff9effeb56c7be18b03c18
  2. juju deploy ./zookeeper-k8s_ubuntu-22.04-amd64.charm -n 3 --trust --resource zookeeper-image=ghcr.io/canonical/charmed-zookeeper@sha256:dbdbd8367bf6d813b9aae1e15a6c1743f909db7555a47995b6b5d259e87f2af1
  3. juju deploy self-signed-certificates
  4. juju relate zookeeper-k8s self-signed-certificates
  5. juju run zookeeper-k8s/leader pre-upgrade-check --format yaml
Running operation 1 with 1 task
  - task 2 on unit-zookeeper-k8s-0

Waiting for task 2...
zookeeper-k8s/0:
  id: "2"
  results:
    return-code: 0
  status: completed
  timing:
    completed: 2024-02-16 11:52:50 +0000 UTC
    enqueued: 2024-02-16 11:52:48 +0000 UTC
    started: 2024-02-16 11:52:48 +0000 UTC
  unit: zookeeper-k8s/0
  6. juju refresh zookeeper-k8s --path ./zookeeper-k8s_ubuntu-22.04-amd64.charm

Expected behavior

The upgrade works fine and the units recover from pod rescheduling.

Actual behavior

Juju status at the end:

...
Unit                         Workload  Agent  Address      Ports  Message
self-signed-certificates/0*  active    idle   10.1.63.209
zookeeper-k8s/0*             active    idle   10.1.63.227
zookeeper-k8s/1              active    idle   10.1.63.232
zookeeper-k8s/2              blocked   idle   10.1.63.231
...

Juju debug-log:

unit-zookeeper-k8s-2: 11:56:14 INFO unit.zookeeper-k8s/2.juju-log Running legacy hooks/upgrade-charm.
unit-zookeeper-k8s-2: 11:56:16 INFO unit.zookeeper-k8s/2.juju-log zookeeper-k8s/2 initializing...
unit-zookeeper-k8s-2: 11:56:17 INFO unit.zookeeper-k8s/2.juju-log zookeeper-k8s/2 started
unit-self-signed-certificates-0: 11:56:24 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-zookeeper-k8s-1: 11:56:41 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-zookeeper-k8s-2: 11:58:00 ERROR unit.zookeeper-k8s/2.juju-log Not all application units are connected and broadcasting in the quorum
unit-zookeeper-k8s-2: 11:58:00 CRITICAL unit.zookeeper-k8s/2.juju-log Unit failed to upgrade and requires manual rollback to previous stable version.
    1. Re-run `pre-upgrade-check` action on the leader unit to enter 'recovery' state
    2. Run `juju refresh` to the previously deployed charm revision
unit-zookeeper-k8s-2: 11:58:00 INFO juju.worker.uniter.operation ran "upgrade-charm" hook (via hook dispatching script: dispatch)
unit-zookeeper-k8s-2: 11:58:00 INFO juju.worker.uniter found queued "config-changed" hook

Zookeeper logs show:

2024-02-16T12:07:59.538Z [zookeeper] 12:07:59.538 [QuorumConnectionThread-[myid=3]-25] DEBUG org.apache.zookeeper.server.quorum.QuorumCnxManager - Opening channel to server 2
2024-02-16T12:07:59.538Z [zookeeper] 12:07:59.538 [QuorumConnectionThread-[myid=3]-25] WARN org.apache.zookeeper.server.quorum.QuorumCnxManager - Cannot open secure channel to 2 at election address zookeeper-k8s-1.zookeeper-k8s-endpoints/10.1.63.232:3888
2024-02-16T12:07:59.538Z [zookeeper] org.apache.zookeeper.common.X509Exception$SSLContextException: Failed to create KeyManager
2024-02-16T12:07:59.538Z [zookeeper]    at org.apache.zookeeper.common.X509Util.createSSLContextAndOptionsFromConfig(X509Util.java:371)
2024-02-16T12:07:59.538Z [zookeeper]    at org.apache.zookeeper.common.X509Util.createSSLContextAndOptions(X509Util.java:349)
2024-02-16T12:07:59.538Z [zookeeper]    at org.apache.zookeeper.common.X509Util.createSSLContextAndOptions(X509Util.java:303)
2024-02-16T12:07:59.538Z [zookeeper]    at org.apache.zookeeper.common.X509Util.getDefaultSSLContextAndOptions(X509Util.java:283)
2024-02-16T12:07:59.538Z [zookeeper]    at org.apache.zookeeper.common.X509Util.createSSLSocket(X509Util.java:574)
2024-02-16T12:07:59.538Z [zookeeper]    at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:379)
2024-02-16T12:07:59.538Z [zookeeper]    at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:458)
2024-02-16T12:07:59.538Z [zookeeper]    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
2024-02-16T12:07:59.538Z [zookeeper]    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
2024-02-16T12:07:59.538Z [zookeeper]    at java.base/java.lang.Thread.run(Thread.java:833)
2024-02-16T12:07:59.538Z [zookeeper] Caused by: org.apache.zookeeper.common.X509Exception$KeyManagerException: java.io.FileNotFoundException: /etc/zookeeper/keystore.p12 (No such file or directory)
2024-02-16T12:07:59.538Z [zookeeper]    at org.apache.zookeeper.common.X509Util.createKeyManager(X509Util.java:492)
2024-02-16T12:07:59.538Z [zookeeper]    at org.apache.zookeeper.common.X509Util.createSSLContextAndOptionsFromConfig(X509Util.java:369)
2024-02-16T12:07:59.538Z [zookeeper]    ... 9 common frames omitted
2024-02-16T12:07:59.538Z [zookeeper] Caused by: java.io.FileNotFoundException: /etc/zookeeper/keystore.p12 (No such file or directory)
2024-02-16T12:07:59.538Z [zookeeper]    at java.base/java.io.FileInputStream.open0(Native Method)
2024-02-16T12:07:59.538Z [zookeeper]    at java.base/java.io.FileInputStream.open(FileInputStream.java:216)
2024-02-16T12:07:59.538Z [zookeeper]    at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
2024-02-16T12:07:59.538Z [zookeeper]    at org.apache.zookeeper.common.StandardTypeFileKeyStoreLoader.loadKeyStore(StandardTypeFileKeyStoreLoader.java:53)
2024-02-16T12:07:59.538Z [zookeeper]    at org.apache.zookeeper.common.X509Util.loadKeyStore(X509Util.java:425)
2024-02-16T12:07:59.538Z [zookeeper]    at org.apache.zookeeper.common.X509Util.createKeyManager(X509Util.java:481)
2024-02-16T12:07:59.538Z [zookeeper]    ... 10 common frames omitted
root@zookeeper-k8s-2:/# cd /etc/zookeeper/
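
The stack trace above points at a missing /etc/zookeeper/keystore.p12 on the blocked unit. A minimal sketch for confirming this from a shell inside the workload container is shown below. The `missing_files` helper is hypothetical (it is not part of the charm), and only the keystore.p12 path is taken from the log; any other TLS file paths would need to be added by hand.

```shell
# Hypothetical helper: print each given path that is missing or empty.
missing_files() {
  for path in "$@"; do
    # [ -s FILE ] is true only when FILE exists and has a size greater than zero
    [ -s "$path" ] || printf '%s\n' "$path"
  done
}

# Only this path is confirmed by the stack trace above.
missing_files /etc/zookeeper/keystore.p12
```

If the function prints the path, the keystore was not recreated after the pod was rescheduled; no output means the file is present and non-empty. Running the same check on a healthy unit (for example zookeeper-k8s/0) should give a useful comparison.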

Versions

Operating system: Ubuntu 22.04 LTS

Juju CLI: 3.1.7

Juju agent: 3.1.7

Charm revision: 41, upgraded to 50

microk8s: 1.29-strict/stable

installed:               v1.29.0             (6370) 168MB -