jetstack/navigator

Stalled ElasticSearch Upgrade

cehoffman opened this issue · 0 comments

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

The Elasticsearch upgrade from 6.2.4 to 6.3.0 stalled with the last two data and ingest pods not upgraded. The 3 masters upgraded, then pods 4, 3, and 2 of the data and ingest pool upgraded. Pods 1 and 0 did not upgrade, and the UpdateVersion loop in the navigator controller stopped.

What you expected to happen:

All pods upgraded.

How to reproduce it (as minimally and precisely as possible):

Create a 5 member data and ingest pool and a 3 member master pool at 6.2.4 with navigator 0.1.0, then upgrade the cluster to 6.3.0.

Anything else we need to know?:

It appears there was a mixup in how pilot updates the Elasticsearch version it reports. See https://gist.github.com/34927d24d0056967aba99c2f5a29ba7e

The d-0 and d-1 pilots indicate they are running Elasticsearch 6.3.0, but their pods never changed images. Seconds before this gist was captured (while the upgrade was already stalled), pilots d-0 and d-3 indicated version 6.2.4. That would be correct for d-0, but d-3 was using the 6.3.0 image.

It appears there is a misalignment in updating or detecting the Elasticsearch version in the pilot record.
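
To make the suspected misalignment concrete, here is a minimal sketch of the check I was doing by eye: compare the version in each pod's image tag against the version its pilot reports. This is not navigator code. The function name `find_version_mismatches` and the hand-typed snapshot are hypothetical, the pod names follow the `es-logging-d-N` pattern from the events below, and the pilot values for d-2 through d-4 are assumed to be 6.3.0 at the time of the gist.

```python
from typing import Dict, List


def find_version_mismatches(
    image_versions: Dict[str, str],
    pilot_versions: Dict[str, str],
) -> List[str]:
    """Return pod names whose pilot-reported version differs from the image tag.

    A pod with no pilot record at all is also reported as a mismatch.
    """
    return sorted(
        pod
        for pod, image_version in image_versions.items()
        if pilot_versions.get(pod) != image_version
    )


# Hand-typed snapshot of the stalled state (illustrative, not pulled from the API).
# d-0 and d-1 never changed images, so they are still on 6.2.4.
image_versions = {
    "es-logging-d-0": "6.2.4",
    "es-logging-d-1": "6.2.4",
    "es-logging-d-2": "6.3.0",
    "es-logging-d-3": "6.3.0",
    "es-logging-d-4": "6.3.0",
}

# What the pilots reported in the gist; d-2..d-4 values are an assumption.
pilot_versions = {
    "es-logging-d-0": "6.3.0",
    "es-logging-d-1": "6.3.0",
    "es-logging-d-2": "6.3.0",
    "es-logging-d-3": "6.3.0",
    "es-logging-d-4": "6.3.0",
}

mismatched = find_version_mismatches(image_versions, pilot_versions)
print(mismatched)
```

If the controller trusts the pilot-reported version, pods flagged this way would look already upgraded, which might explain why the UpdateVersion loop stopped before reaching d-1 and d-0.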

The events in the describe summary for the cluster are:

Events:
  Type     Reason            Age                From                  Message
  ----     ------            ----               ----                  -------
  Normal   UpdateVersion     40m (x2 over 1h)   navigator-controller  Updating replica es-logging-master-1 to version 6.3.0
  Warning  ErrUpdateVersion  37m (x8 over 37m)  navigator-controller  Pilot "es-logging-master-1" has not finished updating to version "6.3.0"
  Normal   UpdateVersion     31m (x3 over 1h)   navigator-controller  Updating replica es-logging-master-2 to version 6.3.0
  Normal   UpdateVersion     29m                navigator-controller  Updated node pool "master" to version "6.3.0"
  Normal   UpdateVersion     24m                navigator-controller  Updating replica es-logging-d-2 to version 6.3.0

The master upgrade shows a number of failures because ES_JAVA_OPTS does not work as an override in the container image with the 0.1.0 pilot. I had to set the min/max heap in the jvm.options file instead.

Environment:

  • Kubernetes version (use kubectl version): 1.9.6
  • Cloud provider or hardware configuration: Azure
  • Install tools: Helm
  • Others: