vmware-archive/admiral

Powering on several containers post successful upgrade is failing with "Retries are prevented. Failure: javax.net.ssl.SSLHandshakeException: General SSLEngine problem; Reason: {"message":"javax.net.ssl.SSLHandshakeException: General SSLEngine problem","stackTrace":"

lgayatri opened this issue · 13 comments

@lgayatri commented on Thu Mar 15 2018

User Statement:

Powering on several containers post successful upgrade is failing with

image

VIC version:
vic-dev-v1.4.0-dev-4071-6f031600.ova

Steps to reproduce:

  1. This OVA was upgraded from 1.3.1 to 1.4.
  2. The containers that failed to power on were created on 1.3.1.
  3. However, when a single container is chosen to power on, the operation succeeds.

image

Logs:

Will be provided in channel.
@mdubya66

xenonHost.0.zip
Attaching only one admiral log due to size restriction.

Any update on this bug please?

@lazarin @mshipkovenski , I hope you can analyze this from the attached logs. Please confirm whether I can tear down the setup and move to the next build. Reproducing this bug again is very time consuming, given that it is hit after a scale upgrade.

@lgayatri I checked the machine, but couldn't reproduce the issue. However, some containers failed to start with another error. For example:

docker -H 10.197.37.137:2376 --tls start youthful_stonebraker
Error response from daemon: Server error from portlayer: unable to wait for process launch status: container VM has unexpectedly powered off
Error: failed to start containers: youthful_stonebraker

Out of 50 hosts, 4 are down. Each host except one has the same credentials assigned. BTW, I didn't need to use credentials to connect to any host. The other host (vic-st-h2-179.eng.vmware.com) with the special credentials (ghicken-certs) is no longer available.

I noticed there is an additional credential called "Cert" which is not used by any host. The certificate seems corrupted because I cannot decode it. Do you know why "Cert" was added?

That's not the error I meant. Please multi-select about 100 containers and power them on. You will see the reported issue.
vic-st-h2-179.eng.vmware.com was deployed for a different customer issue. I'm not sure about "Cert".

I see now, it's not about the number of containers. If I select even a single container in ERROR state (that is, its host is unavailable), the error is reproduced. If I batch-select only stopped containers, the problem is not reproducible.

Admiral doesn't show start/stop operations for single containers in ERROR state:

[Screenshot, 2018-03-20 18:55:31]

We must come up with a solution for multi-selection of containers in any state.

@mshipkovenski , I need the test bed. Can you please let me know if I can take it down?

@mshipkovenski I only chose containers in powered off state to report this bug.

@lgayatri , please take it down if you need to.

The problem here is that at least one of the hosts that belonged to a selected container was down. I tried the following on your environment:

  1. Selected a few STOPPED containers and powered them on - the request passed.
  2. Selected about 100 containers in any state. My selection included containers in ERROR state as well.
    The original problem was reproduced.
  3. Selected 100 containers but excluded those in ERROR state. Request passed.
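
The three cases above suggest one possible mitigation for batch operations: drop containers in ERROR state from the selection before sending the power-on request. The following is only a minimal sketch of that idea, not Admiral code. The `Container` class, its `power_state` field, and `split_by_error_state` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


# Hypothetical model: Admiral's real container state objects are not shown in this thread.
@dataclass
class Container:
    name: str
    power_state: str  # e.g. "STOPPED", "RUNNING", "ERROR"


def split_by_error_state(selection):
    """Split a multi-selection into containers to operate on and containers to skip.

    Mirrors step 3 above: everything except containers in ERROR state is kept.
    """
    kept = [c for c in selection if c.power_state != "ERROR"]
    skipped = [c for c in selection if c.power_state == "ERROR"]
    return kept, skipped


# Made-up selection that mixes states, as in step 2 above.
selection = [
    Container("web-1", "STOPPED"),
    Container("db-1", "ERROR"),
    Container("web-2", "RUNNING"),
]
kept, skipped = split_by_error_state(selection)
print("power on:", [c.name for c in kept])
print("skipped (ERROR):", [c.name for c in skipped])
```

Reporting the skipped containers back to the user, rather than silently dropping them, would keep the batch result understandable.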

Containers are marked in ERROR state if the host they belong to is unavailable for data collection (for 3 data collections with a few minutes between each) or if the container health check has failed. My guess is that even though the containers were not marked in ERROR state yet, their hosts were already gone.
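
To make the timing window in that guess concrete, here is a rough sketch of the marking rule as described in this comment. It is not taken from the Admiral source. The function name, its parameters, and the assumption that the 3 failed data collections are consecutive are all mine.

```python
# Threshold taken from the comment above ("3 data collections").
FAILED_COLLECTIONS_THRESHOLD = 3


def should_mark_error(failed_collections, health_check_ok):
    """Return True if a container would be marked in ERROR state.

    failed_collections: number of failed data collections for the container's
        host (assumed here to be consecutive failures).
    health_check_ok: result of the container health check.
    """
    return failed_collections >= FAILED_COLLECTIONS_THRESHOLD or not health_check_ok


# The host is already unreachable, but only one data collection has failed so far:
# the container is not yet in ERROR state, so it can still end up in a batch selection.
print(should_mark_error(failed_collections=1, health_check_ok=True))
# After the third failed collection the container is marked in ERROR state.
print(should_mark_error(failed_collections=3, health_check_ok=True))
```

With a few minutes between collections, a container on a host that has just gone down can stay selectable as STOPPED for several minutes, which would match the original report.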

@lgayatri one more thing: I can reproduce the error with host 10.197.37.180. The certificate imported for it is valid until February 16, 2019 11:36 AM. If I try to configure the host in another Admiral instance, the certificate I get is valid until February 28, 2019 9:39 PM, so the two certificates are different, which explains the error message. Do you have any idea why that happened?
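
The same comparison can be done outside Admiral by fetching the certificate the host currently presents and comparing its fingerprint with the imported one. The sketch below uses only the Python standard library. The helper names are hypothetical, and port 2376 is an assumption borrowed from the `docker -H` command earlier in this thread. If the endpoint insists on a client certificate during the handshake, `ssl.get_server_certificate` may fail.

```python
import hashlib
import ssl


def fingerprint(pem_cert: str) -> str:
    """Return the SHA-256 fingerprint (hex) of a PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert.strip())
    return hashlib.sha256(der).hexdigest()


def certificate_changed(imported_pem: str, host: str, port: int = 2376) -> bool:
    """Compare the imported certificate with the one the host presents now.

    get_server_certificate does not validate the certificate here; it only
    retrieves it, which is what is needed for a comparison.
    """
    presented_pem = ssl.get_server_certificate((host, port))
    return fingerprint(imported_pem) != fingerprint(presented_pem)
```

For example, `certificate_changed(imported_pem, "10.197.37.180")` would return `True` in the situation described above, where `imported_pem` is the PEM text of the certificate stored in Admiral for that host.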

@mshipkovenski , not sure. Please check with @lazarin

What we know so far:

  • Host 10.197.37.180 was configured in Admiral on Friday, February 16, 2018 10:11:01.889 AM and the certificate was imported on Friday, February 16, 2018 10:11:01.538 AM (it seems during the same "Add host" operation).
  • Then on 2018-02-28T18:06:07.833Z the first SSLHandshakeException appears.
  • The imported certificate is different from the actual host certificate. Somehow it was changed on the host side.
  • The host is in OFF state and all its containers are marked in ERROR state (no data collection has passed since then). The issue is reproducible with each container that belongs to this host.

Based on this, I conclude this is an infrastructure issue.