aws-greengrass/aws-greengrass-nucleus

(ServerCertificateGenerator): lagged update of broker's certificate

Closed this issue · 1 comments

Describe the bug
When running a large number of tests, we sometimes observe cases where the MQTT client can't connect to the EMQX broker due to a TLS error.

It happens when Nucleus has just deployed EMQX, Auth (and in some cases IPDetector), and we have ensured that EMQX is fully started, the broker's certificate has been generated, and even the broker's connectivity information has been published to IoT Core.
The client performs discovery and obtains a valid list of IP addresses from IoT Core, but the TLS handshake fails because the broker's certificate is missing those IP addresses in its SubjectAlternativeName, which contains only "localhost".
It looks like ConnectivityUpdater publishes the real IP addresses, but ServerCertificateGenerator generates the cert only for "localhost".
After some time ServerCertificateGenerator generates a new certificate with all the IPs published by ConnectivityUpdater plus "localhost", but sometimes this takes more than 65 seconds.

A similar situation happens when IPDetector is used with includeIPv4LoopbackAddrs and includeIPv4LinkLocalAddrs: in addition to the scenario above, we also observe further IP address updates, and ServerCertificateGenerator lags behind ConnectivityUpdater.

It looks like there is a time gap between ConnectivityUpdater publishing the address information to IoT Core and ServerCertificateGenerator generating a certificate with these IPs for the broker.
During that gap, clients obtain correct connectivity information for the broker, but connections fail because the client can't verify that the broker is who it claims to be.
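
Until the gap is closed, a test harness could wait for the certificate to catch up before starting the MQTT client. The following is only a minimal sketch of such a wait loop, not something Nucleus provides: `wait_for_sans`, its parameters, and the 120-second default are hypothetical, and `fetch_sans` stands for whatever the harness uses to read the SAN entries of the broker's current certificate.

```python
import time


def wait_for_sans(fetch_sans, required, timeout_s=120.0, interval_s=2.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll until every address in `required` appears in the SAN set
    returned by `fetch_sans()`.

    Returns True once all addresses are covered, False on timeout.
    `clock` and `sleep` are injectable so the loop can be tested
    without real waiting. The default timeout is an arbitrary value
    chosen to exceed the 65 seconds observed in this report.
    """
    required = set(required)
    deadline = clock() + timeout_s
    while True:
        if required <= set(fetch_sans()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

This only hides the race from the tests; real clients that run discovery during the gap would still see the handshake failure.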

To Reproduce

  1. Deploy EMQX, Auth, and optionally the IPDetector component
  2. Right after the deployment, ensure EMQX is started and start an MQTT client
  3. The client discovers the EMQX addresses and tries to connect to the broker on those addresses

Expected behavior
The MQTT client should establish an MQTT over TLS connection with the broker on the addresses obtained from IoT Core by the discovery procedure.

Actual behavior
Sometimes the connection attempt fails due to TLS issues. Some MQTT libraries show only a generic error, but some give details like "certificate verify failed" or "No subject alternative names matching IP address 172.25.32.127 found".

Using IPDetector with the options includeIPv4LoopbackAddrs and includeIPv4LinkLocalAddrs set to true increases the probability of the issue.
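
When the library reports only a generic error, the mismatch can be confirmed by comparing the discovered addresses with the SAN entries of the certificate the broker actually serves. Below is a rough diagnostic sketch using Python's standard `ssl` module. The helper names are made up for illustration, `ca_file` is a hypothetical path to the CA certificate in PEM form, and a broker that requires client certificates would probably also need `ctx.load_cert_chain(...)` before the handshake succeeds.

```python
import socket
import ssl


def san_entries(cert):
    """Return the SAN values of a certificate dict as produced by
    SSLSocket.getpeercert(), e.g. {"localhost", "172.25.32.127"}."""
    return {value for _kind, value in cert.get("subjectAltName", ())}


def missing_addresses(cert, discovered):
    """Return the discovered addresses that the certificate does not
    cover (plain string comparison, sorted for stable output)."""
    return sorted(set(discovered) - san_entries(cert))


def fetch_broker_cert(host, port, ca_file):
    """Fetch the broker certificate with hostname checking disabled, so
    the SAN list can be inspected even when it does not match `host`.
    `ca_file` is a hypothetical path, not something from this report."""
    ctx = ssl.create_default_context(cafile=ca_file)
    ctx.check_hostname = False  # the chain is still verified against ca_file
    with socket.create_connection((host, port), timeout=5) as raw:
        with ctx.wrap_socket(raw) as tls:
            return tls.getpeercert()
```

For the failure described above, where the certificate lists only "localhost", `missing_addresses(cert, ["172.25.32.127"])` would return `["172.25.32.127"]`.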

Environment

  • OS: Ubuntu 22.04
  • JDK version: JDK 11
  • Nucleus version: 2.11.0
  • EMQX version: 1.2.3

Additional context
I guess it is related to the asynchronous nature of Nucleus services. Possibly, even if the same event triggers both ServerCertificateGenerator and ConnectivityUpdater, one of them gets stuck, for example on a busy executor. Another possible reason is that different events trigger these services and one is delayed relative to the other.

This issue does not belong here, Vitaly.

Also, as we discussed on Slack, everything is working as intended.