spiffe/java-spiffe

Long startup times and stale certs being served

evan2645 opened this issue · 12 comments

Ok so... I went on a bit of a wild goose chase over the last two days. Big thanks to @rturner3 for his help in investigating possible causes. I'm opening this issue at his request.

Background

I was setting up a test environment for SPIFFE/SPIRE. This involved a Java workload serving TLS. This workload was based on the grpc Java greeter app, with a small patch applied to do SPIFFE mTLS. See this fork for the diff: https://github.com/rturner3/grpc-java

The latest release of java-spiffe, 0.6.3, was used in this test.

The symptoms

The workload took a very long time to actually start the netty server. Minutes. Often tens of minutes. During this time, the app appeared hung, and the server was not listening on the configured port. Eventually, it would start listening. Log lines placed just prior to the netty server start and just after demonstrated that the code was blocked on starting the netty server.

After it started, it would be well-behaved for a short period of time, after which it would begin serving stale certs. This could be observed by calling the listening port with openssl s_client. As the SVIDs provided by the agent rotated, the Java server would occasionally begin working again, then start failing again with expired certs.

Observations

After decreasing the SVID TTL to 60 seconds, it was noticed that over the course of more than an hour, the Java server logged only two updates from the workload API. In contrast, the SPIRE Agent logged an update every 30 seconds! The agent logs reflected these updates being sent to the Java server, as matched by PID. It was verified that during the period that the Java server was serving expired certs, the SPIRE Agent was returning valid SVIDs.

@rturner3 was unable to replicate this behavior locally. He observed a very significant difference in the number of inodes open against epoll on his system vs the test system. The test system had only a couple references, whereas his local machine had a large number. This called the sanity of epoll into question as epoll is used for Java grpc, which in turn is (probably?) necessary to receive timely updates from the workload API.

Below is an example of Java server log output while SVID TTL was set to 60 seconds. Of note is the delta between the first message and the message stating "Server started":

Jan 16 00:38:34 acme-workload systemd[1]: Started Acme Co software.
Jan 16 00:38:34 acme-workload hello-world-server[16607]: Jan 16, 2021 12:38:34 AM io.grpc.examples.helloworld.HelloWorldServer start
Jan 16 00:38:34 acme-workload hello-world-server[16607]: INFO: Starting server...
Jan 16 01:02:18 acme-workload hello-world-server[16607]: Jan 16, 2021 1:02:18 AM io.spiffe.workloadapi.DefaultX509Source$1 onUpdate
Jan 16 01:02:18 acme-workload hello-world-server[16607]: INFO: Received X509Context update: spiffe://acme-co/server
Jan 16 01:02:19 acme-workload hello-world-server[16607]: Jan 16, 2021 1:02:19 AM io.grpc.examples.helloworld.HelloWorldServer start
Jan 16 01:02:19 acme-workload hello-world-server[16607]: INFO: Server started, listening on 8443
Jan 16 01:30:46 acme-workload hello-world-server[16607]: Jan 16, 2021 1:30:46 AM io.spiffe.workloadapi.DefaultX509Source$1 onUpdate
Jan 16 01:30:46 acme-workload hello-world-server[16607]: INFO: Received X509Context update: spiffe://acme-co/server

In the above case, it took the Java server nearly 24 minutes to log receipt of an SVID and "Server started". The "Starting server" log line is emitted immediately prior to the call which starts netty, and the "Server started" line is emitted immediately after.

In trying to isolate the problem, I built a Go server that does the same thing. While this server has yet to serve any stale certs, a significantly delayed startup time was observed: from the time the Go server was invoked to the time it served its first response, 7 seconds elapsed in one case and 18 seconds in another... both far slower than expected. The client polled at 2-second intervals. It is believed that Go grpc also relies on epoll.

Finally, it should be noted that this condition was observed when using the default Ubuntu image on AWS EC2. Nothing unusual was done to this instance; it is fairly vanilla.

What do we do?

I don't know. Being that the instance is the default EC2 Ubuntu image, I suspect that this should be easy to replicate.

One possibility is that this is a passing bug/issue with the current Ubuntu kernel on AWS, or a bad interaction between this particular kernel version and the Java libs, and it will be resolved in due time. In this case, maybe it is not worth any effort to track down.

Another possibility is that it's related to a specific configuration, a recent change in the AWS Ubuntu images or kernel, or something else that might be persistent, in which case we'll want to know what that is.

And finally, it could be something else entirely. One thing I don't know is if this can be avoided/mitigated in java-spiffe or not... or if it's related to java-spiffe at all. Perhaps the version of grpc that java-spiffe depends on? I have no idea.

I will leave it to the java-spiffe maintainers to decide on whether they'd like to chase this down or not. For now, I have switched to Go, which does display some unexplained behavior but hasn't yet served any expired certs under the exact same conditions and on the same system.

Replicating

I'm happy to share additional information, and even set up a machine on AWS for you if needed for replication. If you're interested in this, please let me know within the next week, before I destroy my patches and environment :)

Thanks for all your hard work maintaining java-spiffe! We love you :-D

Hmm this is interesting, I'll take a look, thanks a lot for bringing this up @evan2645 !!

ok, I was able to reproduce the issue and track down the origin of the problem: it seems to be related to the processing we do on the X.509-SVID when the client receives it. We verify that the private key matches the public key in the certificate by generating a challenge, signing it with one key, and verifying the signature with the other. This operation takes a very long time on AWS instances; I don't know why. I guess the kind of instances we are using for testing are not well suited to cryptographic operations.
I'm wondering if we need to do that verification in the library 🤔
I think go-spiffe does the same verification, and it might also be the cause of the startup delays you are seeing @evan2645
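For context, the check in question is along these lines (a simplified sketch, not the library's exact code; it assumes EC keys and the SHA256withECDSA algorithm):

```java
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.SecureRandom;
import java.security.Signature;

// Simplified sketch of a challenge-based key check: sign random bytes with the
// private key and verify the signature with the certificate's public key.
// Generating the challenge and producing the ECDSA signature both consume
// randomness from the JVM's SecureRandom.
static boolean privateKeyMatchesCertificate(PrivateKey privateKey, PublicKey certPublicKey)
        throws Exception {
    byte[] challenge = new byte[64];
    new SecureRandom().nextBytes(challenge);

    Signature signer = Signature.getInstance("SHA256withECDSA");
    signer.initSign(privateKey);
    signer.update(challenge);
    byte[] signature = signer.sign();

    Signature verifier = Signature.getInstance("SHA256withECDSA");
    verifier.initVerify(certPublicKey);
    verifier.update(challenge);
    return verifier.verify(signature);
}
```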

The long delay originates during the creation of the secure random number generator (RNG): Java is using /dev/random, which blocks when there is not enough entropy available (on AWS instances there is not much hardware entropy, a.k.a. environmental noise). Switching to /dev/urandom improves the startup time significantly. (Use -Djava.security.egd=file:/dev/urandom)

To see how much entropy is available: cat /proc/sys/kernel/random/entropy_avail (I saw that the lower that number was, the longer it took the Java app to start)
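A quick way to see this on the box itself is a throwaway probe like the following (not part of java-spiffe; Linux-only, assuming the SUN provider's NativePRNGBlocking/NativePRNGNonBlocking algorithms are available). NativePRNGBlocking reads /dev/random, NativePRNGNonBlocking reads /dev/urandom:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.SecureRandom;

// Linux-only probe: show available kernel entropy, then compare how long it
// takes to pull seed bytes from the non-blocking vs. the blocking PRNG.
public class EntropyProbe {
    public static void main(String[] args) throws Exception {
        String avail = new String(
                Files.readAllBytes(Paths.get("/proc/sys/kernel/random/entropy_avail"))).trim();
        System.out.println("entropy_avail: " + avail);

        for (String algo : new String[] {"NativePRNGNonBlocking", "NativePRNGBlocking"}) {
            SecureRandom rng = SecureRandom.getInstance(algo);
            long start = System.nanoTime();
            rng.generateSeed(16); // forces a read from the underlying device
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(algo + ": " + millis + " ms");
        }
    }
}
```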

Oh wow... nice work @maxlambrecht, thank you! Was it very hard to reproduce? I am left wondering why we haven't run into this before

The widely accepted stance on /dev/random vs /dev/urandom (as I understand it) is that /dev/urandom is a sufficient source of entropy for cryptographic operations, and it should be preferred over /dev/random in practically all cases. I am not sure why the Java default is /dev/random... some systems don't have /dev/urandom, though I think that's pretty uncommon. Examining the Go source code, it looks like the only platform on which Go falls back to /dev/random is Plan 9 :)

Is it possible for us to override this default in java-spiffe, or is that a bad idea?

I will try to apply this change to the system on which I originally witnessed the behavior and see if things clear up.

I guess we didn't see this behavior before because to reproduce it we need to start the Java app, stop it, and then start it again; that way the entropy pool gets drained, and during the second startup the app blocks until the entropy builds back up.

I think we should document these findings and recommend using /dev/urandom.

(There is an old Java bug, https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4705093, that was resolved by suggesting the workaround of using urandom via a JVM flag: -Djava.security.egd=file:/dev/urandom)

I guess we didn't see this behavior before because to reproduce it we need to start the Java app, stop it, and then start it again; that way the entropy pool gets drained, and during the second startup the app blocks until the entropy builds back up.

Aah yes, that makes sense... Since we do this comparison any time we receive a new X509-SVID, I presume it could also be triggered by "low" TTLs?

I think we should document these findings and recommend using /dev/urandom.

Agreed. Is it possible for us to specify in the java-spiffe library code that the call we make for entropy should use /dev/urandom by default? It would be great if we could avoid this problem altogether by patching java-spiffe

Aah yes, that makes sense... Since we do this comparison any time we receive a new X509-SVID, I presume it could also be triggered by "low" TTLs?

Yes, low TTLs should exacerbate this problem since the entropy on the system is not enough to satisfy the frequency of SVID updates.

Agreed. Is it possible for us to specify in the java-spiffe library code that the call we make for entropy should use /dev/urandom by default? It would be great if we could avoid this problem altogether by patching java-spiffe

Specifying the source of randomness is a JVM-level security setting that can't be controlled by application code (i.e. programmatically), so if we need to generate secure randomness in java-spiffe, the best we can probably do is to document this potential issue on virtualized platforms as @maxlambrecht suggested.

But I also think there is reason to question why secure random numbers need to be generated when consuming X.509 SVIDs over the Workload API. Is there really any value to generating a digital signature over random message bytes on receiving each refreshed SVID? I can't think of a good reason why the library should be doing this on behalf of the application, especially since signature creation/verification introduces non-trivial computational overhead. Removing that piece of code from java-spiffe would at least prevent this specific issue from occurring in the Workload API handler callback.
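For reference, the JVM-level knobs in question are the securerandom.source property in the java.security file and the java.security.egd system property override; a sketch (paths vary by JDK version, and the jar name below is just a placeholder):

```
# JVM-wide default, in <java-home>/conf/security/java.security
# (<java-home>/jre/lib/security/java.security on Java 8):
securerandom.source=file:/dev/urandom

# Per-invocation override; on some older JDKs the file:/dev/./urandom
# spelling is needed for the override to take effect:
java -Djava.security.egd=file:/dev/./urandom -jar hello-world-server.jar
```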

low TTLs should exacerbate this problem since the entropy on the system is not enough to satisfy the frequency of SVID updates.

And it's the cause of the other problem as well: serving stale certificates

Removing that piece of code from java-spiffe would at least prevent this specific issue from occurring in the Workload API handler callback.

I agree with removing that verification from the library if there's no way to verify the private key without generating a random number, the way go-spiffe does it by comparing the keys: https://github.com/spiffe/go-spiffe/blob/8e1adeaada756b9b0b201dab3745aebc7180347c/v2/svid/x509svid/svid.go#L220
When I implemented this verification in java-spiffe I couldn't figure out how to do it any other way.

I don't think we need that verification in the client libraries; I would remove it altogether.

I would like to know what @azdagron thinks about removing this private key verification from the client libraries.

I didn't design the check in go-spiffe as a mitigation for a security concern per se, but rather as an early detection that the SPIFFE control plane gave back something nonsensical. I'd be surprised if there is no way to compare the public key of the cert and the private key in Java, but I am not familiar with the libraries. I'm happy to help dig there.
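For what it's worth, the standard JCA interfaces only make a direct comparison easy for RSA, since RSAPrivateCrtKey carries the modulus and public exponent; ECPrivateKey does not expose the public point, and SPIRE-issued SVIDs typically use EC keys. A rough, RSA-only sketch (hypothetical helper, not existing library code) of the kind of comparison go-spiffe does might look like:

```java
import java.security.PrivateKey;
import java.security.cert.X509Certificate;
import java.security.interfaces.RSAPrivateCrtKey;
import java.security.interfaces.RSAPublicKey;

// Hypothetical helper: true when an RSA private key clearly matches the
// certificate's public key. EC private keys don't expose the public point
// through the standard JCA interfaces, so this approach does not cover them.
static boolean rsaKeyMatchesCertificate(X509Certificate cert, PrivateKey privateKey) {
    if (cert.getPublicKey() instanceof RSAPublicKey && privateKey instanceof RSAPrivateCrtKey) {
        RSAPublicKey pub = (RSAPublicKey) cert.getPublicKey();
        RSAPrivateCrtKey priv = (RSAPrivateCrtKey) privateKey;
        return pub.getModulus().equals(priv.getModulus())
                && pub.getPublicExponent().equals(priv.getPublicExponent());
    }
    return false; // unknown or EC key type: cannot decide without a signing round-trip
}
```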

ok, @rturner3 and I looked deeper into the key verification and haven't found a straightforward way to do it, because (at least for EC keys) the private key objects in Java don't expose any of the public key properties.

Since there is no real requirement for it, we decided to remove the verification that is currently in place using the random generator, which is what is causing these long delays and the serving of stale certificates.

Thank you for your time on this @maxlambrecht and @rturner3! Much appreciated ❤️