spiffe/java-spiffe

Cache JwtSvid

IvMdlc opened this issue · 6 comments

Hi, every call to fetchJwtSvid

    public JwtSvid fetchJwtSvid(String audience, String... extraAudiences) throws JwtSvidException {
        if (isClosed()) {
            throw new IllegalStateException("JWT SVID source is closed");
        }
        return workloadApiClient.fetchJwtSvid(audience, extraAudiences);
    }

results in a workload attestor request and ultimately a FetchJWTSVID method call on the agent. Workload attestation is expensive, and so probably it'd be good to cache the JwtSvid based on its expiry field.

Hi @IvMdlc,

the FetchJWTSVID method takes audiences as parameters, thus it needs to be called each time you need a JWT token with a different audience. That's why the SPIRE Workload API method rpc FetchJWTSVID doesn't return a stream response, unlike the rpc FetchX509SVID which does return a stream and thus we can have an X509Source that maintains the SVID in memory and keeps receiving updates from the SPIRE Agent, without more attestations.
Moreover, the default_jwt_svid_ttl is 5 minutes, so by default every (less than) 5 minutes a new call to the WorkloadAPI would still to be done for every JWT SVID with a different audience needed, involving a attestation on the SPIRE Agent side.
I think we would not gain much from adding a cache to the library.

@rturner3, any thoughts?

Hi @maxlambrecht thanks, we see that an agent (v1.6.1) with 10 workloads mapped onto it can easily hit 100% CPU usage on this kind of hardware

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
Stepping:              1
CPU MHz:               2349.023
CPU max MHz:           3400.0000
CPU min MHz:           1200.0000
BogoMIPS:              4788.74
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39

if workloads make 50 fetchJwtSvid calls per second, which is not many (for our use case, at least)

Edit: yes, fetchJwtSvid with same audience, that's why we consider the caching mechanism.

I agree that caching in-application can have some benefits when JWT-SVIDs are requested frequently since workload attestation can take over one second in some cases. That is a heavy cost to pay per-request.

It's worth mentioning that if you're using SPIRE as your SPIFFE implementation (probably the case for most users today), SPIRE Agent already internally caches JWT-SVIDs. Some users may not want to cache the same JWT-SVID in two processes on the same host, so I think if we had support for this in java-spiffe, we'd probably want to make it an optional feature. It could also result in a non-trivial amount of memory to be used by the library that may not be a tradeoff that some users want to make. The latter case is less likely given the size of JWT-SVIDs but still a consideration.

I could see this being doable by creating a new JwtSource implementation, e.g. CachedJwtSource. That way it's up to the user to explicitly decide to use the implementation of JwtSource that best fits their needs.

I see. I agree that based on that use case and those numbers it makes sense to offer some caching mechanism on the library, and also agree that it should optional.
I'll work on an implementation and send a PR.

I sent a PR with the implementation of a CachedJwtSource, but there's something I'm not sure about: for how long should the JWT SVIDs be cached? I made it so they are returned from the cache if they still have half their lifetime, following what SPIRE Agent does for the rotation of the X.509-SVIDs. Does it make sense? Should it be configurable? What would be a sensible default?

@rturner3
@IvMdlc

@maxlambrecht That's great, many thanks for implementing it!

Yes, as a user it makes sense to me. As an aside, in SPIRE the workload/agent/server’s SVID rotation occurs at 50% TTL and is not configurable - saying this from looking at the code, as it’s not in the docs nor in the tutorials, as far as I can tell?

Regarding the agent, it caches both X.509-SVID and JWT-SVID via the agent’s Manager, and is not configurable either:

cachedSVID, ok := m.cache.GetJWTSVID(spiffeID, audience)
	if ok && !rotationutil.JWTSVIDExpiresSoon(cachedSVID, now) {
		return cachedSVID, nil
	}

https://github.com/spiffe/spire/blob/main/pkg/agent/manager/manager.go#L65-L67
https://github.com/spiffe/spire/blob/main/pkg/agent/manager/manager.go#L234-L237

And the rotation logic with half TTL

func shouldRotate(now, beginTime, expiryTime time.Time) bool {
	ttl := expiryTime.Sub(now)
	lifetime := expiryTime.Sub(beginTime)
	return ttl <= lifetime/2
}

https://github.com/spiffe/spire/blob/main/pkg/common/rotationutil/rotationutil.go#L21-L43

For our use case, caching JWT-SVID and rotating at 50% TTL works fine.

Thanks.