lyft/metadataproxy

Add a gevent pool for refreshing STS assumed credentials

ryan-lane opened this issue · 7 comments

The metadata proxy knows when the IAM credentials it has cached are about to expire. We should add a gevent pool that runs periodically, checks whether any credentials need to be renewed, and renews them before they expire. The goal is to remove the STS assume call from the application's critical path, since sts:AssumeRole can be slow.
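
Roughly what I have in mind; a minimal sketch, assuming gevent monkey-patching is already in effect (so boto3's HTTP calls yield) and using an illustrative cache layout rather than metadataproxy's actual internals:

```python
import datetime

import boto3
import gevent
from gevent.pool import Pool

sts = boto3.client('sts')

# Illustrative cache: role ARN -> the full sts.assume_role() response.
CREDENTIAL_CACHE = {}

RENEW_THRESHOLD = datetime.timedelta(minutes=5)
CHECK_INTERVAL = 60  # seconds between scans
pool = Pool(10)      # bound the number of concurrent sts:AssumeRole calls


def renew(role_arn):
    # Renewing here, outside the request path, means containers never
    # wait on a slow sts:AssumeRole call.
    CREDENTIAL_CACHE[role_arn] = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName='metadataproxy-refresh',
    )


def refresher_loop():
    while True:
        now = datetime.datetime.now(datetime.timezone.utc)
        for role_arn, cached in list(CREDENTIAL_CACHE.items()):
            if cached['Credentials']['Expiration'] - now < RENEW_THRESHOLD:
                pool.spawn(renew, role_arn)
        gevent.sleep(CHECK_INTERVAL)


# Spawn once at startup; gevent runs it alongside the request handlers.
gevent.spawn(refresher_loop)
```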

This is a very nice project!

When creating our Amazon Federation Proxy, which provides IAM credentials to all servers in an on-premise data center (and people too), we noticed that most AWS SDKs assume that they get IAM credentials from the EC2 metadata service within 1 second.

To comply with this requirement we created afp-alppaca as a sidecar service, which is very similar to your metadataproxy. However, it implements pre-fetching and caching of the IAM credentials to

  1. guarantee a valid credential response within 1 second
  2. allow the backend server afp-core to be down; if the downtime is shorter than ~30 minutes, nobody is affected by it (a sketch of this fallback follows the list).
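
To illustrate the second point, a minimal sketch of the fallback behaviour, with a hypothetical fetch_from_backend() standing in for the call to afp-core (this is not afp-alppaca's actual code):

```python
import datetime

# Illustrative cache: role name -> credentials dict including a
# timezone-aware 'Expiration' datetime.
CACHE = {}


def fetch_from_backend(role_name):
    """Hypothetical call to the afp-core backend."""
    raise NotImplementedError


def get_credentials(role_name):
    try:
        CACHE[role_name] = fetch_from_backend(role_name)
    except Exception:
        # Backend is down: keep serving the cached credentials for as
        # long as they remain valid (the ~30 minute window above).
        pass
    cached = CACHE.get(role_name)
    now = datetime.datetime.now(datetime.timezone.utc)
    if cached and cached['Expiration'] > now:
        return cached
    raise RuntimeError('no valid credentials for %s' % role_name)
```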

Maybe you can copy some ideas or code from there to solve this issue.

The code does currently cache credentials, so once a role has been fetched via STS it'll be returned well within 1s. This issue describes the other part of what you mention: when credentials are about to expire, the proxy itself should renew them so that containers don't need to refetch.

Prefetching is... difficult, because you don't know which roles you'll need to fetch. If you do know the roles ahead of time, it's definitely possible as an end-user to prewarm the cache by running a few containers that just curl the IAM endpoints before starting any other containers.
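
For example, a prewarm script run inside such a container; metadataproxy resolves the role from the requesting container's IAM_ROLE, so nothing here names a role explicitly:

```python
import requests

# metadataproxy answers on the standard EC2 metadata address.
PROXY = 'http://169.254.169.254'

# Listing the security-credentials path returns this container's role
# name; fetching it triggers the STS assume and warms the cache, so
# later containers with the same role are served well within 1s.
role = requests.get(
    PROXY + '/latest/meta-data/iam/security-credentials/', timeout=10
).text.strip()
requests.get(
    PROXY + '/latest/meta-data/iam/security-credentials/' + role, timeout=10
)
```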

The 1-second limit will hit you exactly on the first request. That is also the request by which the SDK decides whether to use EC2 metadata service credentials or to try other sources of credentials. I am not sure the SDKs would make that check and decision more than once if the first attempt failed.

IMHO one can assume that the target role does not change at run time. In our use case the target role depends on the IP of a server (similar to your Docker IP lookup). In your case the IAM_ROLE environment variable should stay the same for as long as a Docker container runs.

I guess you could iterate over all the Docker containers and fetch the IAM credentials for them even before a container asks for credentials. When one does ask, you can reply from the cache. That approach would also let you skip the IP lookup on access, since you could rely on the data cached when you iterated over the containers.
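
Something like this, as a minimal sketch using docker-py; prefetch_role() is a hypothetical helper that does the STS assume and fills the cache:

```python
import docker

client = docker.from_env()


def prefetch_role(role):
    """Hypothetical helper: assume the role and cache the credentials."""
    ...


def prefetch_all_containers():
    for container in client.containers.list():
        # Env is a list of 'KEY=VALUE' strings on the container config.
        for entry in container.attrs['Config']['Env'] or []:
            key, _, value = entry.partition('=')
            if key == 'IAM_ROLE' and value:
                prefetch_role(value)
```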

swipely/iam-docker follows the Docker event stream to observe container creation so it can fetch credentials before the first credentials request arrives; see docker/event_handler.go. The credential store additionally retains credentials and refreshes prior to expiration.
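
In Python the same idea might look like this (iam-docker itself is Go); a minimal sketch where prefetch_container() is a hypothetical helper that looks up the container's IAM_ROLE and warms the cache:

```python
import docker

client = docker.from_env()


def prefetch_container(container_id):
    """Hypothetical helper: resolve IAM_ROLE and warm the cache."""
    ...


def watch_events():
    # Each 'start' event arrives before the workload's first
    # credentials request, leaving time to do the STS assume.
    filters = {'type': 'container', 'event': 'start'}
    for event in client.events(decode=True, filters=filters):
        prefetch_container(event['id'])
```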

Ah. Interesting. Nice. I'll have to apply that.

I've been running into this issue lately: CI jobs failing about 20% of the time because the initial credential request takes longer than 1 second. I can add an initial request to prewarm the cache, but it'd be nice if metadataproxy handled that on its own.

Thank you for your contribution to this repository.

Closing this contribution as this repository is being archived.