lyft/metadataproxy

IP Address caching and docker re-using IP addresses may cause inconsistent credentials being returned

justinwatkinson opened this issue · 5 comments

Good evening!

Just wanted to note an issue I was observing earlier today. I'm still doing more testing, but I'll try to detail what I've seen in case others have seen or a fix is on the way.

I am currently trying to use this metadataproxy with the AWS EC2 Container Service. The EC2 environment is bootstrapped with the metadataproxy (v 1.1) and then the ECS agent, which I don't think is material, but full disclosure.

I started with using Docker 1.9, but after reviewing the code, it looks like the metadataproxy is dependent upon the Docker Networking IP Address as part of the docker inspect information. On Docker 1.9.1, that value is often empty, which was giving us some strange behavior.

This afternoon, I upgraded our EC2 instances to Docker 1.11.1. We started to see the IP Address be populated in the Network Configuration section, and we started to see the STS transactions begin to succeed. We were simply starting containers from the command line at this point, a simple alpine container, and passing in the IAM_ROLE environment parameter as the ARN. Once the alpine container started, I'd apk upgrade && apk add curl to install curl, and I would test with "curl 169.254.169.254/latest/meta-data/iam/info".

So for the primary issue. What we began to notice is that with the default Docker networking setup, the IP block was in the 172.16.x.y range, and the gateway was 172.16.0.1. The usable IPs started at .2. When I started the first container A, I started it with Role A, and I started a second container (B) with Role B, and everything worked great. No problems so far.

When we noticed a problem, was when we killed container A (with Role A), and started Container C, using Role B. Docker assigned this container the same IP address (172.16.0.2) as the previous container, and I believe the caching aspect of metadataproxy may be hanging on to the former IAM value.

When testing this, I would repeat the curl 169.254.169.254/latest/meta-data/iam/info repeatedly, but I'd get different instance profiles randomly. I didn't seem to notice a pattern. But if I killed and restarted the metadataproxy container and started over things seem to clear up.

So curious if you guys have run into it. While working with a coworker on this, I think we can maybe add an extra check to see if the IAM role for the container by IP has changed and if so, remove it from cache and start anew, but that's about as far as we got before the end of the day.

Thank you!

Taking a deeper look, I believe the issue I'm seeing is right around her: https://github.com/lyft/metadataproxy/blob/master/metadataproxy/roles.py#L92

if ip in CONTAINER_MAPPING:
        log.info('Container id for IP {0} in cache'.format(ip))
        try:
            with PrintingBlockTimer('Container inspect'):
                container = client.inspect_container(CONTAINER_MAPPING[ip])
            return container
        except docker.errors.NotFound:
            msg = 'Container id {0} no longer mapped to {1}'
            log.error(msg.format(CONTAINER_MAPPING[ip], ip))
            del CONTAINER_MAPPING[ip]

On this line, when it does the inspect_container, the container technically still exists (it's just stopped), so the except case won't get invoked if I'm correct. Probably just need to add a couple of lines to sanity check that the IP address is still assigned to that container upon docker inspect, and that the container is still running. If neither case is true, just let it continue with the remainder of the workflow and update the caches accordingly.

p.s. - looking forward to another pair of eyes on this. I am still pretty new to this one and could be completely wrong

I was just looking at the same lines of code and thinking "I wonder if the container still exists...". It looks like we should be checking the container's state, rather than just relying on it existing or not existing. Good find. I'll try to get a fix in for this soon!

If you'd like to fix this before I get to it, send in a PR. I'd be happy to merge. We do have a CLA that needs to be accepted first though: https://oss.lyft.com/cla

If nobody beats me to it, when I get back from my long weekend, I'll take a stab at it early next week. Thank you!