paws-r/paws

ECS container fails to retrieve secrets from AWS Secrets Manager due to hop limit

Closed this issue · 25 comments

I am running an R script in a Docker container on an ECS instance, and I am using the paws.security.identity package to retrieve secrets from AWS Secrets Manager. However, the container fails to retrieve the secrets, and the connection seems to hang indefinitely.

After investigating, I discovered that the hop limit setting on the ECS instance was causing requests to the instance metadata endpoint to fail. Specifically, the hop limit was set to 1, which prevented the R script from retrieving secrets from Secrets Manager.

To resolve the issue, I increased the hop limit from 1 to 2, and the R script was then able to retrieve secrets from Secrets Manager. It appears that the issue was related to network connectivity and routing, and increasing the hop limit resolved it.

I am submitting this issue (which can be closed immediately) in the hope the solution helps someone else.

Steps to Reproduce:

1. Run an R script in a Docker container on an ECS instance.
2. Attempt to retrieve secrets from AWS Secrets Manager using the paws.security.identity package.
3. Observe that the connection appears to hang indefinitely, and the container fails to retrieve secrets from Secrets Manager.

Expected Result:
The R script should be able to retrieve secrets from AWS Secrets Manager using the paws.security.identity package.

Actual Result:
The R script fails to retrieve secrets from Secrets Manager, and the connection appears to hang indefinitely.

Solution:
Increase the hop limit setting on the ECS instance to 2. This allows requests to the metadata endpoint to succeed, and should resolve the issue with retrieving secrets from AWS Secrets Manager in a Docker container running on an ECS instance.

I am using Terraform to provision these resources, and I made the following change to my Terraform configuration:

resource "aws_launch_template" "asg_launch_template" {
  <snip>
  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"
    http_put_response_hop_limit = 2
    instance_metadata_tags      = "enabled"
  }
}
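
For an instance that is already running, the same metadata options can presumably also be changed with the AWS CLI instead of the launch template. The following is only a sketch under assumptions: the function name set_hop_limit and the instance ID are placeholders, the AWS CLI is assumed to be installed and configured, and the function only prints the command unless RUN=1 is set.

```shell
# Sketch: raise the IMDSv2 PUT response hop limit on an existing instance.
# set_hop_limit is a hypothetical helper; "$1" is the EC2 instance ID.
set_hop_limit() {
  # Build the AWS CLI call as the function's positional parameters.
  set -- aws ec2 modify-instance-metadata-options \
    --instance-id "$1" \
    --http-endpoint enabled \
    --http-tokens required \
    --http-put-response-hop-limit 2
  if [ "${RUN:-0}" = "1" ]; then
    "$@"        # really call the AWS CLI
  else
    echo "$*"   # dry run: print the command only
  fi
}

# Placeholder instance ID; replace with the ECS container instance.
set_hop_limit i-0123456789abcdef0
```

Changing the launch template only affects newly launched instances, so a one-off command like this may be useful while testing whether the hop limit is the cause.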

I'm not familiar with the hop limit, so I will have to read up on it. Do you believe there is an issue with the IP protocol in ECS tasks?

Possible similar issue with older version of botocore: boto/botocore#1892

Related issue: boto/botocore#1897

I believe it was a result of moving to IMDSv2 here: #552

You can read a bit about the hop limit here: https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/instance-metadata-v2-how-it-works.html. Specifically the last paragraph on the page.

I don't think any changes need to be made to paws, or any action taken. I raised this issue in the hope that someone else taking the same journey I did would find it and get a quicker resolution. I chased a red herring, thinking I needed to set up a Secrets Manager endpoint. :)

We have also encountered this issue with paws, and with curl under the hood: the request would just hang with no response, without respecting the configured timeout. We spent a lot of time digging in to understand what was wrong.

For curl you can get around it by setting --max-time, but without it, even a timeout set with --connect-timeout is not respected. I guess the issue is that it is not a connection timeout: the request gets an intermediate response and is just waiting for the final response, which never arrives because of the PUT response hop limit.

As mentioned above, this is related to IMDSv2. paws tries IMDSv2 first to get EC2 metadata credentials and falls back to IMDSv1 if that doesn't work. However, the issue here is that paws gets stuck on IMDSv2 because of the above and is unable to fall back. The 1-second timeout set by paws is not respected. The AWS CLI does not have the same issue; it cuts the connection if there is no response within x seconds.

So a possible improvement to paws would be to add --max-time, or some equivalent property, to the calls to the EC2 metadata endpoints so that it correctly falls back to IMDSv1. I checked the httr package and found no immediate candidate for such a property, even though curl supports --max-time.
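
The fallback behaviour described above can be sketched with plain curl. This is only an illustration, not how paws implements it: the function name imds_get and the IMDS_ENDPOINT override are hypothetical, and --max-time 1 simply mirrors the 1-second timeout mentioned above.

```shell
# Hypothetical helper: fetch a metadata path, trying IMDSv2 first and
# falling back to IMDSv1 if no token arrives in time.
# IMDS_ENDPOINT is an assumed override, useful only for testing.
IMDS_ENDPOINT="${IMDS_ENDPOINT:-http://169.254.169.254}"

imds_get() {
  path="$1"
  # --max-time bounds the whole transfer, so a PUT whose response never
  # arrives fails after 1 second instead of hanging. --connect-timeout
  # alone does not help, because the connection itself succeeds.
  # --noproxy keeps the link-local request away from any configured proxy.
  if token=$(curl -sf --noproxy '*' -X PUT \
        --connect-timeout 1 --max-time 1 \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 21600" \
        "$IMDS_ENDPOINT/latest/api/token"); then
    # IMDSv2: send the session token with the metadata request.
    curl -sf --noproxy '*' --max-time 1 \
      -H "X-aws-ec2-metadata-token: $token" "$IMDS_ENDPOINT$path"
  else
    echo "IMDSv2 token request failed, falling back to IMDSv1" >&2
    # IMDSv1: plain GET without a token.
    curl -sf --noproxy '*' --max-time 1 "$IMDS_ENDPOINT$path"
  fi
}
```

On an instance, a call such as imds_get /latest/meta-data/iam/security-credentials would then either succeed or fail within a couple of seconds rather than hang. Note that the IMDSv1 fallback can only succeed if the instance does not set http_tokens to "required".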

In any case, @stuart-storypark's solution works: set the metadata options on the EC2 instances.

@joakibo thanks for the extra insight. This might be something we can fix. I don't have an environment to fully test this out yet. If anyone is happy to work with me to test this, I would be more than grateful. Or even provide their Dockerfile so I can test it in the paws AWS account.

Correct me if I am wrong, but is paws stuck at this part, when it attempts to retrieve instance metadata?

https://github.com/paws-r/paws/blob/main/paws.common/R/config.R#L173-L193

@DyfanJones Correct, and as mentioned, using curl in the terminal yielded the same results:

curl -X PUT --connect-timeout 1 http://169.254.169.254/latest/api/token and it would just hang.
curl -X PUT --connect-timeout 1 --max-time 5 http://169.254.169.254/latest/api/token and it would quit after five seconds as unsuccessful.

Seems like paws gets stuck in the first scenario there. The issue is also explained at https://stackoverflow.com/questions/62324816/ecs-container-hangs-when-calling-ssm-api-endpoint/62326320#62326320 with the same resolution as above. When I tested paws interactively, I hit Ctrl+C and it then continued, fell back to IMDSv1, and eventually worked. But in a context where no one can hit Ctrl+C, it would hang indefinitely.

AWS CLI worked fine in the same context.

So we just need some way of getting it to respect the timeout when running the call at https://github.com/paws-r/paws/blob/main/paws.common/R/config.R#L188C7-L188C36

An ECS container on EC2 is possibly sufficient to reproduce the issue, for example with rocker:r-ver or something like that.

After a little more digging, it looks like there are several different curl timeout options:

library(httr)
resp1 = VERB("GET", "http://httpbin.org", config(timeout_ms = 1000))
#> Error in curl::curl_fetch_memory(url, handle = handle): Timeout was reached: [httpbin.org] Operation timed out after 1002 milliseconds with 0 bytes received
resp2 = VERB("GET", "http://httpbin.org", config(timeout = 1))
#> Error in curl::curl_fetch_memory(url, handle = handle): Timeout was reached: [httpbin.org] Operation timed out after 1001 milliseconds with 0 bytes received
resp3 = VERB("GET", "http://httpbin.org", config(connecttimeout = 1))

Created on 2023-06-26 with reprex v2.0.2

We are currently using connecttimeout, similar to the above. I wonder if we could resolve this by adding the standard timeout curl option when calling IMDSv2.

Thanks for the code samples. I'll test on my side, since I have access to the problematic setup, and come back to you on whether any of those work.

@joakibo I created a branch to try to address this. It allows connect_timeout and timeout to be used. Are you able to test this? I would be super grateful if you can.

remotes::install_github("DyfanJones/paws", ref="timeout")

Results: of those three, only timeout_ms actually stopped the request after one second; I had to Ctrl+C the others.

ah thanks for this. I can switch from timeout to timeout_ms. I will update the branch accordingly.

Hold on a second, I may have been too quick there. It seems that both timeout and timeout_ms work (which makes sense, since they should be the same thing on different scales), while connecttimeout certainly does not work.

I put timeout = 1000 there, so makes sense that it didn't kill it. Setting it to 1 and it works.

[Screenshot] Blinking cursor of death.

So given this, connecttimeout doesn't kick in because the connection is established successfully; the PUT request is then supposed to "hop" further, and hence it just sits there awaiting a response.

timeout seems to be a more general timeout on receiving the complete response, the same as --max-time for curl.

ah I will revert the last change :P hahaha my bad for being eager :P

Yes you are right, from looking at the libcurl documentation timeout is this: https://curl.se/libcurl/c/CURLOPT_TIMEOUT.html

CURLOPT_TIMEOUT - maximum time the transfer is allowed to complete

Good stuff 👍

@joakibo if possible are you able to test the branch?

remotes::install_github("DyfanJones/paws", ref="timeout")

I would be super grateful if you are able to :)

Mini test script:

remotes::install_github("DyfanJones/paws", ref="timeout")
s3 <- paws::s3()
s3$list_buckets()

Assuming ECS has S3 list buckets permissions :) If this hangs, please let me know :D

Tried now but got this

> remotes::install_github("DyfanJones/paws", ref="timeout")
Error: Failed to install 'unknown package' from GitHub:
  HTTP error 404.
  Not Found

  Did you spell the repo owner (`DyfanJones`) and repo name (`paws`) correctly?
  - If spelling is correct, check that you have the required permissions to access the repo.

It's not public possibly?

EDIT: I tried to set up a PAT, but that doesn't seem to work. The problem is that this runs in a context where I don't have SSH etc. set up. But access is the issue anyway. I will see if I can find a way.

Ah, my bad, I didn't add the package directory. Try this instead 😛

remotes::install_github("DyfanJones/paws/paws.common", ref="timeout")

@DyfanJones Seems like that worked. I tested this first, but it hung:
httr::VERB("PUT", "http://169.254.169.254/latest/api/token", httr::config(connecttimeout = 1))

Then, after running remotes::install_github, I tested paws.security.identity::sts()$get_caller_identity(), which worked. Here it is run with options(paws.log_level = 3):

> paws.security.identity::sts()$get_caller_identity()
INFO [2023-06-26 18:37:23.572]: Unable to locate credentials file
INFO [2023-06-26 18:37:23.580]: Unable to get credentials from config file.
INFO [2023-06-26 18:37:23.584]: Unable to obtain access_key_id, secret_access_key or session_token
INFO [2023-06-26 18:37:23.593]: -> PUT /latest/api/token HTTP/1.1
-> Host: 169.254.169.254
-> User-Agent: libcurl/7.68.0 r-curl/5.0.1 httr/1.4.6
-> Accept-Encoding: deflate, gzip, br
-> Accept: application/json, text/xml, application/xml, */*
-> X-aws-ec2-metadata-token-ttl-seconds: 21600
-> Content-Length: 0
->
INFO [2023-06-26 18:37:24.597]: -> GET /latest/meta-data/iam/security-credentials HTTP/1.1
-> Host: 169.254.169.254
-> User-Agent: libcurl/7.68.0 r-curl/5.0.1 httr/1.4.6
-> Accept-Encoding: deflate, gzip, br
-> Accept: application/json, text/xml, application/xml, */*
->
INFO [2023-06-26 18:37:24.598]: <- HTTP/1.1 200 OK
INFO [2023-06-26 18:37:24.598]: <- Content-Type: text/plain
INFO [2023-06-26 18:37:24.598]: <- Accept-Ranges: none
INFO [2023-06-26 18:37:24.598]: <- Last-Modified: Mon, 26 Jun 2023 17:50:37 GMT
INFO [2023-06-26 18:37:24.598]: <- Content-Length: 64
INFO [2023-06-26 18:37:24.598]: <- Date: Mon, 26 Jun 2023 18:37:24 GMT
INFO [2023-06-26 18:37:24.599]: <- Server: EC2ws
INFO [2023-06-26 18:37:24.599]: <- Connection: close
INFO [2023-06-26 18:37:24.599]: <-
INFO [2023-06-26 18:37:24.613]: -> PUT /latest/api/token HTTP/1.1
-> Host: 169.254.169.254
-> User-Agent: libcurl/7.68.0 r-curl/5.0.1 httr/1.4.6
-> Accept-Encoding: deflate, gzip, br
-> Accept: application/json, text/xml, application/xml, */*
-> X-aws-ec2-metadata-token-ttl-seconds: 21600
-> Content-Length: 0
->
INFO [2023-06-26 18:37:25.618]: -> GET /latest/meta-data/iam/security-credentials/[masked] HTTP/1.1
-> Host: 169.254.169.254
-> User-Agent: libcurl/7.68.0 r-curl/5.0.1 httr/1.4.6
-> Accept-Encoding: deflate, gzip, br
-> Accept: application/json, text/xml, application/xml, */*
->
INFO [2023-06-26 18:37:25.619]: <- HTTP/1.1 200 OK
INFO [2023-06-26 18:37:25.619]: <- Content-Type: text/plain
INFO [2023-06-26 18:37:25.619]: <- Accept-Ranges: none
INFO [2023-06-26 18:37:25.620]: <- Last-Modified: Mon, 26 Jun 2023 17:50:37 GMT
INFO [2023-06-26 18:37:25.620]: <- Content-Length: 1582
INFO [2023-06-26 18:37:25.620]: <- Date: Mon, 26 Jun 2023 18:37:25 GMT
INFO [2023-06-26 18:37:25.620]: <- Server: EC2ws
INFO [2023-06-26 18:37:25.620]: <- Connection: close
INFO [2023-06-26 18:37:25.620]: <-
INFO [2023-06-26 18:37:25.873]: -> POST / HTTP/1.1
.
.
.

This is great news, it is working :) I will merge the PR for the next paws.common release.

Thanks for all the testing and the extra insight. This issue wouldn't have been resolved without your help.

Sure, no problems, happy to assist 👍