grzm/awyeah-api

aws calls hang when running in CI

Closed this issue · 9 comments

I can use this lib for all operations on my local machine. Thanks for creating it.

When running in CI (Bitbucket Pipelines, Docker), all calls block and never return, eventually hitting the CI timeout.

I first discovered this when calling STS, but then switched to S3 :ListBuckets: same behaviour.

Is there some kind of logging or other diagnostic tool I can use to debug this?

Here's the log from the CI step that installs bb with deps, in case it helps...

  • ./scripts/ensure-bb.sh 0.8.156
    missing. Installing...
    /opt/atlassian/pipelines/agent/build
    ./bb.tar.gz: 70.7% -- replaced with ./bb.tar
    bb
    Could not find /root/.deps.clj/1.11.1.1113/ClojureTools/clojure-tools-1.11.1.1113.jar
    Downloading tools jar from https://download.clojure.org/install/clojure-tools-1.11.1.1113.zip to /root/.deps.clj/1.11.1.1113/ClojureTools
    Cloning: https://github.com/grzm/awyeah-api
    Checking out: https://github.com/grzm/awyeah-api at a3ce8c5
    Cloning: https://github.com/babashka/spec.alpha
    Checking out: https://github.com/babashka/spec.alpha at 433b0778e2c32f4bb5d0b48e5a33520bee28b906
    Downloading: com/cognitect/aws/sts/822.2.1145.0/sts-822.2.1145.0.pom from central
    Downloading: com/cognitect/aws/endpoints/1.1.12.230/endpoints-1.1.12.230.pom from central
    Downloading: com/cognitect/aws/lambda/822.2.1145.0/lambda-822.2.1145.0.pom from central
    Downloading: org/clojure/clojure/1.11.1/clojure-1.11.1.pom from central
    Downloading: com/cognitect/aws/s3/822.2.1145.0/s3-822.2.1145.0.pom from central
    Downloading: com/cognitect/aws/cloudfront/822.2.1145.0/cloudfront-822.2.1145.0.pom from central
    Downloading: org/clojure/spec.alpha/0.3.218/spec.alpha-0.3.218.pom from central
    Downloading: org/clojure/core.specs.alpha/0.2.62/core.specs.alpha-0.2.62.pom from central
    Downloading: org/clojure/pom.contrib/1.1.0/pom.contrib-1.1.0.pom from central
    Downloading: org/babashka/cli/0.2.22/cli-0.2.22.pom from clojars
    Downloading: funcool/promesa/8.0.450/promesa-8.0.450.pom from clojars
    Downloading: com/cognitect/aws/endpoints/1.1.12.230/endpoints-1.1.12.230.jar from central
    Downloading: com/cognitect/aws/lambda/822.2.1145.0/lambda-822.2.1145.0.jar from central
    Downloading: org/clojure/core.specs.alpha/0.2.62/core.specs.alpha-0.2.62.jar from central
    Downloading: org/clojure/spec.alpha/0.3.218/spec.alpha-0.3.218.jar from central
    Downloading: org/clojure/clojure/1.11.1/clojure-1.11.1.jar from central
    Downloading: com/cognitect/aws/sts/822.2.1145.0/sts-822.2.1145.0.jar from central
    Downloading: org/babashka/cli/0.2.22/cli-0.2.22.jar from clojars
    Downloading: com/cognitect/aws/s3/822.2.1145.0/s3-822.2.1145.0.jar from central
    Downloading: funcool/promesa/8.0.450/promesa-8.0.450.jar from clojars
    Downloading: com/cognitect/aws/cloudfront/822.2.1145.0/cloudfront-822.2.1145.0.jar from central
    babashka v0.8.156

bb is operating normally, i.e. steps prior to the AWS calls are OK, e.g. using org.babashka/cli to parse args.

My next step would be to reproduce this in a local Docker container, but then I'll need some way to dig into the blocking AWS calls.
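One idea for a first diagnostic (a rough sketch; the region and op are placeholders from my setup): run the suspect call in a future and deref with a timeout, so the step fails fast and reports which call is stuck instead of waiting for the pipeline timeout.

(require '[com.grzm.awyeah.client.api :as aws])

;; Placeholder client; the real script builds this from env/config.
(def s3 (aws/client {:api :s3 :region "us-east-1"}))

;; Run the suspect call on another thread and give it 15s to answer.
(let [call   (future (aws/invoke s3 {:op :ListBuckets}))
      result (deref call 15000 ::timed-out)]
  (when (= ::timed-out result)
    (future-cancel call)
    (println "ListBuckets still blocked after 15s"))
  result)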

One thing worth noting is that I'm using newer AWS dependency versions than you list in the docs. This is because the STS lib didn't offer a version matching the ones listed, so I upgraded them all to 822.2.1145.0.

More info: in AWS I can see that the credentials have never been used, i.e. the call did not complete the authentication phase. I guess this means the block is in the credentials chain provider?

Tried the latest commit/SHA. Same block.

Tried removing the creds from AWS, i.e. keys not active. Same block.

Tried removing the key env vars from the environment. Got the following logs...
testing aws..
2022-06-25T00:45:22.787Z cef7caac-cc7e-433b-9ee4-eadd58adbf33-mlgfs INFO [com.grzm.awyeah.credentials:?] - Unable to fetch credentials from environment variables.
2022-06-25T00:45:22.790Z cef7caac-cc7e-433b-9ee4-eadd58adbf33-mlgfs INFO [com.grzm.awyeah.credentials:?] - Unable to fetch credentials from system properties.
...but then the same block.

grzm commented

There was an issue where exceptions from the http-client weren't being properly handled. I could trigger the hanging behavior you describe, independent of any container, if the default credentials provider falls through to the instance-profile-credentials-provider.

(defn default-credentials-provider
  "Returns a chain-credentials-provider with (in order):

     environment-credentials-provider
     system-property-credentials-provider
     profile-credentials-provider
     container-credentials-provider
     instance-profile-credentials-provider

  Alpha. Subject to change."
  [http-client]
  (chain-credentials-provider
   [(environment-credentials-provider)
    (system-property-credentials-provider)
    (profile-credentials-provider)
    (container-credentials-provider http-client)
    (instance-profile-credentials-provider http-client)]))

If you don't have an EC2 metadata host available (e.g., you're not running in EC2), the http-client would throw, the exception wouldn't be properly handled, and the call would hang.
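To illustrate the failure mode (a sketch of the bug class, not the library's actual code): when the side that's supposed to deliver a response throws and the exception is swallowed without anything being delivered, whoever is waiting on that response blocks forever.

(let [response (promise)]
  (future
    (try
      ;; The request that should deliver a response throws instead ...
      (throw (ex-info "connection to 169.254.169.254 refused" {}))
      (catch Exception _e
        ;; ... and the bug: the exception is swallowed, nothing is delivered.
        )))
  ;; So the caller blocks: a bare @response never returns; with a timeout we
  ;; at least see it give up.
  (deref response 5000 ::timed-out))
;; => ::timed-out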

I suspect this is what's happening in your case as well. The fix in 0fa7dd5 resolved the issue I was seeing, and I think it'll likely fix what you're seeing too, or at the very least it should no longer hang.

If you're continuing to see an issue, please provide a small, isolated, reproducible test case that exhibits the error you're seeing. If you're only seeing the behavior in a container, a minimal Dockerfile would be really helpful as well. FWIW, I tried to fetch https://bitbucket.org/nextdoc/nd-client-editor/src/954eaf5a57658eb5a61573894c7949b32cafdb20/scripts/ensure-bb.sh but got "Repository Not Found."

If you're having issues with credentials specifically, you can supply one of the credentials providers directly rather than relying on the default, or you can create your own credentials provider.
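For example, something like this (a sketch; it assumes the awyeah credentials namespace mirrors cognitect.aws.credentials, which it ports, and the env var names are just the standard AWS ones):

(require '[com.grzm.awyeah.client.api :as aws]
         '[com.grzm.awyeah.credentials :as credentials])

;; Bypass the default chain entirely by handing the client an explicit provider.
(def s3
  (aws/client {:api :s3
               :region "us-east-1"
               :credentials-provider
               (credentials/basic-credentials-provider
                {:access-key-id     (System/getenv "AWS_ACCESS_KEY_ID")
                 :secret-access-key (System/getenv "AWS_SECRET_ACCESS_KEY")})}))

(aws/invoke s3 {:op :ListBuckets})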

Hopefully this helps. Let me know if it doesn't.

grzm commented

I'll close this issue. If you do have further problems, just open another one with a reproducible test case. Thanks!

Thanks. I'll try it out when I next work on CI and will confirm the fix back here.

Finally got back to this and can confirm that assume-role now works by creating a reified CredentialsProvider.
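Roughly like this, for anyone who finds this later (a sketch from my setup; the role ARN and session name are placeholders, and it assumes the protocol and credential-map keys mirror cognitect.aws.credentials):

(require '[com.grzm.awyeah.client.api :as aws]
         '[com.grzm.awyeah.credentials :as credentials])

;; The STS client itself uses the base credentials from the environment.
(def sts (aws/client {:api :sts :region "us-east-1"}))

(def assume-role-provider
  (reify credentials/CredentialsProvider
    (fetch [_]
      (let [{:keys [Credentials]}
            (aws/invoke sts {:op :AssumeRole
                             :request {:RoleArn "arn:aws:iam::123456789012:role/ci-deploy"
                                       :RoleSessionName "ci"}})]
        ;; Return the temporary credentials in the shape the client expects.
        {:aws/access-key-id     (:AccessKeyId Credentials)
         :aws/secret-access-key (:SecretAccessKey Credentials)
         :aws/session-token     (:SessionToken Credentials)}))))

(def s3 (aws/client {:api :s3
                     :region "us-east-1"
                     :credentials-provider assume-role-provider}))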

Thanks for the fix.