uswitch/kiam

kiam-server broken in v4.1

mestuddtc opened this issue · 5 comments

I had a working configuration using v4.0, but v4.1 is broken. I am running outside of AWS, so autodetecting roles is not an option.

These are the server parameters, working in v4.0.

      containers:
      - args:
        - server
        - --json-log
        - --level=debug
        - --bind=0.0.0.0:443
        - --cert=/etc/kiam/tls/server.pem
        - --key=/etc/kiam/tls/server-key.pem
        - --ca=/etc/kiam/tls/ca.pem
        - --role-base-arn=arn:aws:iam::682359534587:role/
        - --sync=1m
        - --prometheus-listen-addr=0.0.0.0:9620
        - --prometheus-sync-interval=5s

With v4.1 or master, the server logs:

{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2021-07-21T19:39:51Z"}
{"level":"info","msg":"starting server","time":"2021-07-21T19:39:51Z"}
{"level":"fatal","msg":"error using AWS STS Gateway: role can't be empty","time":"2021-07-21T19:39:51Z"}

Took a look at this one. This change is responsible: 8103483

I'm unsure if this is a bug. Before the change above, it was assumed that server pods were running on a node with the appropriate IAM permissions, and the STS gateway config simply would not provide any credentials. Now it appears --assume-role-arn must be provided and set to a role the node can assume, and that role must also have the necessary permissions for everything to work.

If this is the intended behavior, the docs should be updated to reflect that, and the helm chart should be updated to require a value for assumeRoleArn, because just providing a role base ARN (via either --role-base-arn or --role-base-arn-autodetect) is no longer enough.
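Concretely, that would mean the server args on v4.1+ end up looking something like this (a minimal sketch; the role name in --assume-role-arn is a placeholder, and it must be a role the node is allowed to assume):

      containers:
      - args:
        - server
        # existing logging / TLS / prometheus flags unchanged
        - --role-base-arn=arn:aws:iam::682359534587:role/
        # apparently required now: a role the node itself can assume (placeholder name)
        - --assume-role-arn=arn:aws:iam::682359534587:role/kiam-server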

I would consider it a bug for two reasons:

  1. it is an incompatible, breaking change in a minor update. If it was intended, the version should have been 5.0, should it not?
  2. the credentials are being provided to kiam-server via IRSA annotations (previously, I think, via an access key and secret), which explicitly grant it the permissions it needs to do its work (sketch below). It doesn't make sense to me that I would have to add another layer of IAM roles and permissions.
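For context, the IRSA side of this looks roughly like the following on our end; the names and ARN here are placeholders, but the point is that the service account kiam-server runs under is already annotated with a role carrying the permissions kiam needs:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: kiam-server            # placeholder name
      namespace: kube-system
      annotations:
        # IRSA: the pod identity webhook injects credentials for this role into the pod
        eks.amazonaws.com/role-arn: arn:aws:iam::682359534587:role/kiam-server-irsa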

I agree with you on both points. I'd hate to think that both of us were leveraging undefined behavior that just happened to work in v4.0.

Hopefully a KIAM dev can chime in on whether 8103483 unintentionally broke our use case. Fixing it wouldn't be difficult; just check if b.config.AssumeRoleArn is empty and skip trying to resolve the role if so.

Same issue after upgrading to v4.2, but the server is running inside AWS:

{"level":"info","msg":"starting server","time":"2022-02-23T19:15:04Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2022-02-23T19:15:04Z"}
{"level":"info","msg":"detecting arn prefix","time":"2022-02-23T19:15:04Z"}
{"level":"info","msg":"using detected prefix: arn:aws:iam::xxxxxxxxxxxx:role/","time":"2022-02-23T19:15:04Z"}
{"level":"fatal","msg":"error using AWS STS Gateway: role can't be empty","time":"2022-02-23T19:15:04Z"}

Config:

    spec:
      containers:
      - args:
        - --json-log
        - --level=info
        - --bind=0.0.0.0:443
        - --cert=/etc/kiam/tls/tls.crt
        - --key=/etc/kiam/tls/tls.key
        - --ca=/etc/kiam/tls/ca.crt
        - --role-base-arn-autodetect
        - --session-duration=15m
        - --sync=1m
        - --prometheus-listen-addr=0.0.0.0:9620
        - --prometheus-sync-interval=5s
        command:
        - /kiam
        - server

Downgrading to v4.0 resolves the error:

{"level":"info","msg":"starting server","time":"2022-02-23T19:17:20Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2022-02-23T19:17:20Z"}
{"level":"info","msg":"detecting arn prefix","time":"2022-02-23T19:17:20Z"}
{"level":"info","msg":"using detected prefix: arn:aws:iam::xxxxxxxxxxxx:role/","time":"2022-02-23T19:17:20Z"}
{"level":"info","msg":"detecting arn prefix","time":"2022-02-23T19:17:20Z"}
{"level":"info","msg":"using detected prefix: arn:aws:iam::xxxxxxxxxxxx:role/","time":"2022-02-23T19:17:20Z"}
{"level":"info","msg":"will serve on 0.0.0.0:443","time":"2022-02-23T19:17:20Z"}

We "fixed" this by setting server.assumeRoleArn in the helm chart to the IAM role our nodes already as. Silly but it does restore behavior to what it was in 4.0.