falcosecurity/falcoctl

[falcoctl artifact follow]: can't handle or refresh ECR token after initial artifact pull

Closed this issue · 8 comments

What happened:

falcoctl artifact follow is not able to refresh an ECR auth token when working with the amazon-ecr-credential-helper binary.

What you expected to happen:
falcoctl artifact follow uses the amazon-ecr-credential-helper binary correctly to authenticate against ECR while following an artifact, including more than 12 hours after the initial authentication against ECR.

How to reproduce it (as minimally and precisely as possible):
Environment: AWS EKS, with Falco deployed via the Helm chart. The latest version of amazon-ecr-credential-helper is added to both the falcoctl-artifact-install and falcoctl-artifact-follow containers. The IAM role for the Falco service account is bound to the pod with all permissions needed to pull the artifact from ECR and to get an auth token (ecr:GetAuthorizationToken on *). No static credentials or tokens are in place.

The falcoctl-artifact-install initContainer is able to pull the artifact. The falcoctl-artifact-follow sidecar container follows the artifact and can pull it for 12 hours. After 12 hours, falcoctl-artifact-follow can no longer pull the artifact:

 INFO   (<REDACTED>.dkr.ecr.eu-west-1.amazonaws.com/falco-rules:master) fetching descriptor from remote repository...
 INFO   (<REDACTED>.dkr.ecr.eu-west-1.amazonaws.com/falco-rules:master) descriptor correctly fetched
 INFO   (<REDACTED>.dkr.ecr.eu-west-1.amazonaws.com/falco-rules:master) nothing to do, artifact already up to date.
 INFO   (ghcr.io/falcosecurity/rules/falco-rules:1) fetching descriptor from remote repository...
 INFO   (ghcr.io/falcosecurity/rules/falco-rules:1) descriptor correctly fetched
 INFO   (ghcr.io/falcosecurity/rules/falco-rules:1) nothing to do, artifact already up to date.
 INFO   (<REDACTED>.dkr.ecr.eu-west-1.amazonaws.com/falco-rules:master) fetching descriptor from remote repository...
 ERRO   (<REDACTED>.dkr.ecr.eu-west-1.amazonaws.com/falco-rules:master) an error occurred while fetching descriptor from remote repository: GET "https://<REDACTED>.dkr.ecr.eu-west-1.amazonaws.com/v2/falco-rules/manifests/master": response status code 403: denied: Your authorization token has expired. Reauthenticate and try again.

What's interesting is that when I exec into the falcoctl-artifact-follow container, I can still pull the artifact after 12 hours with /usr/bin/falcoctl-bin registry pull <REDACTED>.dkr.ecr.eu-west-1.amazonaws.com/falco-rules:master. So registry pull, at least, works correctly with amazon-ecr-credential-helper: even though the initial auth was 12 hours ago (the default and maximum lifetime of an ECR token), it retrieves a new token via amazon-ecr-credential-helper:

 INFO  Preparing to pull artifact "<REDACTED>.dkr.ecr.eu-west-1.amazonaws.com/falco-rules:master"
 INFO  Pulling artifact in the current directory
 INFO  Pulling 44136fa355b3: ############################################# 100% 
 INFO  Pulling d4d8c15d06f0: ############################################# 100% 
 INFO  Pulling bc5e5c05124d: ############################################# 100% 
 INFO  Artifact of type "rulesfile" pulled. Digest: "sha256:bc5e5c05124ded85ee425a8108c1d2ef13dc2165f0534123243564abb834d963"

Configuration that I use:
.docker/config.json:

{"credHelpers": {"<REDACTED>.dkr.ecr.eu-west-1.amazonaws.com": "ecr-login"}}
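For context, the credHelpers entry above follows the Docker credential helper convention: a value of "ecr-login" refers to an executable named docker-credential-ecr-login, which is invoked with the get subcommand, receives the registry host on stdin, and prints JSON containing Username and Secret fields. The following is a minimal Python sketch of that lookup and call, not falcoctl's actual code (falcoctl is written in Go); the function names helper_binary and get_credentials are hypothetical and used only for illustration.

```python
import json
import subprocess
from typing import Dict, Optional, Tuple


def helper_binary(docker_config: Dict, registry: str) -> Optional[str]:
    """Return the helper executable configured for `registry`, or None.

    Hypothetical helper: maps a credHelpers value such as "ecr-login"
    to the conventional executable name "docker-credential-ecr-login".
    """
    suffix = docker_config.get("credHelpers", {}).get(registry)
    return f"docker-credential-{suffix}" if suffix else None


def get_credentials(binary: str, registry: str) -> Tuple[str, str]:
    """Run `<binary> get`, passing the registry host on stdin.

    The helper is expected to print JSON with "Username" and "Secret".
    Every call starts a new helper process, so a caller that wants fresh
    credentials has to call this again rather than reuse an old result.
    """
    result = subprocess.run(
        [binary, "get"],
        input=registry,
        capture_output=True,
        text=True,
        check=True,
    )
    payload = json.loads(result.stdout)
    return payload["Username"], payload["Secret"]
```

With the configuration shown above, helper_binary(config, "<registry host>") would return "docker-credential-ecr-login", and get_credentials would then need that binary to be on PATH inside the container.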

Token caching is disabled, and so is SDK config loading. I've also tried with both the cache and SDK config loading enabled; the behaviour of falcoctl-artifact-follow is the same:

                    - name: AWS_ECR_DISABLE_CACHE
                      value: "true"
                    - name: AWS_SDK_LOAD_CONFIG
                      value: "false"

AWS IAM role policy for service account:

    Statement = [
      {
        Action = [
          "ecr:BatchCheckLayerAvailability",
          "ecr:BatchDeleteImage",
          "ecr:BatchGetImage",
          "ecr:CompleteLayerUpload",
          "ecr:DescribeImages",
          "ecr:DescribeRegistry",
          "ecr:DescribeRepositories",
          "ecr:GetDownloadUrlForLayer",
          "ecr:GetRegistryPolicy",
          "ecr:GetRepositoryPolicy",
          "ecr:InitiateLayerUpload",
          "ecr:ListImages",
          "ecr:ListTagsForResource",
          "ecr:PutImage",
          "ecr:TagResource",
          "ecr:UntagResource",
          "ecr:UploadLayerPart",
        ]
        Effect   = "Allow"
        Resource = "arn:aws:ecr:eu-west-1<REDACTED>:repository/falco-rules"
      },
      {
        Action = [
          "ecr:GetAuthorizationToken"
        ]
        Effect   = "Allow"
        Resource = "*"
      },
    ]

~/.ecr/log/ecr-login.log is below. I'd expect to see at least some records there from when falcoctl-artifact-follow should retrieve a new token after 12 hours:

ecr-login.log
time="2023-09-19T21:12:26Z" level=debug msg="Cache disabled due to AWS_ECR_DISABLE_CACHE"
time="2023-09-19T21:12:26Z" level=debug msg="Retrieving credentials" region=eu-west-1 registry=<REDACTED> serverURL=318522186253.dkr.ecr.eu-west-1.amazonaws.com service=ecr
time="2023-09-19T21:12:26Z" level=debug msg="Calling ECR.GetAuthorizationToken" registry=<REDACTED>
time="2023-09-20T11:09:10Z" level=debug msg="Cache disabled due to AWS_ECR_DISABLE_CACHE"
time="2023-09-20T11:09:10Z" level=debug msg="Retrieving credentials" region=eu-west-1 registry=<REDACTED> serverURL=318522186253.dkr.ecr.eu-west-1.amazonaws.com service=ecr
time="2023-09-20T11:09:10Z" level=debug msg="Calling ECR.GetAuthorizationToken" registry=<REDACTED>

I've tried all possible combinations of .docker/config.json settings and env vars such as AWS_ECR_DISABLE_CACHE and AWS_SDK_LOAD_CONFIG, with no luck.

Since falcoctl registry pull can still use amazon-ecr-credential-helper correctly and pull the artifact after 12 hours, the credential helper integration works as expected for falcoctl registry pull. I therefore assume the issue lies specifically in the falcoctl artifact follow controller / method.

@alacuku Would appreciate any suggestions and solutions there. Thanks in advance!

Hi @CarpathianUA, thanks for the detailed issue.

Falcoctl caches the authentication token, and it seems it is not able to refresh it after it expires. Disabling the internal cache makes it work, but then falcoctl authenticates to the remote repository on every request. The authentication is based on the https://github.com/oras-project/oras-credentials-go module. I will take some time to investigate further.
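To illustrate the trade-off described here, the sketch below shows one generic way a token cache can avoid both problems: it reuses a token while it is valid and fetches a new one shortly before it expires. This is only an assumption-laden illustration in Python, not falcoctl's implementation and not necessarily what the fix does; the class name ExpiringTokenCache, the fetch callback, and the 60-second margin are all hypothetical.

```python
import time
from typing import Callable, Optional, Tuple


class ExpiringTokenCache:
    """Cache a token and re-fetch it shortly before it expires.

    `fetch` is a hypothetical callback returning (token, ttl_seconds),
    for example a wrapper around a credential helper call.
    """

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],
        clock: Callable[[], float] = time.monotonic,
        margin: float = 60.0,
    ) -> None:
        self._fetch = fetch
        self._clock = clock
        self._margin = margin  # refresh this many seconds before expiry
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def token(self) -> str:
        now = self._clock()
        # Re-fetch on first use or once the token is (almost) expired.
        if self._token is None or now >= self._expires_at - self._margin:
            self._token, ttl = self._fetch()
            self._expires_at = now + ttl
        return self._token

    def invalidate(self) -> None:
        """Drop the cached token, e.g. after the registry returns 403."""
        self._token = None
```

A long-running loop such as a follower would call token() before each request, so a 12-hour token is fetched roughly twice a day instead of on every poll, and invalidate() gives a fallback if the registry rejects a token earlier than expected.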

Hi @alacuku, thank you for the quick reply! I'd really appreciate it if this functionality were implemented in an upcoming falcoctl release.

Hey @CarpathianUA, the fix is here #326.

I tested it locally, but would really appreciate it if you could test it in your environment.

@alacuku I can confirm that in the latest falcoctl release all works as expected, thank you!

@CarpathianUA

May I ask how you modified the Helm chart to add amazon-ecr-credential-helper to the installer and follower containers? Did you fork or clone the repo and modify it?

Hi, you don't need to modify anything. The official chart lets you add init and sidecar containers, as well as custom volumes and volume mounts. So you can download the ECR helper binary in an init container to a path on a shared volume, and then mount that volume into the install and follow containers.

@CarpathianUA - My falcoctl was able to pull rules from ECR, but after some time I started getting errors about empty credentials for ECR. After recreating the Service Account and restarting the pod, it started working. Is it possible that this is connected to the issue you described above?

At which level did you set AWS_ECR_DISABLE_CACHE? Just for falcoctl, or for artifacts, or for config?

@robert-pudlowski-mox
I've set

                    - name: AWS_ECR_DISABLE_CACHE
                      value: "true"
                    - name: AWS_SDK_LOAD_CONFIG
                      value: "false"

for both the follow and install containers.