aws/aws-cdk

custom_resources: Provider Lambda function is missing lambda:GetFunctionConfiguration

erwaxler opened this issue · 12 comments

Describe the bug

The Landing Zone Accelerator solution leverages the custom_resources module to create service-linked roles via CDK custom resources. When this custom resource Lambda function is invoked several times in succession, users intermittently receive the following error:

Received response status [FAILED] from custom resource. Message returned: AccessDeniedException: Resource is not in the state functionActive

We believe this happens when incoming requests are queued and the role attached to the cdk.custom_resources.Provider function lacks the lambda:GetFunctionConfiguration permission.

Expected Behavior

The custom resource provider has the appropriate permissions and retries, so it executes successfully when invoked several times in succession.

Current Behavior

Transient failures:

Received response status [FAILED] from custom resource. Message returned: AccessDeniedException: Resource is not in the state functionActive

Reproduction Steps

Deploy v1.4.3 of the Landing Zone Accelerator on AWS.

For a smaller sample that can be extracted without deploying the entire LZA solution, you may use this custom resource construct that is used by LZA to create the service-linked roles:

https://github.com/awslabs/landing-zone-accelerator-on-aws/blob/1614a01824c5a43f97fadfb8ec0c3627a0f343dd/source/packages/%40aws-accelerator/constructs/lib/aws-iam/service-linked-role.ts#L87

Possible Solution

Add lambda:GetFunctionConfiguration permission to the provider Lambda function's IAM role.

Additional Information/Context

No response

CDK CLI Version

2.79

Framework Version

No response

Node.js Version

16.20.1

OS

Amazon Linux

Language

TypeScript

Language Version

No response

Other information

No response

pahud commented

this is probably related to #24358

The custom resource essentially checks the functionActive state before each invocation:

    /**
     * The status of the Lambda function is checked every second for up to 300 seconds.
     * Exits the loop on 'Active' state and throws an error on 'Inactive' or 'Failed'.
     *
     * And now we wait.
     *
     * Use functionActive instead of functionActiveV2, since functionActiveV2 is only
     * available on SDK 2.1080.0 and up, Lambda installs 2.1055.0 by default,
     * and we use the SDK version that Lambda includes by default.
     */
    await waitUntilFunctionActive({
      client: lambda,
      maxWaitTime: 60,
    }, {
      FunctionName: req.FunctionName,
    });
    return await lambda.invoke(req);
  }
}

But it should work as expected.

Which region did you deploy?

Instead of running the LZA, are you able to provide the smallest code snippet that reproduces this issue?

@pahud Agreed on it likely being related to #24358. The error has been seen in at least us-east-1 and ap-southeast-2, but we've heard reports from 5+ customers, so I believe the error is region-agnostic. I'll work on a smaller snippet to reproduce the behavior more predictably.

@pahud Still working on a smaller reproducible snippet. LZA creates 6 individual custom resources to create the service-linked roles.

I've added some more details below. Please advise whether it's still necessary to provide a reproducible code snippet.

This issue is exacerbated by logic that may run these custom resources on every pipeline run, for instance: awslabs/landing-zone-accelerator-on-aws#237

The following line and the surrounding retry logic require the aforementioned lambda:GetFunctionConfiguration permission to work: https://github.com/aws/aws-cdk/blob/c695b6004219426cf0e67cbb92d916a394ddd594/packages/aws-cdk-lib/custom-resources/lib/provider-framework/runtime/outbound.ts#L66C15-L66C15

However, only invokeFunction is granted on the onEventHandler Lambda function:

I believe the above provider.ts likely needs a fn.addToRolePolicy added after the above line granting "lambda:GetFunctionConfiguration" on this.onEventHandler.functionArn.

I encountered the same problem with EKS Blueprint, which uses a kubectlProvider created from https://github.com/aws/aws-cdk/blob/v2.96.0/packages/aws-cdk-lib/aws-eks/lib/kubectl-provider.ts#L144.

The error message in CloudFormation said:

Custom::AWSCDK-EKS-KubernetesResource EKSStackAwsAuthmanifest75D20040 CREATE_FAILED Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":"Waiter has timed out"} at checkExceptions ...

Also found the following error message in CloudTrail:

User: arn:aws:sts::[redacted]:assumed-role/EKSStack-awscdka-ProviderframeworkonEventS-duAmMCwNZO6z/EKSStack-awscdka-ProviderframeworkonEvent-4NeK4zj7Z6ab is not authorized to perform: lambda:GetFunctionConfiguration on resource: arn:aws:lambda:[redacted]:[redacted]:function:EKSStack-awscdkawseksKube-Handler886CB40B-3fOpwrZnomNI because no identity-based policy allows the lambda:GetFunctionConfiguration action

After adding the lambda:GetFunctionConfiguration permission to the function's IAM role, the template deployed successfully.

The waiter call changed in c3a4b7b from waitUntilFunctionActive to waitUntilFunctionActiveV2. This changed the required IAM permission from lambda:GetFunctionConfiguration to lambda:GetFunction.

    /**
     * The status of the Lambda function is checked every second for up to 300 seconds.
     * Exits the loop on 'Active' state and throws an error on 'Inactive' or 'Failed'.
     *
     * And now we wait.
     */
    await waitUntilFunctionActiveV2({
      client: lambda,
      maxWaitTime: 300,
    }, {
      FunctionName: req.FunctionName,
    });
    return await lambda.invoke(req);
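
For would-be submitters, the relationship between the waiter and the IAM action it needs can be sketched as below. This is only a minimal illustration in plain TypeScript, not the aws-cdk fix itself: `WAITER_ACTIONS`, `PolicyStatementJson`, and `frameworkInvokeStatement` are hypothetical names, and the ARN is a placeholder. An actual fix would presumably add an equivalent statement to the framework function's role in provider.ts, as suggested earlier in this thread.

```typescript
// Hypothetical helper, not part of aws-cdk-lib. It maps each SDK waiter named
// in this thread to the IAM action its polling call requires.
type WaiterName = 'waitUntilFunctionActive' | 'waitUntilFunctionActiveV2';

const WAITER_ACTIONS: Record<WaiterName, string> = {
  // The original waiter polls GetFunctionConfiguration.
  waitUntilFunctionActive: 'lambda:GetFunctionConfiguration',
  // The V2 waiter polls GetFunction.
  waitUntilFunctionActiveV2: 'lambda:GetFunction',
};

// Shape of a single statement in an IAM policy document.
interface PolicyStatementJson {
  Effect: 'Allow';
  Action: string[];
  Resource: string[];
}

// Builds a statement that lets the framework function invoke the user handler
// and run the given waiters against it.
function frameworkInvokeStatement(handlerArn: string, waiters: WaiterName[]): PolicyStatementJson {
  const actions = ['lambda:InvokeFunction', ...waiters.map((w) => WAITER_ACTIONS[w])];
  return { Effect: 'Allow', Action: actions, Resource: [handlerArn] };
}

const statement = frameworkInvokeStatement(
  'arn:aws:lambda:us-east-1:111122223333:function:my-handler', // placeholder ARN
  ['waitUntilFunctionActive', 'waitUntilFunctionActiveV2'],
);
console.log(JSON.stringify(statement, null, 2));
```

Granting both actions covers framework versions before and after the waiter change. A narrower statement would list only the action matching the waiter actually in use.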

@ejt4x It looks like both PRs failed to merge because of missing tests?

#27204
#27524

I couldn't see a passing PR for this?

@markhankins Yes, it looks like the original PR was abandoned. I am not prepared to create one anytime soon.

I was just commenting to inform any would-be submitters or reviewers that the API call and therefore required fix have changed somewhat since the original title and description of this issue were written.

any solution for this

probably related to #24358

@blinkdaffer @ejt4x @markhankins

Can you tell me which region(s) you are seeing this error in?

Are you able to provide a very tiny CDK app that we can deploy in that region and reproduce this error?

@pahud Here's my simplest way of producing this error.

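import { CustomResource } from 'aws-cdk-lib';
import { Function } from 'aws-cdk-lib/aws-lambda';
import { Provider } from 'aws-cdk-lib/custom-resources';

// Inside a Stack or Construct constructor.
// Reference a Lambda function that does not exist, so every invoke fails.
const thisLambdaDoesNotExist = Function.fromFunctionName(this, 'NonExistentLambda', 'fakelambda');

const provider = new Provider(this, 'Provider', {
  onEventHandler: thisLambdaDoesNotExist,
});

new CustomResource(this, 'Resource1', { serviceToken: provider.serviceToken });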

The actual exception (throttling, function does not exist, whatever) is swallowed by the try/catch block, leaving the following error in the CFN event log:

Resource1	
CREATE_FAILED
Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":"Waiter has timed out"} at checkExceptions (/var/runtime/node_modules/@aws-sdk/node_modules/@smithy/util-waiter/dist-cjs/index.js:59:26) at waitUntilFunctionActiveV2 (/var/runtime/node_modules/@aws-sdk/client-lambda/dist-cjs/index.js:5933:49) at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async defaultInvokeFunction (/var/task/outbound.js:1:875) at async invokeUserFunction (/var/task/framework.js:1:2192) at async onEvent (/var/task/framework.js:1:369) at async Runtime.handler (/var/task/cfn-response.js:1:1573) 

If we go digging in CloudTrail, we find this IAM error

   "errorMessage": "User: arn:aws:sts::[redacted]-ProviderframeworkonEvent-jKCdLDqBfAP0 is not authorized to perform: lambda:GetFunction on resource: arn:aws:lambda:us-west-2:redacted:function:fakelambda because no identity-based policy allows the lambda:GetFunction action",

All these errors mask the actual issue: the user Lambda invocation failed due to throttling, non-existence, or some other reason. The missing IAM permission prevents the user from discovering this.
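
One way the root cause could stay visible is to report the original invoke error alongside any waiter failure. The following is a minimal sketch under stated assumptions, not the framework's actual code: `Deps`, `messageOf`, and `invokeWithDiagnostics` are hypothetical names, and the injected `invoke` and `waitUntilActive` functions stand in for the SDK calls so the flow can run without AWS access.

```typescript
// Hypothetical shapes for illustration. The real framework calls the AWS SDK
// Lambda client directly rather than taking injected functions.
interface Deps {
  invoke: () => Promise<string>;
  waitUntilActive: () => Promise<void>;
}

// Extracts a readable message from whatever was thrown.
function messageOf(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

async function invokeWithDiagnostics(deps: Deps): Promise<string> {
  try {
    // First attempt: invoke the user function directly.
    return await deps.invoke();
  } catch (invokeErr) {
    try {
      // Give an inactive function a chance to become active.
      await deps.waitUntilActive();
    } catch (waitErr) {
      // Report both failures so the invoke error is not hidden behind the waiter.
      throw new Error(
        `Invoke failed: ${messageOf(invokeErr)}; waiter also failed: ${messageOf(waitErr)}`,
      );
    }
    // The function is active now, so retry the invoke once.
    return await deps.invoke();
  }
}

// Simulates the scenario above: the function is missing and the waiter times out.
const failing: Deps = {
  invoke: () => Promise.reject(new Error('ResourceNotFoundException: Function not found')),
  waitUntilActive: () => Promise.reject(new Error('TimeoutError: Waiter has timed out')),
};

invokeWithDiagnostics(failing).catch((err) => console.log(messageOf(err)));
```

With this shape, the message surfaced to CloudFormation would include the invoke failure as well as the waiter timeout, so the user would not need to dig through CloudTrail to find the cause.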