custom_resources: Provider Lambda function is missing lambda:GetFunctionConfiguration
erwaxler opened this issue · 12 comments
Describe the bug
The Landing Zone Accelerator solution leverages the custom_resources module to create service-linked roles via CDK custom resources. When this custom resource Lambda function is invoked several times in succession, users intermittently receive the following error:
Received response status [FAILED] from custom resource. Message returned: AccessDeniedException: Resource is not in the state functionActive
We believe this results from incoming requests being queued, combined with the role attached to the cdk.custom_resources.Provider function missing the lambda:GetFunctionConfiguration permission.
Expected Behavior
Custom resource provider implements appropriate permissions and retries to execute successfully when invoked several times in succession.
Current Behavior
Transient failures:
Received response status [FAILED] from custom resource. Message returned: AccessDeniedException: Resource is not in the state functionActive
Reproduction Steps
Deploy v1.4.3 of the Landing Zone Accelerator on AWS.
For a smaller sample that can be extracted without deploying the entire LZA solution, you may use this custom resource construct that is used by LZA to create the service-linked roles:
Possible Solution
Add the lambda:GetFunctionConfiguration permission to the provider Lambda function's IAM role.
Additional Information/Context
No response
CDK CLI Version
2.79
Framework Version
No response
Node.js Version
16.20.1
OS
Amazon Linux
Language
TypeScript
Language Version
No response
Other information
No response
this is probably related to #24358
The custom resource essentially checks the functionActive state before each invocation:
But it should work as expected.
Which region did you deploy to?
Instead of running the LZA, are you able to provide a smallest code snippet that reproduces this issue?
@pahud Agreed on it likely being related to #24358. The error has been seen in at least us-east-1 and ap-southeast-2, but we've heard reports from 5+ customers, so I believe the error to be region-agnostic. I'll work on a smaller snippet to reproduce the behavior more predictably.
Are you able to see how many custom resources of this will be created in your LZA deployment?
@pahud Still working on a smaller reproducible snippet. LZA creates 6 individual custom resources to create the service-linked roles.
I've added some more details below, please advise whether it's still necessary to provide a reproducible code-snippet.
This issue is exacerbated by logic that may run these custom resources on every pipeline run, for instance: awslabs/landing-zone-accelerator-on-aws#237
The following line and the surrounding retry logic require the aforementioned lambda:GetFunctionConfiguration permission to function: https://github.com/aws/aws-cdk/blob/c695b6004219426cf0e67cbb92d916a394ddd594/packages/aws-cdk-lib/custom-resources/lib/provider-framework/runtime/outbound.ts#L66C15-L66C15
However, only invokeFunction is granted on the onEventHandler Lambda function:
I believe the above provider.ts likely needs a fn.addToRolePolicy added after the above line, granting "lambda:GetFunctionConfiguration" on this.onEventHandler.functionArn.
Encountered the same problem with EKS Blueprint, which uses the kubectlProvider created from https://github.com/aws/aws-cdk/blob/v2.96.0/packages/aws-cdk-lib/aws-eks/lib/kubectl-provider.ts#L144.
The error message in CloudFormation said:
Custom::AWSCDK-EKS-KubernetesResource EKSStackAwsAuthmanifest75D20040 CREATE_FAILED Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":"Waiter has timed out"} at checkExceptions ...
Also found the following error message in CloudTrail:
User: arn:aws:sts::[redacted]:assumed-role/EKSStack-awscdka-ProviderframeworkonEventS-duAmMCwNZO6z/EKSStack-awscdka-ProviderframeworkonEvent-4NeK4zj7Z6ab is not authorized to perform: lambda:GetFunctionConfiguration on resource: arn:aws:lambda:[redacted]:[redacted]:function:EKSStack-awscdkawseksKube-Handler886CB40B-3fOpwrZnomNI because no identity-based policy allows the lambda:GetFunctionConfiguration action
After adding the lambda:GetFunctionConfiguration permission to the function's IAM role, the template can be deployed.
The waiter call changed in c3a4b7b from waitUntilFunctionActive to waitUntilFunctionActiveV2. This changed the required IAM permission from lambda:GetFunctionConfiguration to lambda:GetFunction.
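For anyone applying the permission by hand in the meantime, the following is a minimal sketch of an IAM statement covering both waiter variants mentioned in this thread. It is a plain, dependency-free object builder, not code from aws-cdk-lib. The helper name waiterReadStatement and the example ARN in the usage note are hypothetical.

```typescript
// JSON shape of a single IAM policy statement.
interface PolicyStatementJson {
  Effect: "Allow" | "Deny";
  Action: string[];
  Resource: string[];
}

// Hypothetical helper (not part of aws-cdk-lib): builds a statement that lets
// the provider framework function read the state of the user's handler.
function waiterReadStatement(handlerArn: string): PolicyStatementJson {
  return {
    Effect: "Allow",
    Action: [
      // Reported in this thread as required by waitUntilFunctionActive.
      "lambda:GetFunctionConfiguration",
      // Reported in this thread as required by waitUntilFunctionActiveV2.
      "lambda:GetFunction",
    ],
    // Scope the grant to the single handler function.
    Resource: [handlerArn],
  };
}
```

Calling waiterReadStatement("arn:aws:lambda:us-west-2:123456789012:function:example") with a placeholder ARN yields an object in the standard IAM statement format. In a CDK app, an object of this shape could presumably be passed to iam.PolicyStatement.fromJson and attached with addToRolePolicy, as suggested earlier in the thread.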
@markhankins Yes, it looks like the original PR was abandoned. I am not prepared to create one anytime soon.
I was just commenting to inform any would-be submitters or reviewers that the API call and therefore required fix have changed somewhat since the original title and description of this issue were written.
Is there any solution for this?
probably related to #24358
@blinkdaffer @ejt4x @markhankins
Can you tell me which region(s) you are seeing this error in?
Are you able to provide a very tiny CDK app that we can deploy in that region and reproduce this error?
@pahud Here's my simplest way of producing this error.
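import { CustomResource } from 'aws-cdk-lib';
import { Function } from 'aws-cdk-lib/aws-lambda';
import { Provider } from 'aws-cdk-lib/custom-resources';
// Inside a Stack constructor: reference a Lambda function that does not exist.
const thisLambdaDoesNotExist = Function.fromFunctionName(this, 'NonExistentLambda', 'fakelambda');
const provider = new Provider(this, 'Provider', {
  onEventHandler: thisLambdaDoesNotExist,
});
new CustomResource(this, 'Resource1', { serviceToken: provider.serviceToken });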
The actual exception (throttling, function does not exist, whatever) is swallowed by the try/catch block, leaving the following error in the CFN event log:
Resource1
CREATE_FAILED
Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":"Waiter has timed out"} at checkExceptions (/var/runtime/node_modules/@aws-sdk/node_modules/@smithy/util-waiter/dist-cjs/index.js:59:26) at waitUntilFunctionActiveV2 (/var/runtime/node_modules/@aws-sdk/client-lambda/dist-cjs/index.js:5933:49) at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async defaultInvokeFunction (/var/task/outbound.js:1:875) at async invokeUserFunction (/var/task/framework.js:1:2192) at async onEvent (/var/task/framework.js:1:369) at async Runtime.handler (/var/task/cfn-response.js:1:1573)
If we go digging in CloudTrail, we find this IAM error:
"errorMessage": "User: arn:aws:sts::[redacted]-ProviderframeworkonEvent-jKCdLDqBfAP0 is not authorized to perform: lambda:GetFunction on resource: arn:aws:lambda:us-west-2:redacted:function:fakelambda because no identity-based policy allows the lambda:GetFunction action",
All these errors mask the actual issue: the user Lambda invocation failed due to throttling, non-existence, or some other reason. The missing IAM permission prevents the user from discovering this.
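The masking described above can be illustrated with a small, dependency-free sketch. This is not the code in outbound.ts. It only models the general pattern of retrying an invocation after a waiter call, and it is synchronous for brevity. All names (invokeMasking, invokePreserving, describeError) are hypothetical.

```typescript
type Invoke = () => string;
type WaitUntilActive = () => void;

// Turns any thrown value into a readable message.
function describeError(e: unknown): string {
  return e instanceof Error ? e.message : String(e);
}

// Pattern that masks the root cause: if the waiter throws inside the catch
// block, its error replaces the original invocation error.
function invokeMasking(invoke: Invoke, waitUntilActive: WaitUntilActive): string {
  try {
    return invoke();
  } catch {
    waitUntilActive(); // a failure here hides why invoke() failed
    return invoke();
  }
}

// Alternative that keeps the root cause: when the waiter also fails, report
// both the invocation error and the waiter error.
function invokePreserving(invoke: Invoke, waitUntilActive: WaitUntilActive): string {
  try {
    return invoke();
  } catch (invokeError) {
    try {
      waitUntilActive();
    } catch (waitError) {
      throw new Error(
        `Invocation failed: ${describeError(invokeError)} ` +
          `(waiter also failed: ${describeError(waitError)})`
      );
    }
    return invoke();
  }
}
```

With an invoke that throws a "function not found" error and a waiter that throws a timeout, invokeMasking surfaces only the timeout, while invokePreserving surfaces a message containing both. Whether the provider framework should adopt something like the second form is a design question for the maintainers. The sketch only shows why the first form makes the CloudFormation event log misleading.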