aws-samples/amazon-sagemaker-safe-deployment-pipeline

nyctaxi-deploy-prd fails

ehsanmok opened this issue · 6 comments

The pipeline fails to create the prod stack at the SagemakerMonitoringSchedule resource with this error:

Resource handler returned message: "Error occurred during operation 'CREATE'." (RequestToken: 40af8897-76d2-abb5-6efc-ef8c6948d42b, HandlerErrorCode: GeneralServiceException)

Note that everything else succeeds and the pipeline runs in us-east-1.

Hi @ehsanmok, are you using the latest code in master? The deploy role requires permissions to create the monitoring schedule. The specific errors are not visible from CFN.

Yes, it's the latest CFT from the one-click launch button. The error is too generic, and I can't find more details about it either.

Hi @ehsanmok, the CFN template in S3 was out of date with the repository's pipeline.yml. It has now been updated, but you can fix your stack by updating it with the pipeline.yml in the master branch.

This will update the DeployRole with permissions sufficient to create the monitoring schedule.
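Since CloudFormation hides the underlying error, one way to check the role directly is IAM policy simulation. The following is a minimal sketch, not part of the repository: the function name `denied_actions` and the action list are assumptions (the thread does not say which actions were missing), and the client is passed in so it can be a `boto3.client("iam")` or a stub.

```python
from typing import Any, List

# Assumption: the action the deploy role needs for the monitoring schedule.
# The thread does not list the exact missing permissions.
REQUIRED_ACTIONS = ["sagemaker:CreateMonitoringSchedule"]


def denied_actions(iam_client: Any, role_arn: str, actions: List[str]) -> List[str]:
    """Return the actions that IAM policy simulation does not allow for role_arn."""
    response = iam_client.simulate_principal_policy(
        PolicySourceArn=role_arn,
        ActionNames=actions,
    )
    # EvalDecision is "allowed", "explicitDeny", or "implicitDeny".
    return [
        result["EvalActionName"]
        for result in response["EvaluationResults"]
        if result["EvalDecision"] != "allowed"
    ]
```

With boto3 this could be called as `denied_actions(boto3.client("iam"), role_arn, REQUIRED_ACTIONS)`, where `role_arn` is the ARN of the deploy role in your account. An empty list suggests the role's policies allow the action, though the simulation here does not test specific resource ARNs.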

I just updated with master, but it still failed with the same error.

Hi @ehsanmok, please ensure you updated the main nyctaxi stack; this will update the DeployRole, which is used by the nyctaxi-deploy-prd stack. I've re-tested this from scratch and validated that the pipeline works, so if you are still having issues, perhaps start again with a clean CFN setup and re-test.

Yes, updated the main CFT and released the changes.

My first attempt to delete the main stack gave this error:

mlops-nyctaxi-deploy-role is invalid or cannot be assumed

The second attempt worked, but I had to delete all the artifacts (S3 bucket, endpoint, model, etc.) manually; this could be automated with a Lambda function and the crhelper package. After recreating the entire stack and running the mlops notebook, the pipeline fails to create nyctaxi-workflow with

Resource handler returned message: "State Machine is being deleted: 'arn:aws:states:us-east-1:ACCOUNT:stateMachine:nyctaxi' (Service: AWSStepFunctions; Status Code: 400; Error Code: StateMachineDeleting; Request ID: 218c294f-53a2-44ba-9256-4cb227b43fa9; Proxy: null)" (RequestToken: 66428fdb-9fb6-3309-5ed8-04e7d868dbd1, HandlerErrorCode: GeneralServiceException)
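The StateMachineDeleting error indicates the old state machine had not finished deleting when the new one was created. A rough sketch of waiting for the deletion to complete before re-running the pipeline follows; the helper name `wait_until_deleted` and its polling defaults are assumptions, and `sfn_client` is expected to behave like `boto3.client("stepfunctions")`.

```python
import time
from typing import Any, Callable


def wait_until_deleted(
    sfn_client: Any,
    state_machine_arn: str,
    poll_seconds: float = 10.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the state machine no longer exists; return False on timeout."""
    for _ in range(max_attempts):
        try:
            # While deletion is in progress this call still succeeds.
            sfn_client.describe_state_machine(stateMachineArn=state_machine_arn)
        except sfn_client.exceptions.StateMachineDoesNotExist:
            return True
        sleep(poll_seconds)
    return False
```

If this returns `True` for the ARN shown in the error, recreating the stack should no longer collide with the pending deletion.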

On the third try, I deleted everything and recreated the stack. Now the prod deployment is successful! Thanks for the very useful design :)
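For anyone wanting to automate the manual cleanup mentioned above, here is a minimal sketch of the S3 part only. It assumes an unversioned bucket, the name `empty_and_delete_bucket` is hypothetical, and `s3_client` is expected to behave like `boto3.client("s3")`. In a Lambda-backed custom resource it could be called from the delete handler (for example one registered with crhelper).

```python
from typing import Any


def empty_and_delete_bucket(s3_client: Any, bucket: str) -> int:
    """Delete every object in an unversioned bucket, then the bucket itself.

    Returns the number of objects deleted.
    """
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # "Contents" is absent when a page has no objects.
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    s3_client.delete_bucket(Bucket=bucket)
    return deleted
```

The endpoint and model would need their own delete calls; they are left out here because the thread does not name them.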