kislyuk/aegea

AWS thinks ComputeEnvironment exists

Bek opened this issue · 6 comments

Bek commented

After deleting the image named 'dex',

aws ecr delete-repository --repository-name dex

an attempt to create it anew gives the following stack trace:

Traceback (most recent call last):
  File "/Users/bek/.pyenv/versions/2.7.13/bin/aegea", line 23, in <module>
    aegea.main()
  File "/Users/bek/.pyenv/versions/2.7.13/lib/python2.7/site-packages/aegea/__init__.py", line 80, in main
    result = parsed_args.entry_point(parsed_args)
  File "/Users/bek/.pyenv/versions/2.7.13/lib/python2.7/site-packages/aegea/build_docker_image.py", line 107, in build_docker_image
    job = submit(submit_args)
  File "/Users/bek/.pyenv/versions/2.7.13/lib/python2.7/site-packages/aegea/batch.py", line 265, in submit
    ensure_queue(args.queue)
  File "/Users/bek/.pyenv/versions/2.7.13/lib/python2.7/site-packages/aegea/batch.py", line 242, in ensure_queue
    create_compute_environment(cce_parser.parse_args(args=[name]))
  File "/Users/bek/.pyenv/versions/2.7.13/lib/python2.7/site-packages/aegea/batch.py", line 103, in create_compute_environment
    serviceRole=batch_iam_role.arn)
  File "/Users/bek/.pyenv/versions/2.7.13/lib/python2.7/site-packages/botocore/client.py", line 253, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/bek/.pyenv/versions/2.7.13/lib/python2.7/site-packages/botocore/client.py", line 557, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.ClientException: An error occurred (ClientException) when calling the CreateComputeEnvironment operation: Object already exists
Traceback (most recent call last):
  File "/Users/bek/.pyenv/versions/2.7.13/bin/aegea-build-image-for-mission", line 42, in <module>
    env=dict(os.environ, AEGEA_CONFIG_FILE=os.path.join(mission_wd, "config.yml"))
  File "/Users/bek/.pyenv/versions/2.7.13/lib/python2.7/subprocess.py", line 186, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '[u'aegea', u'build_docker_image', 'dex', u'--tags', u'AegeaMission=docker-example']' returned non-zero exit status 1

'dex' then shows up in the list of images but trying to run a Batch job throws the same error. Could this be an AWS API issue?

Thanks for reporting. Looking into this.

Bek commented

If it helps, I just saw that the aegea_batch compute environment in the AWS console had this for its status:

Status INVALID
CLIENT_ERROR - Not authorized to perform sts:AssumeRole (Service: AWSSecurityTokenService; Status Code: 403; Error Code: AccessDenied; Request ID: 4bccfd3f-28bb-11e7-8ba4-7512a02ce50d)

Maybe the temporary credentials issued to a client lack the proper permissions?
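The status shown in the console can also be read programmatically. The following is a minimal sketch, assuming `batch` is a boto3 Batch client (for example `boto3.client("batch")`); the helper name `compute_environment_status` is hypothetical and is not part of aegea.

```python
def compute_environment_status(batch, name):
    """Return (status, statusReason) for a compute environment, or None if absent.

    `batch` is assumed to be a boto3 Batch client; `name` is the compute
    environment name or ARN. This helper is illustrative, not aegea code.
    """
    response = batch.describe_compute_environments(computeEnvironments=[name])
    environments = response.get("computeEnvironments", [])
    if not environments:
        return None
    ce = environments[0]
    # statusReason may be missing when there is nothing to report.
    return ce.get("status"), ce.get("statusReason", "")

# Possible usage (requires boto3 and AWS credentials):
#   import boto3
#   print(compute_environment_status(boto3.client("batch"), "aegea_batch"))
```

For the environment described above, this would be expected to report an INVALID status together with the sts:AssumeRole reason.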

Thanks for the follow-up. Unfortunately I'm not entirely sure what's going on here.

When you edit the aegea_batch compute environment in the AWS Console, what does the "Service role" line item say? What about "Instance role arn"?

This could potentially be an IAM eventual consistency problem that caused the CE to enter this state when you were first creating it. Could you try disabling and deleting the batch queue aegea.batch, disabling and deleting the compute environment aegea.batch, and running the aegea command again? If the CE doesn't properly delete itself, you may need to edit it.
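The disable-and-delete sequence suggested above can be sketched as follows. This is an illustration under assumptions, not aegea's own code: `batch` is assumed to be a boto3 Batch client, and the helper names `wait_until` and `delete_queue_and_environment` are hypothetical. If the compute environment is stuck with a broken service role, the delete calls may still fail, in which case the environment has to be edited first.

```python
import time

def wait_until(predicate, timeout=300, interval=5, sleep=time.sleep):
    """Poll predicate() until it returns True, or raise after `timeout` seconds."""
    waited = 0
    while not predicate():
        if waited >= timeout:
            raise RuntimeError("timed out waiting for AWS Batch")
        sleep(interval)
        waited += interval

def _status(items):
    # A resource that is no longer listed is treated as deleted.
    return items[0]["status"] if items else "DELETED"

def delete_queue_and_environment(batch, queue, environment, wait=wait_until):
    """Disable and delete a job queue, then its compute environment."""
    def queue_status():
        return _status(batch.describe_job_queues(jobQueues=[queue])["jobQueues"])

    def ce_status():
        response = batch.describe_compute_environments(computeEnvironments=[environment])
        return _status(response["computeEnvironments"])

    # The queue must go first: a compute environment that is still
    # attached to a queue cannot be deleted.
    if queue_status() != "DELETED":
        batch.update_job_queue(jobQueue=queue, state="DISABLED")
        wait(lambda: queue_status() != "UPDATING")
        batch.delete_job_queue(jobQueue=queue)
        wait(lambda: queue_status() == "DELETED")

    if ce_status() != "DELETED":
        batch.update_compute_environment(computeEnvironment=environment, state="DISABLED")
        wait(lambda: ce_status() != "UPDATING")
        batch.delete_compute_environment(computeEnvironment=environment)
        wait(lambda: ce_status() == "DELETED")

# Possible usage (requires boto3 and AWS credentials):
#   import boto3
#   delete_queue_and_environment(boto3.client("batch"), "aegea.batch", "aegea.batch")
```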

Bek commented

Thanks for looking into this. You are right, it's a consistency problem. Deleting the CE fails both from the console and from the command line, hence the error in botocore.

The service role is empty, but the dropdown has 'aegea.batch.service' as one of the options, and the 'Instance role arn' is aegea.batch.ecs_container_instance.

Steps taken before it worked were:

  1. Changed the CE Service Role to AWSBatchServiceRole.
  2. Deleted the CE.
  3. Re-ran the build-image command.
    • This resulted in an invalid CE shown in the console, while the build-image command in the terminal was suspended.
  4. Changed the CE Service Role to AWSBatchServiceRole again.
    • This resulted in a valid CE, and the build-image command was un-suspended and ran successfully.
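The console fix in steps 1 and 4 (switching the service role, then waiting for the CE to become valid) could be scripted roughly as below. This is a sketch under assumptions: `batch` is assumed to be a boto3 Batch client, the helper name `set_service_role_and_wait` is hypothetical, and depending on the role's path a full role ARN may be needed instead of the bare role name.

```python
import time

def set_service_role_and_wait(batch, environment, service_role,
                              timeout=300, interval=5, sleep=time.sleep):
    """Point a compute environment at `service_role` and wait until it is VALID.

    Returns the compute environment description once its status is VALID.
    Raises RuntimeError if it has not become VALID within `timeout` seconds.
    """
    batch.update_compute_environment(computeEnvironment=environment,
                                     serviceRole=service_role)
    waited = 0
    while True:
        response = batch.describe_compute_environments(computeEnvironments=[environment])
        environments = response["computeEnvironments"]
        if not environments:
            raise LookupError("compute environment %r not found" % environment)
        ce = environments[0]
        if ce["status"] == "VALID":
            return ce
        # INVALID is not treated as final here: the status may lag
        # behind the role change for a short while.
        if waited >= timeout:
            raise RuntimeError("%s is still %s: %s" % (
                environment, ce["status"], ce.get("statusReason", "")))
        sleep(interval)
        waited += interval

# Possible usage (requires boto3 and AWS credentials):
#   import boto3
#   set_service_role_and_wait(boto3.client("batch"), "aegea_batch", "AWSBatchServiceRole")
```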

I suppose running the build-image command with --service-role AWSBatchServiceRole would work...

Thank you, and I'm glad you were able to figure out the solution. I think I know what's going on - I thought I had this (on-demand configuration of a new CE) figured out, but apparently not. I will test this again. Please let me know if you encounter any further issues.

Resolved offline and in the latest release.