aws/amazon-genomics-cli

Unable to read data from a non-AGC S3 bucket using Snakemake

Closed this issue · 4 comments

Describe the Bug
I want to read data from a non-AGC bucket in the same account in order to run a simple Snakemake workflow that reads a file from S3, but I get a Forbidden error. I am using the latest version (1.5.1) and went through activation, configuration, and context deployment as described in the user guide, along with setting up a data location pointing to the S3 bucket and prefix my workflow will read from.

Upon running the workflow, I receive a workflow ID and see the head node job spin up successfully, but it eventually fails with the following error:

Building DAG of jobs...

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/snakemake/__init__.py", line 725, in snakemake
    success = workflow.execute(
  File "/usr/local/lib/python3.8/site-packages/snakemake/workflow.py", line 775, in execute
    dag.init()
  File "/usr/local/lib/python3.8/site-packages/snakemake/dag.py", line 179, in init
    job = self.update([job], progress=progress, create_inventory=True)
  File "/usr/local/lib/python3.8/site-packages/snakemake/dag.py", line 759, in update
    self.update_(
  File "/usr/local/lib/python3.8/site-packages/snakemake/dag.py", line 874, in update_
    selected_job = self.update(
  File "/usr/local/lib/python3.8/site-packages/snakemake/dag.py", line 759, in update
    self.update_(
  File "/usr/local/lib/python3.8/site-packages/snakemake/dag.py", line 863, in update_
    if not res.file.exists:
  File "/usr/local/lib/python3.8/site-packages/snakemake/io.py", line 452, in exists
    return self.exists_remote
  File "/usr/local/lib/python3.8/site-packages/snakemake/io.py", line 246, in wrapper
    v = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/snakemake/io.py", line 473, in exists_remote
    return self.remote_object.exists()
  File "/usr/local/lib/python3.8/site-packages/snakemake/remote/S3.py", line 79, in exists
    return self._s3c.exists_in_bucket(self.s3_bucket, self.s3_key)
  File "/usr/local/lib/python3.8/site-packages/snakemake/remote/S3.py", line 329, in exists_in_bucket
    self.s3.Object(bucket_name, key).load()
  File "/usr/local/lib/python3.8/site-packages/boto3-1.21.38-py3.8.egg/boto3/resources/factory.py", line 564, in do_action
    response = action(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/boto3-1.21.38-py3.8.egg/boto3/resources/action.py", line 88, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/usr/local/lib/python3.8/site-packages/botocore-1.24.38-py3.8.egg/botocore/client.py", line 415, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.8/site-packages/botocore-1.24.38-py3.8.egg/botocore/client.py", line 745, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
=== Running Cleanup ===
=== Bye! ===
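The trace shows the failure happening inside Snakemake's S3 remote provider while it checks whether the input object exists, which comes down to a HeadObject request made with the Batch job's credentials. A minimal sketch of that same check with boto3 (the bucket and key below are placeholders, not the actual objects in this issue) can help confirm whether the job role can read the object at all:

```python
# Minimal sketch reproducing the check that fails in the trace above.
# "mybucketname" and the key are placeholders for the real bucket and object.
import boto3
import botocore

s3 = boto3.resource("s3")
obj = s3.Object("mybucketname", "snakemake_test/sample.txt")

try:
    # Object.load() issues a HeadObject request, the same call made by
    # snakemake/remote/S3.py::exists_in_bucket in the traceback.
    obj.load()
    print("HeadObject succeeded: these credentials can read the object")
except botocore.exceptions.ClientError as err:
    # A 403 here means the role running the check is not authorized to read
    # the bucket (typically missing s3:GetObject / s3:ListBucket permissions).
    print("HeadObject failed:", err.response["Error"]["Code"])
```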

Steps to Reproduce

  • create a Snakemake workflow Snakefile_s3_sample.txt (a minimal sketch of such a workflow follows this list)
  • install, set up, and activate AGC 1.5.1 based on the docs
  • deploy a context with the agc project yaml agc-project.yaml.txt
  • agc workflow run s3_example --context newOnDemandContext
  • monitor the Batch job (the Snakemake head node job) and inspect its logs
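The attached Snakefile_s3_sample.txt is not reproduced in the issue body; a minimal sketch of the kind of workflow being run (a single rule that reads an object through Snakemake's S3 remote provider) might look like the following, with placeholder bucket, key, and output names:

```python
# Sketch of a minimal S3-reading Snakefile; bucket, key, and output names are
# placeholders, not the contents of the attached Snakefile_s3_sample.txt.
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider

S3 = S3RemoteProvider()  # uses the credentials of the job's IAM role

rule all:
    input:
        "results/sample_copy.txt"

rule fetch_from_s3:
    # Snakemake verifies this remote input with a HeadObject call while building
    # the DAG, which is where the 403 in the log above is raised.
    input:
        S3.remote("mybucketname/snakemake_test/sample.txt")
    output:
        "results/sample_copy.txt"
    shell:
        "cp {input} {output}"
```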

Relevant Logs

Expected Behavior

I expect the workflow to be able to download the S3 file and perform the operation defined by the workflow task.
Actual Behavior

The workflow fails with the 403 Forbidden error described above.

Screenshots

n/a

Additional Context

Operating System:
AGC Version: 1.5.1
Was AGC setup with a custom bucket: no
Was AGC setup with a custom VPC: no

Yes, it does (it is attached to the issue; the copy below might not format well):

```yaml
workflows:
  s3_example:
    type:
      language: snakemake
      version: 1.0
    sourceURL: workflow/s3_example
contexts:
  newOnDemandContext:
    requestSpotInstances: false
    engines:
      - type: snakemake
        engine: snakemake
data:
  - location: s3://mybucketname/snakemake_test/
    readOnly: true
```

Can you try changing the location to use the pattern:

data:
  - location: s3://my-bucket/foo/*

The * is important (it means anything under that prefix).
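Applied to the project file above, the data section would then look something like this (same bucket and prefix as before, with the trailing wildcard added):

```yaml
data:
  - location: s3://mybucketname/snakemake_test/*
    readOnly: true
```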

Hi Mark, that seemed to fix the issue reading from the S3 bucket. Thank you!