Link up to AWS S3 buckets
The NIH has recently started to make SRA data available directly on AWS S3. It would be cool if SRA-Explorer could also link to these.
The complication is that not all datasets are available, and those that are available are spread across more than one S3 bucket. I think the only way to get the URLs is to take the accession number, build a "guess" S3 URI, and then test whether it exists.
The current buckets are:
- https://s3.console.aws.amazon.com/s3/buckets/sra-pub-src-1/
- https://s3.console.aws.amazon.com/s3/buckets/sra-pub-src-2/
An example URL to a specific BAM file: http://sra-pub-src-1.s3.us-east-1.amazonaws.com/DRZ000036/F10-DA.bam.1 (possible to directly download without authentication).
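As a rough sketch of the "build a guess" step, the helpers below assemble an S3 URI and a direct-download URL from a bucket name and an object key. The function names are hypothetical, the bucket names and region are the ones listed in this issue (they may change), and the `<accession>/<file>` key layout is inferred from the single example URL above. The sketch uses `https` rather than the `http` in the example.

```python
from urllib.parse import quote

# Bucket names and region as listed in this issue; treat them as assumptions.
SRA_BUCKETS = ("sra-pub-src-1", "sra-pub-src-2")
SRA_REGION = "us-east-1"


def s3_uri(bucket: str, key: str) -> str:
    """Build an s3:// URI, e.g. for use with the AWS CLI."""
    return f"s3://{bucket}/{key}"


def s3_https_url(bucket: str, key: str, region: str = SRA_REGION) -> str:
    """Build a virtual-hosted-style HTTPS URL for direct download.

    The key is percent-encoded, leaving '/' separators intact.
    """
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


if __name__ == "__main__":
    # Key taken from the example URL in this issue.
    print(s3_https_url("sra-pub-src-1", "DRZ000036/F10-DA.bam.1"))
```

Building the URL is the easy part; whether the object actually exists still has to be checked against each bucket.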
The buckets should allow public and anonymous access, so we should be able to use an AWS SDK to ping the expected files to see if they exist. @wleepang gave a nice example in Python:
>>> import boto3
>>> from botocore import UNSIGNED
>>> from botocore.client import Config
>>> s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
>>> s3.head_bucket(Bucket='sra-pub-src-1')
{'ResponseMetadata': {'HTTPStatusCode': 200, 'RetryAttempts': 0, 'HostId': 'sE4i9sSiQmHwuBeBAKp8JUOsDq09BIoX/WtNQlmO+7qvmTe9/bwJfBqkCdAE0cdDg8Fspcbmddc=', 'RequestId': '931ABD9E2B59BA63', 'HTTPHeaders': {'date': 'Wed, 22 Jan 2020 20:38:56 GMT', 'x-amz-id-2': 'sE4i9sSiQmHwuBeBAKp8JUOsDq09BIoX/WtNQlmO+7qvmTe9/bwJfBqkCdAE0cdDg8Fspcbmddc=', 'server': 'AmazonS3', 'transfer-encoding': 'chunked', 'x-amz-request-id': '931ABD9E2B59BA63', 'x-amz-bucket-region': 'us-east-1', 'content-type': 'application/xml'}}}
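Extending that example, here is a minimal sketch of how the lookup for a single accession might work. Because the file names inside an accession directory are not predictable, it lists everything under the `<accession>/` prefix rather than probing one exact key. The function names are hypothetical, and it assumes that the buckets permit anonymous listing and that keys are laid out as `<accession>/<file>`; neither is confirmed here.

```python
def anonymous_client():
    """Create an unsigned (anonymous) S3 client, as in the example above."""
    import boto3
    from botocore import UNSIGNED
    from botocore.client import Config

    return boto3.client("s3", config=Config(signature_version=UNSIGNED))


def list_accession_keys(s3, bucket, accession):
    """Return every object key under '<accession>/' in one bucket."""
    keys = []
    kwargs = {"Bucket": bucket, "Prefix": f"{accession}/"}
    while True:
        page = s3.list_objects_v2(**kwargs)
        # 'Contents' is absent when nothing matches the prefix.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        # Results are paginated; ask for the next page.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


def find_accession(s3, accession, buckets=("sra-pub-src-1", "sra-pub-src-2")):
    """Return (bucket, keys) for the first bucket holding the accession, else None."""
    for bucket in buckets:
        keys = list_accession_keys(s3, bucket, accession)
        if keys:
            return bucket, keys
    return None
```

Usage would be something like `find_accession(anonymous_client(), "DRZ000036")`. Passing the client in as an argument keeps the lookup easy to test with a stub instead of real network calls.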
Note that the files within each accession directory seem to be arbitrarily named and quite variable: there are BAM files, FASTQ files, FASTA files, all sorts. So we need a big warning notice to (a) let the user know that it's up to them to curate the file list that they're getting and (b) count and warn about how many datasets we were unable to find.
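One possible shape for the two warnings, as a sketch only: given a mapping from accession to the list of S3 keys found for it (empty when nothing was found), split the results into found and missing and build the messages to show. The function name, the input shape, and the wording of the messages are all hypothetical.

```python
def summarise_lookup(results):
    """Split lookup results into found/missing and build user-facing warnings.

    results: dict mapping accession -> list of S3 keys (empty if not found).
    """
    found = {acc: keys for acc, keys in results.items() if keys}
    missing = sorted(acc for acc, keys in results.items() if not keys)

    warnings = []
    if found:
        # (a) the file list is not curated for the user
        warnings.append(
            "File names and formats on S3 vary between accessions; "
            "please check the file list before downloading."
        )
    if missing:
        # (b) report how many datasets could not be found
        warnings.append(
            f"{len(missing)} of {len(results)} accessions were not found on S3: "
            + ", ".join(missing)
        )
    return found, missing, warnings


if __name__ == "__main__":
    example = {
        "DRZ000036": ["DRZ000036/F10-DA.bam.1"],  # key from the example URL
        "HYPOTHETICAL_ACCESSION": [],             # placeholder, not a real ID
    }
    for message in summarise_lookup(example)[2]:
        print(message)
```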
The open data page for this is now up at https://registry.opendata.aws/ncbi-sra/