kids-first/kf-api-dataservice

Add access_urls hybrid property to genomic file

dankolbman opened this issue · 3 comments

Currently, the genomic file urls field returns the list of file locations from gen3. These are raw file locations when we own the file. However, some files live within another gen3 deployment.

We should unify these so that they are all urls pointing to gen3 deployments, and return them under access_urls. In the current situation, this means that all urls to files in our s3 buckets should be replaced with urls pointing to our gen3 deployment.

Moved to #484

Expanding on this, as I see it currently...

(nb. I'm using access_url instead of gen3_url to preserve generality)

Short term: s3:// file urls should be substituted with https://<our_gen3_domain>/<latest_did> before emitting
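A minimal sketch of the short-term substitution, assuming the record already knows its latest_did. The function name `substitute_s3_url` and the `GEN3_DOMAIN` constant are hypothetical, and the domain is the placeholder used in this thread, not a real host.

```python
# Placeholder domain taken from this thread; the real gen3 domain is an assumption.
GEN3_DOMAIN = "data.kidsfirst.yadayada"


def substitute_s3_url(url: str, latest_did: str, gen3_domain: str = GEN3_DOMAIN) -> str:
    """Return a gen3 url for s3:// locations; pass any other url through unchanged."""
    if url.startswith("s3://"):
        return f"https://{gen3_domain}/{latest_did}"
    return url
```

Urls that already point at another gen3 deployment are returned as-is, so only raw s3 locations are rewritten before emitting.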

Medium term:

  • On POST/PATCH:
    • Ingest additionally specifies whose access credential domain should control access to the file. If it's ours (data.kidsfirst.yadayada), the submitted url gets loaded into our gen3 and we set access_url as our gen3 domain + the latest_did. Otherwise access_url is the submitted url.
    • Dataservice stores size/hashes/etc directly instead of later relying on responses from another server.
  • On GET:
    • Show size/hashes/etc directly from dataservice instead of reflecting values from another server.
    • access_url will be the new field that indicates where one should go to access the file.
    • raw_url (or whatever) could be optionally added if the file is one of ours, reflecting where our gen3 thinks the file is.
  • Portal ETL will:
    • Directly copy the access_url field for file access.
    • Either parse which credentials to use from the access_url or we could optionally include a field that indicates the access credential domain with GET.
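The POST/PATCH branch above could look roughly like the sketch below. Everything here is hypothetical: `FileRecord`, `register_file`, and the `access_domain` argument are illustrative names, and the step that actually loads the url into gen3 is left as a comment because its interface is not described in this thread.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Placeholder for our access credential domain, as written in this thread.
OUR_DOMAIN = "data.kidsfirst.yadayada"


@dataclass
class FileRecord:
    access_url: str          # where one should go to access the file
    raw_url: Optional[str]   # only set when the file is one of ours
    size: int                # stored directly in the dataservice
    hashes: Dict[str, str]   # stored directly in the dataservice


def register_file(submitted_url: str, access_domain: str, latest_did: str,
                  size: int, hashes: Dict[str, str]) -> FileRecord:
    """Decide access_url from the domain that controls access to the file."""
    if access_domain == OUR_DOMAIN:
        # Assumption: the submitted url has been loaded into our gen3 at this
        # point, and latest_did is the identifier gen3 handed back.
        return FileRecord(f"https://{OUR_DOMAIN}/{latest_did}", submitted_url, size, hashes)
    # Externally controlled file: the submitted url is already the access url.
    return FileRecord(submitted_url, None, size, hashes)
```

On GET, the dataservice would then return these stored fields directly rather than reflecting values from another server.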

access_url examples:

https://data.kidsfirst.yadayada/3b82fad9-55da-402f-a446-c86029720ff3
or
https://api.gdc.cancer.gov/data/3b82fad9-55da-402f-a446-c86029720ff3
but not
s3://kf-study-buckets-lol/3b82fad9-55da-402f-a446-c86029720ff3.bam
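For the Portal ETL option of parsing the credential domain out of access_url, a sketch using the standard library's `urllib.parse.urlparse`. The function name `credential_domain` is hypothetical, and treating the url host as the credential domain is an assumption.

```python
from urllib.parse import urlparse


def credential_domain(access_url: str) -> str:
    """Return the host of an https access_url; reject raw locations such as s3://."""
    parts = urlparse(access_url)
    if parts.scheme != "https" or not parts.hostname:
        raise ValueError(f"not a valid access_url: {access_url}")
    return parts.hostname
```

With the examples above, the first two urls yield data.kidsfirst.yadayada and api.gdc.cancer.gov, while the s3:// url raises ValueError.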

Yes? No? Maybe?

@fiendish let's move the medium term to a new issue and include some example requests/responses? Maybe a design document, if that feels more natural.

We may also consider storing the data url in the urls array for external files instead. That is, use the urls that route to /data/<uuid>. This is the true url that one would go to in order to download the actual file, as well as the url that the portal will route to.

DO

Given urls:

["s3://kf-study-buckets-lol/3b82fad9-55da-402f-a446-c86029720ff3.bam"]

Return access_urls:

["https://gen3.kidsfirst.com/data/3b82fad9-55da-402f-a446-c86029720ff3"]

DO NOT

Given urls:

["s3://kf-study-buckets-lol/3b82fad9-55da-402f-a446-c86029720ff3.bam"]

Return access_urls:

["https://gen3.kidsfirst.com/index/index/3b82fad9-55da-402f-a446-c86029720ff3"]
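The DO case could be sketched as a read-only property on the genomic file. This uses a plain dataclass and `@property` as a stand-in for the real model and its hybrid property. The `GenomicFile` shape, the `latest_did` field, and `GEN3_HOST` are assumptions based on the examples in this thread.

```python
from dataclasses import dataclass, field
from typing import List

# Host taken from the example urls above; treat it as a placeholder.
GEN3_HOST = "gen3.kidsfirst.com"


@dataclass
class GenomicFile:
    latest_did: str
    urls: List[str] = field(default_factory=list)

    @property
    def access_urls(self) -> List[str]:
        """Map s3:// locations to the gen3 /data/<did> route; keep other urls as-is."""
        return [
            f"https://{GEN3_HOST}/data/{self.latest_did}" if url.startswith("s3://") else url
            for url in self.urls
        ]
```

The property always builds the /data/ route, never the /index/ route, because /data/<uuid> is the url one would use to download the file.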