nih-sparc/sparc-api

Documentation on Data Staging Environment: Getting test data into the curation dev staging


Processing and displaying unpublished datasets in sparc-app: a full guide

This issue is where the documentation for the data staging environment will be stored until the data staging environment has its own repo.

An example of this running can be found at:
https://context-cards-demo-stage.herokuapp.com/maps
(note that only the maps page is implemented currently)

Because of this, we will first focus on the changes needed to display an unpublished or updated dataset on the /maps page. Much of this also applies to the /data page, but a fair amount of work is still needed there to adapt the data pulled from scicrunch, because unpublished datasets return less data than published ones.

Overview of steps

Part A: Staging datasets and running the staging site
A1. Setting up environment variables
A2. Dataset processing from scicrunch
A3. Running the site

Part B: Understanding and modifying the current implementation
B1. Modifying sparc-api requests to use pennsieveId as opposed to DOI
B2. Downloading files from the pennsieve python client as opposed to from s3
B3. Downloading dataset thumbnails from pennsieve (as opposed to discover)
B4. Front end changes

Part A: Staging datasets and running the staging site

Step 1: Setting up environment variables

The following environment variables are needed to stage and retrieve the datasets. They fall into two categories: those that can stay the same as in normal development and those that need to be changed.

Same as normal:

ALGOLIA_API_KEY=XXXXXXX
ALGOLIA_APP_ID=XXXX
AWS_USER_POOL_WEB_CLIENT_ID=XXXXX
KNOWLEDGEBASE_KEY=XXXXXXXXXXXXX

The pennsieve keys must have access to the desired datasets for staging to see them:

PENNSIEVE_API_TOKEN=XXXXXXX
PENNSIEVE_API_SECRET=XXXXXXX

And these are set to the curation index:

SCICRUNCH_HOST=https://scicrunch.org/api/1/elastic/SPARC_PortalDatasets_stage
ALGOLIA_INDEX=k-core_curation

Note that ALGOLIA_INDEX is a front-end setting: it is set in sparc-app.

Feel free to slack or email me if you are working on this and need any of these keys
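
Before starting the API it can help to confirm that the variables above are actually set. The following is a minimal sketch, assuming the variables are read from the environment of the process you run it in. The helper name missing_staging_vars is hypothetical and not part of sparc-api.

```python
import os

# Variables listed above. ALGOLIA_INDEX is left out because it is set in sparc-app.
REQUIRED_VARS = [
    "ALGOLIA_API_KEY",
    "ALGOLIA_APP_ID",
    "AWS_USER_POOL_WEB_CLIENT_ID",
    "KNOWLEDGEBASE_KEY",
    "PENNSIEVE_API_TOKEN",
    "PENNSIEVE_API_SECRET",
    "SCICRUNCH_HOST",
]


def missing_staging_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_staging_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All staging variables are set.")
```

Passing a plain dict instead of os.environ makes the check easy to try without touching your shell.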

Step 2: Dataset Processing

Datasets can be put through the scicrunch elastic search processing via a url.

2a: Check if staging is ready to run

https://sparc.scicrunch.io/sparc/stage?api_key=<KNOWLEDGEBASE_KEY>
where <KNOWLEDGEBASE_KEY> is your KNOWLEDGEBASE_KEY.

There is no queue for processing, so datasets can only be processed one at a time. The status returned by this url tells you whether the server is available and ready for the next dataset.

2b: Submit dataset for staging

Use this url:
https://sparc.scicrunch.io/sparc/stage?api_key=<KNOWLEDGEBASE_KEY>&datasetID=<pennsieve-id>

where <pennsieve-id> is the pennsieve identifier, e.g. 5c0a31f6-4926-4091-8876-3b11af7846ed
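
The two urls from steps 2a and 2b can also be built in code. This is a small sketch using only the Python standard library. The function names stage_status_url and stage_submit_url are hypothetical, and the key shown is a placeholder.

```python
from urllib.parse import urlencode

# Staging endpoint taken from the urls in steps 2a and 2b
STAGE_ENDPOINT = "https://sparc.scicrunch.io/sparc/stage"


def stage_status_url(api_key):
    """Url for step 2a: check whether staging is ready to run."""
    return f"{STAGE_ENDPOINT}?{urlencode({'api_key': api_key})}"


def stage_submit_url(api_key, pennsieve_id):
    """Url for step 2b: submit one dataset for staging."""
    params = {"api_key": api_key, "datasetID": pennsieve_id}
    return f"{STAGE_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    key = "XXXXXXXXXXXXX"  # placeholder for your KNOWLEDGEBASE_KEY
    print(stage_status_url(key))
    print(stage_submit_url(key, "5c0a31f6-4926-4091-8876-3b11af7846ed"))
```

The sketch stops at building the urls because the response format of the staging server is not described here.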

Step 3: Running the site

3a: Retrieve the staging repos

Use the staging branch of sparc-api:
#157

And this branch of sparc-app:
https://github.com/Tehsurfer/sparc-app/tree/new-staging

Set the sparc-api endpoint on sparc-app to where you are running it. (Often http://localhost:5000/)

That should be everything needed to view staged datasets on the /maps page

Part B: Understanding and modifying the current implementation

Next we will go into how it works and how to develop it further.

I currently don't know of any tickets to develop this further, but I imagine there will be a ticket soon to be able to stage datasets across all of sparc.science with a logged in user's pennsieve keys.

B1. Modifying sparc-api requests to use pennsieveId as opposed to DOI

This is as simple as modifying the elastic search query to use the pennsieveId as opposed to the DOI:

def create_pennsieve_id_query(pennsieveId):
    # Match the dataset whose identifier is the full pennsieve dataset id
    query = {
        "size": 50,
        "from": 0,
        "query": {
            "term": {
                "item.identifier.aggregate": {
                    "value": f"N:dataset:{pennsieveId}"
                }
            }
        }
    }

    return query

The results from here can be processed with app/scicrunch_process_results.py. The function used is _prepare_results(results).
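
To show where the query ends up, here is a hedged sketch of how it might be sent. It assumes the index at SCICRUNCH_HOST accepts the standard Elasticsearch /_search path and takes the key as an api_key query parameter; neither detail is confirmed above. build_search_request is a hypothetical helper that only prepares the arguments, so nothing is sent over the network.

```python
def create_pennsieve_id_query(pennsieve_id):
    """Elastic search query that matches one dataset by its pennsieve id."""
    return {
        "size": 50,
        "from": 0,
        "query": {
            "term": {
                "item.identifier.aggregate": {
                    "value": f"N:dataset:{pennsieve_id}"
                }
            }
        },
    }


def build_search_request(scicrunch_host, api_key, pennsieve_id):
    """Return keyword arguments that could be passed to requests.post(**kwargs).

    Assumption: the search path is '/_search' and the key is a query parameter.
    """
    return {
        "url": f"{scicrunch_host}/_search",
        "params": {"api_key": api_key},
        "json": create_pennsieve_id_query(pennsieve_id),
    }


if __name__ == "__main__":
    kwargs = build_search_request(
        "https://scicrunch.org/api/1/elastic/SPARC_PortalDatasets_stage",
        "XXXXXXXXXXXXX",  # placeholder for your KNOWLEDGEBASE_KEY
        "5c0a31f6-4926-4091-8876-3b11af7846ed",
    )
    print(kwargs["url"])
```

If those assumptions hold, requests.post(**kwargs).json() would give the raw results to hand to _prepare_results.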

B2. Downloading files from the pennsieve python client as opposed to from s3

We first decide which download method to use from the length of the id (pennsieve ids are longer than discover ids). This is necessary because scicrunch does not tell us which type of id it returns for unpublished datasets.

# This version of s3-resource is used for accessing files on staging that have never been published.
# app, bfWorker and direct_download_url are assumed to be defined elsewhere in the module.
@app.route("/s3-resource/<path:path>")
def direct_download_url2(path):
    filePath = path.split('files/')[-1]
    pennsieveId = path.split('/')[0]

    # If length is small, we have a pennsieve discover id. We will process this one with the normal s3-resource route
    if len(pennsieveId) <= 4:
        return direct_download_url(path)

    # Add the dataset prefix unless the id is already a package id
    if 'N:package:' not in pennsieveId:
        pennsieveId = 'N:dataset:' + pennsieveId

    url = bfWorker.getURLFromDatasetIdAndFilePath(pennsieveId, filePath)
    if url is not None:
        resp2 = requests.get(url)
        return resp2.content
    # Return the status code alongside the JSON body
    return jsonify({'error': 'error with the provided ID'}), 502

Note that url = bfWorker.getURLFromDatasetIdAndFilePath(pennsieveId, filePath) retrieves a temporary download url from the pennsieve python client for a pennsieve id and file path.

If the dataset does have a discover id, we need to retrieve the pennsieve id to use on the pennsieve python client.

You could try to avoid the call to scicrunch that translates between the discover id and the pennsieve id, but I believe I did it this way to keep the downloads consistent.
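
The id check at the top of the route can be pulled out and tried on its own. The sketch below mirrors the splitting and length test from the route above. classify_resource_path is a hypothetical helper, and the example paths are made up to illustrate the two cases.

```python
def classify_resource_path(path):
    """Split an s3-resource path and decide which download method applies.

    Returns (source, dataset_id, file_path), where source is
    'discover' for short ids and 'pennsieve' for long ones.
    """
    dataset_id = path.split("/")[0]
    file_path = path.split("files/")[-1]

    # Short ids are discover ids and go through the normal s3 route
    if len(dataset_id) <= 4:
        return "discover", dataset_id, file_path

    # Long ids are pennsieve ids; add the dataset prefix unless it is a package id
    if "N:package:" not in dataset_id:
        dataset_id = "N:dataset:" + dataset_id
    return "pennsieve", dataset_id, file_path


if __name__ == "__main__":
    # Hypothetical example paths
    print(classify_resource_path("76/2/files/derivative/thumb.png"))
    print(classify_resource_path(
        "5c0a31f6-4926-4091-8876-3b11af7846ed/files/derivative/thumb.png"
    ))
```

One consequence of the length test is that a discover id with five or more characters would be treated as a pennsieve id, so the threshold may need revisiting as the number of published datasets grows.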

B3. Downloading dataset thumbnails from pennsieve (as opposed to discover)

The banners unfortunately cannot be returned from the pennsieve python client or discover, so we must use the pennsieve REST api. The endpoint used for this is /getbanner

Note that in order to use the pennsieve REST api you must log in via the s3 auth system

@app.route("/get_banner/<datasetId>")
def get_banner_pen(datasetId):
    p_temp_key = pennsieve_login()
    ban = get_banner(p_temp_key, datasetId)
    return ban

B4. Front end changes

Since unpublished datasets return less content, checks likely need to be added to keep the site from accessing properties of undefined and crashing.

I did this by adding more logic to check that fields exist, but it may be more complicated to implement on the /datasets page, where a lot of processing is done in one big async data block.

The code to get this running on the /maps page is available here:
https://github.com/Tehsurfer/map-sidebar-curation

If you have any questions about the implementation or have ideas on how to do this better feel free to chat here or dm me.