
OEB_workflows_data_migration (BETA version)

Description

Application used by community managers to migrate the results of a benchmarking workflow (for instance, from the Virtual Research Environment) to the OpenEBench scientific benchmarking database. It takes the minimal datasets from the workflow's 'consolidated results', adds the metadata needed to validate against the Benchmarking Data Model together with the required OEB keys, builds the necessary TestActions, and finally pushes everything to the OpenEBench temporary database.
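
As an orientation, a single assessment entry inside such a 'consolidated results' file tends to look roughly like the sketch below, written here as a Python literal so the assumptions can be annotated. All identifiers are made up, and the authoritative field list is the minimal_bdm_oeb_level2.yaml schema referenced later on this page:

# Hypothetical minimal assessment entry; every value is a placeholder, and the
# field names should be double-checked against minimal_bdm_oeb_level2.yaml.
assessment_entry = {
    "_id": "MyCommunity:2024-01_MyChallenge_precision_my-tool",  # made-up id
    "community_id": "MyCommunity",   # community label used by the workflow
    "challenge_id": "MyChallenge",   # level-2 id, later mapped to an OEB challenge
    "type": "assessment",            # one of: participant, assessment, aggregation
    "participant_id": "my-tool",     # the tool whose predictions were evaluated
    "metrics": {
        "metric_id": "precision",    # level-2 id, later mapped to an OEB metric
        "value": 0.95,
        "stderr": 0.01,
    },
}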

Prerequisites for moving workflow results to OEB

In order to use the migration tool, some requirements need to be fulfilled:

  • The benchmarking event, challenges, metrics, and input/reference datasets that the results refer to should already be registered in OpenEBench and have official OEB identifiers.
  • IDs of challenges and metrics used in the workflow should be annotated in the corresponding OEB objects (in the _metadata:level_2 field) so that they can be mapped to the registered OEB elements (see the sketch after this list).
  • The tool that computed the predictions in the input file should also be registered in OpenEBench.
  • The 'consolidated results' file should come from a pipeline that follows the OpenEBench Benchmarking Workflows Standards. (If any of these requirements is not satisfied, a form should be provided so that the manager or developer can 'inaugurate' the required object in OEB)
  • NOTE: this tool just uplifts and uploads the minimal datasets generated by a benchmarking workflow to the OEB database; it does NOT update the reference aggregation dataset referenced by the workflow (for instance, the one used by VRE to instantiate the workflow). In order to update the official OEB aggregation datasets used by a VRE workflow, please contact the OEB team so they can copy them manually to the corresponding reference directory (/gpfs/VRE/public/aggreggation/<workflow_name>)
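
For the annotation requirement above, the general idea is that each registered OEB object carries, inside its _metadata block, the level-2 identifier used by the workflow, so the tool can map one onto the other. A purely hypothetical sketch, since the exact key names depend on how the objects were registered for your community:

# Hypothetical fragment of a registered OEB Challenge; both the official OEB
# identifier and the key used inside _metadata are illustrative assumptions.
challenge_fragment = {
    "_id": "OEBX0010000001",  # made-up official OEB identifier
    "_metadata": {
        # assumed key pairing the OEB object with the workflow-side challenge id
        "level_2:challenge_id": "MyChallenge",
    },
}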

Parameters

usage: push_data_to_oeb.py [-h] -i DATASET_CONFIG_JSON -cr OEB_SUBMIT_API_CREDS [-tk OEB_SUBMIT_API_TOKEN]
                           [--val_output VAL_OUTPUT] [--skip-min-validation] [-o SUBMIT_OUTPUT_FILE] [--dry-run]
                           [--trust-rest-bdm] [--log-file LOGFILENAME] [-q] [-v] [-d]
                           [--payload-mode {as-is,threshold,force-inline,force-payload}]

OEB Level 2 push_data_to_oeb

optional arguments:
  -h, --help            show this help message and exit
  -i DATASET_CONFIG_JSON, --dataset_config_json DATASET_CONFIG_JSON
                        json file which contains all parameters for dataset consolidation and migration (default: None)
  -cr OEB_SUBMIT_API_CREDS, --oeb_submit_api_creds OEB_SUBMIT_API_CREDS
                        Credentials and endpoints used to obtain a token for submission to oeb sandbox DB (default: None)
  -tk OEB_SUBMIT_API_TOKEN, --oeb_submit_api_token OEB_SUBMIT_API_TOKEN
                        Token used for submission to oeb buffer DB. If it is not set, the credentials file provided with -cr
                        must have defined 'clientId', 'grantType', 'user' and 'pass' (default: None)
  --val_output VAL_OUTPUT
                        Save the JSON Schema validation output to a file (default: /dev/null)
  --skip-min-validation
                        If you are 100% sure the minimal dataset is valid, skip the early validation (useful for huge
                        datasets) (default: False)
  -o SUBMIT_OUTPUT_FILE
                        Save what it was going to be submitted in this file (default: None)
  --dry-run             Only validate, do not submit (dry-run) (default: False)
  --trust-rest-bdm      Trust on the copy of Benchmarking data model referred by server, fetching from it instead from GitHub.
                        (default: False)
  --log-file LOGFILENAME
                        Store logging messages in a file instead of using standard error and standard output (default: None)
  -q, --quiet           Only show engine warnings and errors (default: None)
  -v, --verbose         Show verbose (informational) messages (default: None)
  -d, --debug           Show debug messages (use with care, as it could potentially disclose sensitive contents) (default: None)
  --payload-mode {as-is,threshold,force-inline,force-payload}
                        On Dataset entries, how to deal with inline and external payloads (default: as-is)

The minimal/partial dataset to be uplifted to the OpenEBench benchmarking data model should validate against the schema minimal_bdm_oeb_level2.yaml available here, using ext-json-validate with a command line similar to:

ext-json-validate --iter-arrays --guess-schema oeb_level2/schemas/minimal_bdm_oeb_level2.yaml minimal_dataset_examples/results_example.json

An example of the dataset is available here. That dataset should be declared through a config.json file stating the URL or relative path where it is located (the file should follow the JSON Schema submission_form_schema.json available here; you have an example here). You also have to set up an auth_config.json with the different credentials (template here, and JSON Schema auth_config_schema.json available here).
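
The sketches below (again Python literals, so the assumptions can be annotated) only illustrate the general shape of those two files; the authoritative definitions are submission_form_schema.json and auth_config_schema.json:

# Hypothetical config.json content: the one certainty is that it declares
# where the minimal dataset lives; any other fields must be taken from
# submission_form_schema.json.
dataset_config = {
    "consolidated_oeb_data": "minimal_dataset_examples/results_example.json",  # URL or relative path
}

# auth_config.json: per the -tk help text above, when no token is passed the
# credentials file must define 'clientId', 'grantType', 'user' and 'pass'.
# All values are placeholders, and 'password' as the grant type is an assumption.
auth_config = {
    "clientId": "your-client-id",
    "grantType": "password",
    "user": "your-oeb-user",
    "pass": "your-oeb-password",
}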

# The command must be run with the virtual environment enabled

# This one uplifts the dataset, but it does not load the data in the database
python push_data_to_oeb.py -i config.json -cr your_auth_config.json --trust-rest-bdm --dry-run -o uplifted.json
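
Combining --dry-run with -o in this way lets you inspect the uplifted content (uplifted.json here) before anything reaches the database.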

You can also validate the uplifted dataset in depth with the next command:

ln -s uplifted.json uplifted.json.array
oeb-uploader.py --base_url https://openebench.bsc.es/api/scientific -cr your_auth_config.json --trust-rest-bdm --deep-bdm-dir . --dry-run uplifted.json.array

In order to upload the data to the sandbox, you can either use oeb-uploader.py, telling it the community:

ln -s uplifted.json uplifted.json.array
oeb-uploader.py --base_url https://openebench.bsc.es/api/scientific -cr your_auth_config.json --trust-rest-bdm --deep-bdm-dir . --community-id OEBC999 uplifted.json.array
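
(OEBC999 above is a placeholder; pass your community's official OEB identifier instead.)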

or run the command again without --dry-run (keeping a copy of the uploaded content):

python push_data_to_oeb.py -i config.json -cr your_auth_config.json --trust-rest-bdm -o uplifted.json

Last, remember to promote the uploaded entries from the sandbox to the staged database:

oeb-sandbox.py --base_url https://openebench.bsc.es/api/scientific -cr your_auth_config.json dry-run
oeb-sandbox.py --base_url https://openebench.bsc.es/api/scientific -cr your_auth_config.json stage

or to remove sandbox contents:

oeb-sandbox.py --base_url https://openebench.bsc.es/api/scientific -cr your_auth_config.json discard

Development

First, install the Python development dependencies in the very same virtual environment as the runtime ones:

python3 -m venv .py3env
source .py3env/bin/activate
pip install -r dev-requirements.txt -r mypy-requirements.txt
pre-commit install

so every commit is checked against pylint and mypy before it is accepted.

If you change oeb_level2/schemas/submission_form_schema.json, you have to run one of the following commands:

pre-commit run --hook-stage manual jsonschema-gentypes

# or

pre-commit run -a --hook-stage manual jsonschema-gentypes

in order to re-generate oeb_level2/schemas/typed_schemas/ contents.