Polish metadata
Closed this issue · 1 comments
jonrkarr commented
I looked at the metadata that's been assembled (https://github.com/biosimulations/biosimulations-physiome/blob/dev/projects.json). It looks pretty good. A few attributes need to be transformed for BioSimulators-utils, and a couple of additional pieces of information could be scraped (Markdown formatting for the description, the license, and a timestamp from the Git commit).
- `identifier`:
  - Because of the two namespaces (`e` and `exposure`), I think the identifiers need to be `e/xxx` or `exposure/xxxxxxxxxxxxxx`.
  - BioSimulators-utils expects the key to be `identifiers` rather than `identifier`.
  - BioSimulators-utils expects the value to be a list of dictionaries with keys
    - `uri`: `http://identifiers.org/pmr:e/xxx`
    - `label`: `pmr:e/xxx`
- `hash`: FYI, BioSimulators-utils will ignore this. The hash could be encoded into the `source`. See next bullet.
- `source`:
  - BioSimulators-utils expects this key to be `sources` rather than `source`.
  - BioSimulators-utils expects the value to be a dictionary with two keys
    - `uri`: `http://identifiers.org/pmr.workspace:35f/@@file/81ef7ed4cf06f0cd4b87da239d282fc559738796` (I think the hash could be encoded here; I just requested an identifiers.org prefix for this)
    - `label`: `pmr.workspace:35f/@@file/81ef7ed4cf06f0cd4b87da239d282fc559738796`
- `title`: Looks good
- `description`: Markdown can be used to capture the formatting. Here's a sketch of how it can be done:

  ```python
  import bs4
  import markdownify
  import requests

  response = requests.get('https://models.physiomeproject.org/e/3fd')
  response.raise_for_status()
  html = response.content

  # Name the parser explicitly to avoid bs4's "no parser specified" warning
  doc = bs4.BeautifulSoup(html, 'html.parser')
  content_core = doc.find(id='content-core').find('div')

  # Drop figure tables and images, which don't belong in a text description
  for table in content_core.find_all(class_='tmp-doc-informalfigure table'):
      table.decompose()
  for image in content_core.find_all('img'):
      image.decompose()

  description = markdownify.MarkdownConverter().convert_soup(content_core).strip()
  ```
- `summary`: empty strings (`""`) should be converted to `null`
- `thumbnails`: For BioSimulators-utils, the thumbnails will need to be downloaded, and the values of the `thumbnails` attribute will need to be converted to a path within the COMBINE archive (i.e., strip off everything up to the identifier).
- `tags`:
  - BioSimulators-utils expects this key to be `keywords` rather than `tags`.
  - BioSimulators-utils expects the value of this key to be a list of strings.
- `citation`:
  - BioSimulators-utils expects this key to be `references` rather than `citation`.
  - BioSimulators-utils expects the value to be a list of dictionaries with keys
    - `uri`: `http://identifiers.org/pubmed/19486676`
    - `label`: `An integrated model of eicosanoid metabolism and signaling based on lipidomics flux analysis. ...`. This method can be used to look up more complete metadata and generate a human-readable label for references: https://github.com/biosimulators/Biosimulators_utils/blob/f8370913679828b6a45dad047123c0ab84a6f43d/biosimulators_utils/ref/utils.py#L23.
- `authors`: BioSimulators-utils expects authors to be a list of dictionaries with keys
  - `uri`: `null` (preferably this would be ORCIDs, but we don't know these)
  - `label`: e.g., `Geoffrey Nunns`
- `contributors`: BioSimulators-utils expects contributors to be a list of dictionaries with keys
  - `uri`: `http://identifiers.org/orcid:0000-0001-5801-5510` (preferably an ORCID; another URI, such as your personal website or GitHub profile, is fine too)
  - `label`: `Bilal Shaikh`
- `license`:
  - I think we should scrape this because people care about preserving license information.
  - Most, but not all, models are licensed `CC BY 3.0`. I think we need to scrape this from the web pages. Some pages say `The terms of use/license for this work is unspecified.` (i.e., `"license": null`). Some pages don't say anything about licenses, which I guess we can interpret as `"license": null`.
    - I think you could scrape this text from each model, calculate the set of unique strings (possibly as small as 2), and then assign each to the appropriate SPDX id.
  - BioSimulators-utils expects the license to be captured as a dictionary with two keys
    - uri: `http://identifiers.org/spdx:CC-BY-3.0`
    - label: `CC BY 3.0`
- `created`: I think this could be set equal to the timestamp for the Git commit.
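
To make the renames above concrete, here is a rough sketch of how one `projects.json` entry might be converted. The input shapes are assumptions, not taken from the actual file: I'm guessing that `identifier` already carries its namespace (e.g., `e/3fd`), that `source` holds the workspace id, that `tags` and `authors` are lists of strings, and that `license` holds the scraped license text. The function names and the example record are made up for illustration; none of this is BioSimulators-utils code.

```python
# Hypothetical converter: the input field shapes below are assumptions,
# not the actual structure of projects.json.
IDENTIFIERS_ORG = "http://identifiers.org/"

# Scraped license strings mapped to SPDX ids; None means "license": null.
LICENSE_TO_SPDX = {
    "CC BY 3.0": "CC-BY-3.0",
    "The terms of use/license for this work is unspecified.": None,
}


def link(curie):
    """Build a {uri, label} dictionary from an identifiers.org CURIE."""
    return {"uri": IDENTIFIERS_ORG + curie, "label": curie}


def convert_license(text):
    """Return a {uri, label} dictionary, or None for unknown/unspecified text."""
    spdx_id = LICENSE_TO_SPDX.get(text)
    if spdx_id is None:
        return None
    return {"uri": IDENTIFIERS_ORG + "spdx:" + spdx_id, "label": text}


def convert_project(project):
    """Rename and reshape the keys discussed in the list above."""
    # Assumes `source` is the workspace id and `hash` is the commit hash.
    workspace = "pmr.workspace:{}/@@file/{}".format(
        project["source"], project["hash"]
    )
    return {
        "identifiers": [link("pmr:" + project["identifier"])],
        "sources": link(workspace),
        "title": project["title"],
        # Empty strings become None (serialized as null in JSON).
        "summary": project.get("summary") or None,
        "keywords": [str(tag) for tag in project.get("tags", [])],
        # ORCIDs are unknown, so each author's uri is None.
        "authors": [
            {"uri": None, "label": name} for name in project.get("authors", [])
        ],
        "license": convert_license(project.get("license")),
    }


# Illustrative record only; the values are not a real projects.json entry.
example = {
    "identifier": "e/3fd",
    "source": "35f",
    "hash": "81ef7ed4cf06f0cd4b87da239d282fc559738796",
    "title": "Example model",
    "summary": "",
    "tags": ["signaling"],
    "authors": ["Geoffrey Nunns"],
    "license": "CC BY 3.0",
}
converted = convert_project(example)
```

Keeping the license mapping in a plain dictionary means any newly encountered license string falls through to `None` until someone assigns it an SPDX id, which seems like the safer default. `thumbnails`, `references`, `contributors`, and `created` are left out here because they need downloads, reference lookups, or Git history.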