biosimulations/biosimulations-physiome

Polish metadata

Closed this issue · 1 comment

I looked at the metadata that's been assembled (https://github.com/biosimulations/biosimulations-physiome/blob/dev/projects.json). It looks pretty good. A few attributes need to be transformed for BioSimulators-utils, and a couple of additional pieces of information could be scraped (Markdown formatting for the description, the license, and the timestamp from the Git commit).

  • identifier:
    • Because of the two namespaces (e and exposure), I think the identifiers need to be e/xxx or exposure/xxxxxxxxxxxxxx
    • BioSimulators-utils expects the key to be identifiers rather than identifier.
    • BioSimulators-utils expects the value to be a list of dictionaries with keys
      • uri: http://identifiers.org/pmr:e/xxx
      • label: pmr:e/xxx
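    • For example, here's a minimal sketch of the transformation (project is a hypothetical dictionary for one entry of projects.json; 'e/3fd' is the exposure used in the description sketch below):
      pmr_id = 'e/3fd'
      project['identifiers'] = [
          {
              'uri': 'http://identifiers.org/pmr:' + pmr_id,
              'label': 'pmr:' + pmr_id,
          },
      ]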
  • hash: FYI, BioSimulators-utils will ignore this. The hash could be encoded into the source. See next bullet.
  • source:
    • BioSimulators-utils expects this key to be sources rather than source
    • BioSimulators-utils expects the value to be a dictionary with two keys
      • uri: http://identifiers.org/pmr.workspace:35f/@@file/81ef7ed4cf06f0cd4b87da239d282fc559738796 (I think the hash could be encoded here; I just requested an identifiers.org prefix for this)
      • label: pmr.workspace:35f/@@file/81ef7ed4cf06f0cd4b87da239d282fc559738796
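    • For example (a sketch; project is the hypothetical dictionary from the sketch above):
      path = '35f/@@file/81ef7ed4cf06f0cd4b87da239d282fc559738796'
      project['sources'] = {
          'uri': 'http://identifiers.org/pmr.workspace:' + path,
          'label': 'pmr.workspace:' + path,
      }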
  • title: Looks good
  • description: Markdown can be used to capture the formatting. Here's a sketch of how it can be done:
    import bs4
    import markdownify
    import requests

    # download the exposure page
    response = requests.get('https://models.physiomeproject.org/e/3fd')
    response.raise_for_status()
    html = response.content

    # parse the page and pull out the main content
    doc = bs4.BeautifulSoup(html, 'html.parser')
    content_core = doc.find(id='content-core').find('div')

    # drop figures and images, which wouldn't carry over to Markdown
    for table in content_core.find_all(class_='tmp-doc-informalfigure table'):
        table.decompose()
    for image in content_core.find_all('img'):
        image.decompose()

    # convert the remaining HTML to Markdown
    description = markdownify.MarkdownConverter().convert_soup(content_core).strip()
  • summary: empty strings ("") should be converted to null
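    • A one-line sketch (project as above):
      project['summary'] = project['summary'] or None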
  • thumbnails: For BioSimulators-utils, the thumbnails will need to be downloaded, and the values of the thumbnails attribute will need to be converted to paths within the COMBINE archive (i.e., strip off everything up to the identifier).
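    • A rough sketch of the download-and-relabel step (project and pmr_id as above; the local file layout is an assumption):
      import os
      import requests

      archive_paths = []
      for url in project['thumbnails']:
          # keep only the part of the URL after the identifier
          archive_path = url.partition(pmr_id + '/')[2]
          response = requests.get(url)
          response.raise_for_status()
          os.makedirs(os.path.dirname(archive_path) or '.', exist_ok=True)
          with open(archive_path, 'wb') as file:
              file.write(response.content)
          archive_paths.append(archive_path)
      project['thumbnails'] = archive_paths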
  • tags:
    • BioSimulators-utils expects this key to be keywords rather than tags
    • BioSimulators-utils expects the value of this key to be a list of strings
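    • A one-line sketch (project as above; the tag values are assumed to already be strings):
      project['keywords'] = project.pop('tags')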
  • citation
  • authors: BioSimulators-utils expects authors to be a list of dictionaries with keys
    • uri: null (preferably these would be ORCIDs, but we don't know them)
    • label: e.g., Geoffrey Nunns
  • contributors: BioSimulators-utils expects contributors to be a list of dictionaries with keys
    • uri: http://identifiers.org/orcid:0000-0001-5801-5510 (preferably an ORCID; another URI, such as your personal website or GitHub profile, is fine too)
    • label: Bilal Shaikh
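    • A sketch of both transformations (project as above; authors are assumed to be stored as plain name strings in projects.json):
      project['authors'] = [
          {'uri': None, 'label': name}  # e.g., 'Geoffrey Nunns'
          for name in project.pop('authors')
      ]
      project['contributors'] = [
          {
              'uri': 'http://identifiers.org/orcid:0000-0001-5801-5510',
              'label': 'Bilal Shaikh',
          },
      ]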
  • license:
    • I think we should scrape this because people care about preserving license information.
    • Most, but not all, models are licensed CC BY 3.0. I think we need to scrape this from the web pages. Some pages say "The terms of use/license for this work is unspecified" (i.e., "license": null). Some pages don't say anything about licenses, which I guess we can also interpret as "license": null.
      • I think you could scrape this text from each model, calculate the set of unique strings (possibly as small as 2), and then assign each to the appropriate SPDX id.
    • BioSimulators-utils expects the license to be captured as a dictionary with two keys
      • uri: http://identifiers.org/spdx:CC-BY-3.0
      • label: CC BY 3.0
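    • A rough sketch of the scraping and SPDX mapping (the statement text in LICENSE_MAP is an assumption to refine once the set of unique strings is known):
      import bs4
      import requests

      LICENSE_MAP = {
          # hypothetical statement text -> SPDX entry
          'Creative Commons Attribution 3.0': {
              'uri': 'http://identifiers.org/spdx:CC-BY-3.0',
              'label': 'CC BY 3.0',
          },
      }

      def scrape_license(url):
          response = requests.get(url)
          response.raise_for_status()
          text = bs4.BeautifulSoup(response.content, 'html.parser').get_text()
          if 'The terms of use/license for this work is unspecified' in text:
              return None
          for statement, spdx in LICENSE_MAP.items():
              if statement in text:
                  return spdx
          return None  # the page says nothing about a license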
  • created: I think this could be set equal to the timestamp of the Git commit.
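    • A sketch, assuming each workspace has been cloned locally (the clone path is a placeholder):
      import subprocess

      created = subprocess.check_output(
          ['git', 'log', '-1', '--format=%cI'],  # committer date, strict ISO 8601
          cwd='path/to/workspace-clone',  # placeholder
      ).decode().strip()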

Replaced with other issues