biosimulations/biosimulations-physiome

Polish metadata

Closed this issue · 1 comment

I looked at the metadata that's been assembled (https://github.com/biosimulations/biosimulations-physiome/blob/dev/projects.json). It looks pretty good. A few attributes need to be transformed for BioSimulators-utils, and a couple of additional pieces of information could be scraped (Markdown formatting for the description, the license, and the timestamp from the Git commit).

  • identifier:
    • Because of the two namespaces (e and exposure), I think the identifiers need to be e/xxx or exposure/xxxxxxxxxxxxxx
    • BioSimulators-utils expects the key to be identifiers rather than identifier.
    • BioSimulators-utils expects the value to be a list of dictionaries with keys
      • uri: http://identifiers.org/pmr:e/xxx
      • label: pmr:e/xxx
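    • For example, here's a minimal sketch of the transformation (project is a hypothetical dictionary for one entry of projects.json; 'e/3fd' is the exposure used in the description sketch below):
      pmr_id = 'e/3fd'
      project['identifiers'] = [
          {
              'uri': 'http://identifiers.org/pmr:' + pmr_id,
              'label': 'pmr:' + pmr_id,
          },
      ]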
  • hash: FYI, BioSimulators-utils will ignore this. The hash could be encoded into the source. See next bullet.
  • source:
    • BioSimulators-utils expects this key to be sources rather than source
    • BioSimulators-utils expects the value to be a dictionary with two keys
      • uri: http://identifiers.org/pmr.workspace:35f/@@file/81ef7ed4cf06f0cd4b87da239d282fc559738796 (I think the hash could be encoded here; I just requested an identifiers.org prefix for this)
      • label: pmr.workspace:35f/@@file/81ef7ed4cf06f0cd4b87da239d282fc559738796
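    • For example (a sketch; project is the hypothetical dictionary from the sketch above):
      path = '35f/@@file/81ef7ed4cf06f0cd4b87da239d282fc559738796'
      project['sources'] = {
          'uri': 'http://identifiers.org/pmr.workspace:' + path,
          'label': 'pmr.workspace:' + path,
      }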
  • title: Looks good
  • description: Markdown can be used to capture the formatting. Here's a sketch of how it can be done:
    import bs4
    import markdownify
    import requests

    # download the exposure page
    response = requests.get('https://models.physiomeproject.org/e/3fd')
    response.raise_for_status()
    html = response.content

    # parse the page and pull out the main content
    doc = bs4.BeautifulSoup(html, 'html.parser')
    content_core = doc.find(id='content-core').find('div')

    # drop figures and images, which wouldn't carry over to Markdown
    for table in content_core.find_all(class_='tmp-doc-informalfigure table'):
        table.decompose()
    for image in content_core.find_all('img'):
        image.decompose()

    # convert the remaining HTML to Markdown
    description = markdownify.MarkdownConverter().convert_soup(content_core).strip()
  • summary: empty strings ("") should be converted to null
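    • A one-line sketch (project as above):
      project['summary'] = project['summary'] or None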
  • thumbnails: For BioSimulators-utils, the thumbnails will need to be downloaded, and the values of the thumbnails attribute will need to be converted to paths within the COMBINE archive (i.e., strip off everything up to the identifier).
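    • A rough sketch of the download-and-relabel step (project and pmr_id as above; the local file layout is an assumption):
      import os
      import requests

      archive_paths = []
      for url in project['thumbnails']:
          # keep only the part of the URL after the identifier
          archive_path = url.partition(pmr_id + '/')[2]
          response = requests.get(url)
          response.raise_for_status()
          os.makedirs(os.path.dirname(archive_path) or '.', exist_ok=True)
          with open(archive_path, 'wb') as file:
              file.write(response.content)
          archive_paths.append(archive_path)
      project['thumbnails'] = archive_paths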
  • tags:
    • BioSimulators-utils expects this key to be keywords rather than tags
    • BioSimulators-utils expects the value of this key to be a list of strings
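    • A one-line sketch (project as above; the tag values are assumed to already be strings):
      project['keywords'] = project.pop('tags')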
  • citation
  • authors: BioSimulators-utils expects authors to be a list of dictionaries with keys
    • uri: null (preferably these would be ORCIDs, but we don't know them)
    • label: e.g., Geoffrey Nunns
  • contributors: BioSimulators-utils expects contributors to be a list of dictionaries with keys
    • uri: http://identifiers.org/orcid:0000-0001-5801-5510 (preferably an ORCID; another URI, such as your personal website or GitHub profile, is fine too)
    • label: Bilal Shaikh
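    • A sketch of both transformations (project as above; authors are assumed to be stored as plain name strings in projects.json):
      project['authors'] = [
          {'uri': None, 'label': name}  # e.g., 'Geoffrey Nunns'
          for name in project.pop('authors')
      ]
      project['contributors'] = [
          {
              'uri': 'http://identifiers.org/orcid:0000-0001-5801-5510',
              'label': 'Bilal Shaikh',
          },
      ]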
  • license:
    • I think we should scrape this because people care about preserving license information.
    • Most, but not all, models are licensed CC BY 3.0. I think we need to scrape this from the web pages. Some pages say "The terms of use/license for this work is unspecified" (i.e., "license": null). Some pages don't say anything about licenses, which I guess we can also interpret as "license": null.
      • I think you could scrape this text from each model, calculate the set of unique strings (possibly as small as 2), and then assign each to the appropriate SPDX id.
    • BioSimulators-utils expects the license to be captured as a dictionary with two keys
      • uri: http://identifiers.org/spdx:CC-BY-3.0
      • label: CC BY 3.0
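    • A rough sketch of the scraping and SPDX mapping (the statement text in LICENSE_MAP is an assumption to refine once the set of unique strings is known):
      import bs4
      import requests

      LICENSE_MAP = {
          # hypothetical statement text -> SPDX entry
          'Creative Commons Attribution 3.0': {
              'uri': 'http://identifiers.org/spdx:CC-BY-3.0',
              'label': 'CC BY 3.0',
          },
      }

      def scrape_license(url):
          response = requests.get(url)
          response.raise_for_status()
          text = bs4.BeautifulSoup(response.content, 'html.parser').get_text()
          if 'The terms of use/license for this work is unspecified' in text:
              return None
          for statement, spdx in LICENSE_MAP.items():
              if statement in text:
                  return spdx
          return None  # the page says nothing about a license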
  • created: I think this could be set equal to the timestamp of the Git commit.
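    • A sketch, assuming each workspace has been cloned locally (the clone path is a placeholder):
      import subprocess

      created = subprocess.check_output(
          ['git', 'log', '-1', '--format=%cI'],  # committer date, strict ISO 8601
          cwd='path/to/workspace-clone',  # placeholder
      ).decode().strip()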

Replaced with other issues