specification for ingesting information from NITRC
Remi-Gau opened this issue · 10 comments
Goal
Come up with a specification for information sharing between NITRC and other resources that, keyed by a NITRC Resource ID, would return a specified set of content (website, license, version, etc.) in a specified format for injection into the repo
Opening this thread to start a discussion.
Tagging: @dnkennedy
So, a possible idea is that in the docs (*.md) some sort of special 'function' can be defined, such that *NITRC(text, ID_Source=ID) would propagate some standard resource descriptor block into the rendered document. For example, including *NITRC("FSL", NITRC_ID=25) in one of the markdown docs would autofill content such as homepage, download link, short description, RRID, NITRC link, etc. into the final rendered page. Being a little generous about sources, this would also permit ID sources like NITRC or RRID, etc.
So this 'specification' question has a number of parts: 1) how do we want the 'call' in the .md to look; 2) what do we want it to return; and 3) how would you like that return content formatted?
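To make part 1 (what the 'call' looks like) concrete, here is a minimal Python sketch that finds such calls in markdown text. The syntax is only the one from the `*NITRC("FSL", NITRC_ID=25)` example above; the regular expression and the `find_directives` helper are hypothetical, not an agreed specification.

```python
import re

# Hypothetical directive syntax, taken from the example in this thread:
#   *NITRC("FSL", NITRC_ID=25)
DIRECTIVE = re.compile(
    r'\*(?P<source>[A-Za-z]+)\(\s*"(?P<text>[^"]*)"\s*,\s*'
    r'(?P<key>[A-Za-z_]+)\s*=\s*(?P<value>[A-Za-z0-9_:]+)\s*\)'
)


def find_directives(markdown):
    """Return one dict per directive found in the markdown text."""
    return [match.groupdict() for match in DIRECTIVE.finditer(markdown)]


doc = 'We recommend *NITRC("FSL", NITRC_ID=25) for preprocessing.'
calls = find_directives(doc)
print(calls)
```

Each match carries the display text plus the ID source and value, which is what a later lookup step would need. A leading `*` can collide with markdown emphasis, so a different marker may be worth considering.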
I guess it will be a different discussion as to how to implement the desired specification...
Let me know if this seems reasonable...
Yes that makes sense and is definitely reasonable.
I had a similar idea in mind.
- how would you like that return content formatted?
I think it is important to keep something that newcomers who want to help on this repo might be able to grasp quickly and won't be "intimidated" by.
So I am tempted to go json
but I would be interested to hear others' take on this.
I briefly checked what NITRC requires for adding a new resource. This could be a basic json-schema for validating a new NITRC resource registration:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "http://nitrc.github.io/schemas/nitrc.schema.json",
"title": "NITRC resource metadata",
"description": "Metadata for registering a resource in NITRC",
"type": "object",
"properties": {
"resourceFullName": {"type": "string"},
"resourcePurposeAndSummary": {"type": "string"},
"domain": {"type": "string"},
"license": {"type": "string"},
"description": {"type": "string"},
"unixName": {"type": "string"},
"SCM": {"type": "string"},
"remoteRepository": {"type": "string", "format": "uri"},
"projectID": {"type": "string"},
"projectShortName": {"type": "string"}
},
"required": [ "resourceFullName", "resourcePurposeAndSummary",
"domain", "license", "description", "unixName",
"SCM", "remoteRepository", "projectID", "projectShortName"
]
}
It validates json data like this:
{
"resourceFullName": "supertool",
"resourcePurposeAndSummary": "A tool for doing super stuff",
"domain": "Things which are not yet super",
"license": "CC0",
"description": "Sometimes things are not super. You can change that",
"unixName": "st",
"SCM": "SCM",
"remoteRepository": "https://github.com/super/tool",
"projectID": "supertool",
"projectShortName": "st"
}
You can test it here:
https://www.jsonschemavalidator.net/
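For a quick local check without the online validator, the idea can be sketched in plain Python. This is a hand-rolled check of the "required" list and the string types only, not a full JSON Schema validator; the `check` helper is hypothetical, and the schema is rebuilt here in abridged form (no `$schema`, `$id` or `format`).

```python
# Field names copied from the schema posted above.
FIELDS = [
    "resourceFullName", "resourcePurposeAndSummary", "domain", "license",
    "description", "unixName", "SCM", "remoteRepository", "projectID",
    "projectShortName",
]

# Abridged version of the schema: every field is a required string.
schema = {
    "type": "object",
    "properties": {name: {"type": "string"} for name in FIELDS},
    "required": list(FIELDS),
}


def check(instance, schema):
    """Return a list of error messages; an empty list means the checks passed."""
    errors = []
    for name in schema.get("required", []):
        if name not in instance:
            errors.append(f"missing required field: {name}")
    for name, rule in schema.get("properties", {}).items():
        value = instance.get(name)
        if name in instance and rule.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name} should be a string")
    return errors


resource = {
    "resourceFullName": "supertool",
    "resourcePurposeAndSummary": "A tool for doing super stuff",
    "domain": "Things which are not yet super",
    "license": "CC0",
    "description": "Sometimes things are not super. You can change that",
    "unixName": "st",
    "SCM": "SCM",
    "remoteRepository": "https://github.com/super/tool",
    "projectID": "supertool",
    "projectShortName": "st",
}

print(check(resource, schema))

# Dropping a required field should be reported.
incomplete = {k: v for k, v in resource.items() if k != "license"}
print(check(incomplete, schema))
```

A real validator library would also cover `format`, nested objects and the rest of the draft-07 vocabulary; this sketch only shows what "required" and "type" buy us.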
Obviously, now I am thinking that it might make sense to curate our own list of resources using some sort of json and reinject the content into our markdown using the same approach.
It seems we are coming full circle, back to an "implementation" that is close to what Greg Kiar had suggested and used for a while, a long time ago.
https://github.com/brainhack101/neurolinks
https://github.com/brainhack101/neurolinks/tree/master/jsons
@dnkennedy does the list of properties described by Roberto above cover most of the things we could extract from NITRC, or are there some things that would be "dead easy" to get but are not in the submission forms?
The registration form is minimal, trying to make it 'relatively' easy to create a project. We try to get projects to add their homepage (if applicable), something to access (a download or a pointer to a download, for example, in the case of software), documentation (or a pointer thereto), and to fill in the 'category' selections, i.e. this is application software, it does statistical processing, it runs on a specific OS, it is written in python, it expects certain data types, etc.
NITRC does attempt to expose this content via RDF. For example, BrainBox (https://www.nitrc.org/projects/brainbox, NITRC_ID=1075) has associated with it: https://www.nitrc.org/projects/brainbox/biositemap.rdf
So, part of this might be 'simple' rdf2json...
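As a rough illustration of what an RDF-to-JSON step could look like, here is a Python sketch using only the standard library. The sample RDF/XML below is hypothetical and heavily simplified (the `ex:` vocabulary and the homepage URL are made up); the real biositemap.rdf uses its own vocabulary, so the property names would differ.

```python
import json
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical, simplified RDF/XML; not the actual content of biositemap.rdf.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/terms#">
  <rdf:Description rdf:about="https://www.nitrc.org/projects/brainbox">
    <ex:name>BrainBox</ex:name>
    <ex:homepage rdf:resource="https://example.org/brainbox-home"/>
  </rdf:Description>
</rdf:RDF>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def rdf_to_dict(rdf_xml):
    """Flatten each top-level RDF node into {property: value}."""
    root = ET.fromstring(rdf_xml)
    resources = {}
    for node in root:
        about = node.get(f"{{{RDF_NS}}}about", "")
        props = {}
        for child in node:
            # A property is either a link (rdf:resource) or a literal (element text).
            # Repeated properties overwrite each other in this simple sketch.
            value = child.get(f"{{{RDF_NS}}}resource") or (child.text or "").strip()
            props[local_name(child.tag)] = value
        resources[about] = props
    return resources


result = rdf_to_dict(SAMPLE)
print(json.dumps(result, indent=2))
```

A dedicated RDF library would handle nested nodes, blank nodes and repeated properties properly; the point here is only that a flat resource description maps onto JSON fairly directly.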
The RDF is super nice for getting data from NITRC.
I was thinking the json-schema would simplify making the file that we'll e-mail/send to NITRC for adding a new resource.
With the schema we can make sure that there's a list of "required" information, and we can also make sure that each field has, a priori, the expected type of data (for example, a URL). We can also prevent people from adding any data that isn't among the properties in the schema. All fields which are in the schema but not in the "required" list would be optional :)
There's also a few tools that can take a json-schema and generate a GUI for entering and validating the data. This one for example https://github.com/json-editor/json-editor, which you can try online here (with the schema from above!):
Sorry for the delay. So the above is a bit on the implementation side of things. From the specification point of view, what content do we want returned?
As an example, it seems the tool template in the current version has the following description:
??? example "insert software name - insert short description"
- [code repository](insert GitHub or GitLab URL )
- [website](insert URL)
- [documentation](insert documentation or wiki URL)
- [contact](insert URL to mailing list, slack, forum, mattermost)
- programming language: {python}, {matlab/octave}, {C}, ...
- tags: {fMRI} {MEG} {EEG} {MRI} {nipype}
- paper
- RRID: insert_RRID_here
- tutorial:
- [URL]( insert URL )
- programming language: {python}, {matlab/octave}, {C}, ...
- level: {beginner} / {intermediate} / {advanced}
- tags: {video} {notebook}
- date:
- duration: HH:MM
- by: John Doe and Jane Doe
So if NITRC knows any of these things about a given tool or resource, that's the content you'd like returned?
Well, let me amend the above. When citing a specific tool, some information_source (i.e. NITRC) could provide standard information about the tool itself, e.g.:
- [code repository](insert GitHub or GitLab URL )
- [website](insert URL)
- [documentation](insert documentation or wiki URL)
- [contact](insert URL to mailing list, slack, forum, mattermost)
- programming language: {python}, {matlab/octave}, {C}, ...
- tags: {fMRI} {MEG} {EEG} {MRI} {nipype}
- paper
- RRID: insert_RRID_here
Then all your annotator would have to do is add the tutorial-specific stuff (or any of the 'standard' tool stuff that wasn't provided by the information_source).
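A small Python sketch of that pre-filling step: given whatever the information_source returned, render the 'standard tool' part of the template and leave the template's placeholders wherever nothing was provided. The dictionary keys, the `render_tool_block` helper and the example values are all hypothetical.

```python
# (label in the template, hypothetical key in the returned data)
LINK_FIELDS = [
    ("code repository", "code_repository"),
    ("website", "website"),
    ("documentation", "documentation"),
    ("contact", "contact"),
]


def braces(items, separator):
    """Format ['a', 'b'] as '{a}, {b}' or '{a} {b}', as in the template."""
    return separator.join("{" + item + "}" for item in items)


def render_tool_block(name, summary, info):
    """Render the tool part of the template, keeping placeholders for gaps."""
    lines = [f'??? example "{name} - {summary}"']
    body = []
    for label, key in LINK_FIELDS:
        body.append(f"- [{label}]({info.get(key, 'insert URL')})")
    languages = info.get("programming_language")
    body.append("- programming language: "
                + (braces(languages, ", ") if languages else "insert language"))
    tags = info.get("tags")
    body.append("- tags: " + (braces(tags, " ") if tags else "insert tags"))
    body.append(f"- paper: {info.get('paper', 'insert paper')}")
    body.append(f"- RRID: {info.get('rrid', 'insert_RRID_here')}")
    # The body is indented by four spaces, as collapsible blocks of this style
    # generally expect.
    lines.extend("    " + line for line in body)
    return "\n".join(lines)


# Made-up values standing in for what an information_source might return.
info = {
    "website": "https://example.org/tool",
    "programming_language": ["python", "C"],
}
block = render_tool_block("Example Tool", "does example things", info)
print(block)
```

The annotator would then only replace the remaining placeholders and add the tutorial entries by hand.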
Sorry for the delay. So the above is a bit on the implementation side of things.
The json-schema is a bit of both. It's a specification (which are the fields, which are required/optional, what type of data they should contain), but it can turn into implementation rather easily (from simply producing a valid json, to generating a UI for data input).
In that sense, your description would be missing some stuff. For example, would all those fields be required (ex. "paper" if there's no paper)? Could I add extra fields which are not predetermined (ex. "license")? What's the expected type of data (ex., "date: July '18" versus "date: 25/07/2018"). I'd imagine:
• code repo, type: URL, required
• website, type: URL, optional
• doc, type: URL, required
• contact, type: e-mail, required
• programming language, type: array of strings, required
• tags, type: array of strings, optional
• paper, type: URL, optional
• RRID, type: [a regex?], required
and no other fields should be accepted?
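The list above could be written down as a draft-07 json-schema along these lines. This is only a sketch: the property names are hypothetical camelCase versions of the bullets, and the RRID pattern is a guess at "a regex", not an official format.

```python
import json
import re

# Assumption: a loose guess at what an RRID looks like, e.g. "RRID:SCR_000000".
RRID_PATTERN = r"^RRID:[A-Za-z]+_[A-Za-z0-9_:-]+$"

tool_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Tool entry",
    "type": "object",
    "properties": {
        "codeRepository": {"type": "string", "format": "uri"},
        "website": {"type": "string", "format": "uri"},
        "documentation": {"type": "string", "format": "uri"},
        "contact": {"type": "string", "format": "email"},
        "programmingLanguage": {"type": "array", "items": {"type": "string"}},
        "tags": {"type": "array", "items": {"type": "string"}},
        "paper": {"type": "string", "format": "uri"},
        "RRID": {"type": "string", "pattern": RRID_PATTERN},
    },
    # Fields not listed here (website, tags, paper) are optional.
    "required": [
        "codeRepository", "documentation", "contact",
        "programmingLanguage", "RRID",
    ],
    # "no other fields should be accepted"
    "additionalProperties": False,
}

print(json.dumps(tool_schema, indent=2))
print(bool(re.match(RRID_PATTERN, "RRID:SCR_000000")))
```

Note that many validators treat "format" as informational unless format checking is explicitly enabled, so the URL and e-mail constraints may need that switched on to have any effect.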
Which are the use cases you have in mind? A resource in NITRC that people would annotate? A resource in the tutorials website that would be programmatically submitted to NITRC?
Ah right, my bad, we need 'use cases' before we can define specifications, before we can discuss implementation!
Use Case 1: I'm adding a useful tool to the "neuroimaging tutorials and resources" website (https://learn-neuroimaging.github.io/tutorials-and-resources/), and the tool happens to have a NITRC entry.
Example: 3D Slicer (https://www.nitrc.org/projects/slicer)
Fran wants to open a PR adding "3D Slicer" to the list of diffusion processing tools at https://learn-neuroimaging.github.io/tutorials-and-resources/42-analysis_software_MRI/#Neuroimaging-analysis-software-for-MRI. "neuroimaging tutorials and resources" has a specific form, and Fran would like that pre-populated with the details that NITRC knows, and then to update that starting point with more details.
Use Case 2: I'm contributing a tool to the "neuroimaging tutorials and resources" website (https://learn-neuroimaging.github.io/tutorials-and-resources/) and the tool is not known to NITRC.
Example: Fran's Useful Tool (FUT). www.FUT.com
Fred wants to open a PR adding "Fran's Useful Tool" to the 'Python libraries for QA' portion of https://learn-neuroimaging.github.io/tutorials-and-resources/60-quality-control/. Once the required information is provided according to the "neuroimaging tutorials and resources" template, Fred would like to contribute this information to NITRC, so that the basic information does not have to be entered multiple times by multiple individuals.