SysBioChalmers/yeast-GEM

feat: standardize and simplify curation of yeast-GEM

edkerk opened this issue · 10 comments

Description of the issue:

Currently, code/ and data/ contain various scripts and datasets in a variety of formats, which have previously been used to curate yeast-GEM. However, these are quite heterogeneous, so it is not straightforward to reuse these scripts for future model curations.

In addition, a few other issues are hurdles to contributing to the development of yeast-GEM. One of these is that model in- and output requires both MATLAB and COBRA toolboxes, which unnecessarily increases software dependencies. This also raises the risk of conflicts between model files.

These and similar issues should be addressed to make yeast-GEM more accessible, reproducible and easier to contribute to. This consists of the following steps:

  • Introduce a function (curateMetsRxnsGenes()) and a standardized table format (as *.tsv file) that can be reused for adding new metabolites/reactions/genes, or curation of existing metabolites/reactions/genes. While this does not cover all types of curations (it does not allow for deletion of model entities), it would simplify & standardize many of the curations. Note that this can also be used to e.g. add or correct MetaNetX identifiers for all reactions, or change the subSystem assignment of reactions. This function is introduced in #300.
  • Reduce software dependencies for contributing to model development. As RAVEN is relatively lightweight, has the essential functions and we have control over its codebase, we will rely only on RAVEN for generic functions (such as model in- and output, modifying biomass, etc.). Users remain free to use any other software, and specific curation scripts may still use e.g. COBRA if required, but the generic functions should depend only on RAVEN. This is introduced in #301.
  • Related to the above, the documentation surrounding model curation should be updated. The README.md is modified in #301, but CONTRIBUTING.md should be overhauled, to reflect the above changes, and to give clear examples of how to implement this.
  • Ideally, there would be scripts that convert a model version to the next, similar to what is done in Sco-GEM. Such a script would then call, for instance, curateMetsRxnsGenes and refer to the relevant data files, or could even make changes to the model directly.
  • The existing scripts and data files should be reorganized so that those that have generic use are readily available in the code and data folders, while files that have only been used once (to update one version to another) can be gathered in specific folders.
  • .... further ideas are welcome!
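As a rough illustration of the first point, the sketch below reads a tab-separated curation table and applies it to a model. It is not the curateMetsRxnsGenes() function from #300: the column names (type, id, field, value), the dict-based model and all identifiers are hypothetical and chosen only to show the idea of adding or updating entities from one standardized table, without deleting anything.

```python
import csv
import io

# Hypothetical curation table. The column names and all identifiers are
# assumptions for illustration, not the table format introduced in #300.
CURATION_TSV = (
    "type\tid\tfield\tvalue\n"
    "rxn\tr_example\tsubSystem\texample subsystem\n"
    "met\ts_example\tmetanetx\tMNXM_placeholder\n"
    "rxn\tr_new\tname\tnew example reaction\n"
)


def apply_curations(model, tsv_text):
    """Add or update entities in `model` from a tab-separated curation table.

    `model` is a plain dict: {"rxn": {id: {...}}, "met": {...}, "gene": {...}}.
    Entities that do not exist yet are created; nothing is ever deleted.
    Returns the list of (type, id) pairs that were newly added.
    """
    added = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        entities = model.setdefault(row["type"], {})
        if row["id"] not in entities:
            entities[row["id"]] = {}
            added.append((row["type"], row["id"]))
        # Overwrite (or set) the single field named in this row.
        entities[row["id"]][row["field"]] = row["value"]
    return added


model = {
    "rxn": {"r_example": {"name": "example reaction", "subSystem": ""}},
    "met": {"s_example": {"name": "example metabolite"}},
    "gene": {},
}
added = apply_curations(model, CURATION_TSV)
print(added)
```

Keeping every curation as a row in one table format means the same function can serve both for new entities and for bulk corrections such as identifier or subSystem updates.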

I hereby confirm that I have:

  • Tested my code with all requirements for running the model
  • Done this analysis in the main branch of the repository
  • Checked that a similar issue does not exist already
  • If needed, asked first in the Gitter chat room about the issue

Great to see this @edkerk.

  • Reduce software dependencies for contributing to model development.

Any thoughts on the /requirements folder?

That folder seems to refer strictly to Python/cobrapy, particularly useful for the GitHub Actions. But MATLAB-specific requirements are not part of that. The above points do not directly refer to GitHub Actions / CI, where the model is just loaded and its content tested. I tried to clarify this in the README.md in PR #301, but this can probably be improved?

Perhaps the point you raise should be rephrased as "Reduce software dependencies for contributing to MATLAB-based model development." So at least to reduce the complexity of the MATLAB-based pipeline. Contribution by using Python is also very welcome, but most of the generic functions are only available for MATLAB and I'm not sufficiently cobrapy-fluent to correct this.

This is now being implemented in PR #313, where multiple curations (incl. #305 and #306) are all documented in one script, that can change the current yeast-GEM release 8.6.0 to its next version.

A little road map:

  • First #305 and #306 need to be resolved and merged
  • Then #313 will refactor these curations (plus 19 new GPRs) into a consolidated script (draft here)
  • After review and approval of #313, it will be merged and a new release (8.6.1) will be made
  • From then onwards, curations should follow this standardized structure. This means that #307, #314 and #315 will need to be refactored at that point
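The idea of one consolidated script per release could look roughly like the sketch below. It is written in Python only for illustration (the actual script in #313 is not shown here), and the step functions, the dict-based model and the patch-level version bump are all assumptions.

```python
def bump_patch(version):
    """Return the next patch version, e.g. '8.6.0' -> '8.6.1'."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


def update_model(model, steps):
    """Apply each curation step in order, record it, then bump the version."""
    for description, step in steps:
        step(model)
        model["history"].append(description)
    model["version"] = bump_patch(model["version"])
    return model


# Hypothetical curation steps; each one edits the model in place.
def curate_example_reaction(model):
    model["rxns"]["r_example"] = {"subSystem": "example subsystem"}


def curate_example_gpr(model):
    model["gprs"]["r_example"] = "geneA and geneB"


model = {"version": "8.6.0", "rxns": {}, "gprs": {}, "history": []}
steps = [
    ("add example reaction", curate_example_reaction),
    ("add example GPR", curate_example_gpr),
]
update_model(model, steps)
print(model["version"])  # 8.6.1
```

Listing the steps in one place documents exactly which curations separate two releases and in which order they were applied.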

Once version 8.6.1 is released, it will probably become clearer how well this new approach works in practice.

@edkerk coming back to the /requirements folder, one idea could be to move it under /code. To me, it belongs there more. If it were used only by GH Actions, it could instead be moved to where the workflows are stored.

Sounds reasonable, but probably good to then move it to /code, as /code/io.py is not only for GH Actions.

requirements/ is now moved to code/requirements/ in dcf1cae

Very nice @edkerk. I think there are more places that should be updated, such as the Contributing guidelines.

I thought I had found all references, but I missed the Contributing guidelines.
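A small check along these lines might help catch such leftovers after a folder move. This is only a sketch: it scans Markdown files for requirements/ not preceded by code/, and the file contents and the requirements.txt file name in the demo are made up for illustration.

```python
import re
import tempfile
from pathlib import Path


def find_stale_references(root, old="requirements/", new_prefix="code/"):
    """Return (file name, line number) for each Markdown line that still
    mentions `old` without `new_prefix` directly in front of it."""
    pattern = re.compile("(?<!" + re.escape(new_prefix) + ")" + re.escape(old))
    hits = []
    for path in sorted(Path(root).rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if pattern.search(line):
                hits.append((path.name, number))
    return hits


# Demo on a throwaway directory with hypothetical file contents.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "README.md").write_text(
        "pip install -r code/requirements/requirements.txt\n", encoding="utf-8"
    )
    Path(tmp, "CONTRIBUTING.md").write_text(
        "pip install -r requirements/requirements.txt\n", encoding="utf-8"
    )
    hits = find_stale_references(tmp)

print(hits)  # [('CONTRIBUTING.md', 1)]
```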

With the recent deprecation of the old files in /code and /data in #345, perhaps this issue can be considered complete, and any further ideas can be raised as new issues?