The Experimental Natural Products Knowledge Graph workflow aims at integrating experimental LC-MS/MS DDA metabolomics data into a Wikidata-connected knowledge graph. To allow for iterative addition of samples over time, data from each sample is processed individually.
For each sample, the required input data are:
- A minimal metadata file containing the sample's originating taxon.
- The LC-MS/MS DDA data (positive and/or negative ionization modes).
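As an illustration of the metadata input, the sketch below reads a tab-separated table and maps each sample to its originating taxon. The column names `sample_id` and `source_taxon` and the function `read_sample_metadata` are hypothetical; the actual workflow may expect a different layout, so check the corresponding repository before preparing your file.

```python
import csv
import io

# Hypothetical column names, chosen for illustration only.
REQUIRED_COLUMNS = ("sample_id", "source_taxon")


def read_sample_metadata(handle):
    """Read a tab-separated metadata table and map each sample ID to its taxon."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing metadata column(s): {', '.join(missing)}")

    taxa = {}
    for row in reader:
        sample_id = (row["sample_id"] or "").strip()
        taxon = (row["source_taxon"] or "").strip()
        if not taxon:
            # The taxon is the one piece of metadata the workflow cannot do without.
            raise ValueError(f"sample {sample_id!r} has no taxon")
        taxa[sample_id] = taxon
    return taxa


# In-memory example; a real file would be opened with open(path, newline="").
example = io.StringIO(
    "sample_id\tsource_taxon\n"
    "sample_01\tArabidopsis thaliana\n"
    "sample_02\tMelochia umbellata\n"
)
sample_taxa = read_sample_metadata(example)
```

Failing early on a missing taxon is a deliberate choice in this sketch, since every later step (taxonomy resolution, taxonomical reweighting) depends on it.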
After MZmine processing, the workflow automatically resolves the species taxonomy against the Open Tree of Life (ottID), generates a molecular network (MN) from the fragmentation spectra, and annotates features using two methods: spectral matching against an in silico database (ISDB) coupled with taxonomical reweighting, and SIRIUS/CSI:FingerID.
Once the individual samples are processed, the Wikidata ID and NPClassifier ontology of each annotated compound are retrieved automatically. Optionally, compounds with reported activity against one or more selected biological targets in ChEMBL can also be integrated. The generated data structure is also compatible with MEMO analysis, which compares the spectral fingerprints of the samples.
Finally, all of the previously generated data is integrated into a sample-specific RDF knowledge graph (KG). The sample-specific KGs from multiple samples can be combined to compare samples based on their metadata and their spectral and structural data. The graph structure allows efficient querying with SPARQL and is fully compatible with the subsequent addition of samples.
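To see why sample-specific graphs can be combined and extended over time, it helps to remember that an RDF graph is a set of triples, so merging graphs is a set union. The sketch below models this in plain Python. All identifiers (`ex:sample_01`, `ex:has_annotation`, and so on) are invented for illustration and are not the vocabulary used by the workflow; in practice the graphs would be loaded into a triple store and the lookup written as a SPARQL query.

```python
# Each sample-specific graph is modelled as a set of
# (subject, predicate, object) triples. All identifiers are made up.
kg_sample_01 = {
    ("ex:sample_01", "ex:has_taxon", "ex:taxon_X"),
    ("ex:sample_01", "ex:has_annotation", "ex:compound_A"),
}
kg_sample_02 = {
    ("ex:sample_02", "ex:has_taxon", "ex:taxon_X"),
    ("ex:sample_02", "ex:has_annotation", "ex:compound_A"),
    ("ex:sample_02", "ex:has_annotation", "ex:compound_B"),
}


def merge_graphs(*graphs):
    """Merge any number of graphs; duplicate triples collapse automatically."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def samples_annotated_with(graph, compound):
    """Return the samples that have an annotation pointing to the given compound."""
    return sorted(
        subject
        for subject, predicate, obj in graph
        if predicate == "ex:has_annotation" and obj == compound
    )


merged_kg = merge_graphs(kg_sample_01, kg_sample_02)
shared = samples_annotated_with(merged_kg, "ex:compound_A")
```

Because the union is idempotent, a graph for a new sample can be merged in later without reprocessing the samples already present.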
The different steps are described below, with the link to the corresponding repository to perform the analysis:
You will need to have Git and Anaconda (or Miniconda) installed.
These steps need to be run only once for each sample. 🚀
Aim: Organize output from MZmine in individual folders for each sample.
Repository: https://github.com/enpkg/enpkg_data_organization
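The idea of this step can be sketched as follows: files exported for several samples are sorted into one folder per sample. This is not the code of the repository above. The function `organize_by_sample` and the `<sample_id>_<suffix>` file naming are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def organize_by_sample(source_dir, target_dir, sample_ids):
    """Move files named '<sample_id>_*' into target_dir/<sample_id>/.

    Returns a dict mapping each sample ID to the file names that were moved.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    moved = {}
    for sample_id in sample_ids:
        sample_dir = target_dir / sample_id
        # The trailing underscore keeps 'sample_1' from matching 'sample_10_...'.
        for path in sorted(source_dir.glob(f"{sample_id}_*")):
            if path.is_file():
                sample_dir.mkdir(parents=True, exist_ok=True)
                shutil.move(str(path), str(sample_dir / path.name))
                moved.setdefault(sample_id, []).append(path.name)
    return moved


# Demonstration in a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "mzmine_output"
    raw.mkdir()
    for name in (
        "sample_01_features.csv",
        "sample_01_spectra.mgf",
        "sample_02_features.csv",
    ):
        (raw / name).write_text("")
    result = organize_by_sample(raw, Path(tmp) / "organized", ["sample_01", "sample_02"])
```

Keeping one folder per sample is what lets each later step run on a sample independently of the others.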
Aim: Resolve taxonomy for each sample and link it to Wikidata.
Repository: https://github.com/enpkg/enpkg_taxo_enhancer
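Taxonomy resolution of this kind can be sketched with the Open Tree of Life name-resolution service. The code below only builds a request body and parses a response, without making a network call. The endpoint URL, the `do_approximate_matching` option, and the response layout (`results` → `matches` → `taxon` → `ott_id`) reflect the TNRS v3 API as best understood here and should be verified against the official documentation. The helper names and the `ott_id` value in the mock response are invented.

```python
import json

# Assumed endpoint of the Open Tree of Life TNRS v3 API; verify before use.
TNRS_URL = "https://api.opentreeoflife.org/v3/tnrs/match_names"


def build_tnrs_payload(names):
    """Build the JSON body for a name-matching request (exact matches only)."""
    unique = sorted({name.strip() for name in names if name.strip()})
    return json.dumps({"names": unique, "do_approximate_matching": False})


def extract_ott_ids(response):
    """Map each submitted name to the ott_id of its first match, if any."""
    ott_ids = {}
    for result in response.get("results", []):
        matches = result.get("matches", [])
        if matches:
            ott_ids[result["name"]] = matches[0]["taxon"]["ott_id"]
    return ott_ids


payload = build_tnrs_payload(["Arabidopsis thaliana", "Arabidopsis thaliana "])

# Mock response following the assumed layout; the ott_id is a placeholder.
mock_response = {
    "results": [
        {
            "name": "Arabidopsis thaliana",
            "matches": [{"taxon": {"ott_id": 123456, "name": "Arabidopsis thaliana"}}],
        },
        {"name": "Unknownus speciesus", "matches": []},
    ]
}
resolved = extract_ott_ids(mock_response)
```

Names without a match are simply left out of the result, so the caller can decide whether an unresolved taxon should stop processing for that sample.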
Aim: MN generation, ISDB and MS1 annotation coupled to taxonomical and chemical consistency reweighting on each sample.
Repository: https://github.com/enpkg/enpkg_mn_isdb_taxo
Aim: Perform SIRIUS/CSI:FingerID/CANOPUS on each sample.
Repository: https://github.com/enpkg/enpkg_sirius_canopus
Aim: Retrieve NPClassifier taxonomy and Wikidata ID for each annotated compound.
Repository: https://github.com/enpkg/enpkg_meta_analysis
Aim: Build a knowledge graph for each sample integrating the data generated above.
Repository: https://github.com/enpkg/enpkg_graph_builder
- MEMO: Generate MEMO matrix from samples' spectral data (optional)
- chembl_fetcher: download compounds with a reported activity in ChEMBL against a given target (optional)
Compounds retrieved from ChEMBL can be formatted as RDF (Step 5) and integrated into the generated KG.
Repository: https://github.com/enpkg/enpkg_meta_analysis