retention_order_prediction

Code, Data and Results of the publication: "Liquid-Chromatography Retention Order Prediction for Metabolite Identification" by Bach et al. 2018


Overview

Scripts used to run the experiments presented in the paper:

"Liquid-Chromatography Retention Order Prediction for Metabolite Identification",

Eric Bach, Sandor Szedmak, Celine Brouard, Sebastian Böcker and Juho Rousu, 2018

Summary of the results shown in the paper (the file needs to be downloaded and opened with a web browser).

Installation

No further installation is required. The scripts run out of the box if all package dependencies are satisfied. All source code in this repository is released under the MIT License.

Order prediction and evaluation code

The order predictors (e.g. RankSVM) and the evaluation scripts are implemented in Python. The code has been tested with Python 3.5 and 3.6. The following packages are required:

  • scipy >= 0.19.1
  • json >= 2.0.9
  • numpy >= 1.13.1
  • joblib >= 0.11
  • pandas >= 0.20.3
  • sklearn >= 0.19.0
  • networkx >= 2.0
  • matplotlib >= 2.1 (optional)
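A minimal sketch for checking the minimum versions listed above before running the scripts. It assumes each package exposes a `__version__` attribute, which is a common convention rather than a guarantee. The helper names `parse_version`, `version_at_least` and `check_requirement` are illustrative and not part of this repository. `json` is left out because it ships with the Python standard library.

```python
import importlib

# Minimum versions copied from the list above (matplotlib is optional).
REQUIREMENTS = {
    "scipy": "0.19.1",
    "numpy": "1.13.1",
    "joblib": "0.11",
    "pandas": "0.20.3",
    "sklearn": "0.19.0",
    "networkx": "2.0",
}


def parse_version(version):
    """Turn '0.19.1' into (0, 19, 1); stop at the first non-numeric part."""
    parts = []
    for token in version.split("."):
        if not token.isdigit():
            break
        parts.append(int(token))
    return tuple(parts)


def version_at_least(installed, minimum):
    """Compare two dotted version strings after padding them with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirement(module_name, minimum):
    """Return 'ok', 'too old', 'missing' or 'unknown' for one module."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return "missing"
    installed = getattr(module, "__version__", None)
    if installed is None:
        return "unknown"
    return "ok" if version_at_least(installed, minimum) else "too old"


if __name__ == "__main__":
    for name, minimum in REQUIREMENTS.items():
        status = check_requirement(name, minimum)
        print("{:10s} >= {:8s} {}".format(name, minimum, status))
```

The comparison only looks at the leading numeric components, so pre-release suffixes such as `rc1` are ignored.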

Data pre-processing and evaluation report creation

The data pre-processing scripts, as well as the script to reproduce the results shown in the paper, are written in R. R version 3.4 was used for development. The following packages are required:

Furthermore, the OpenBabel (>= 2.3.2) command line tool obabel must be installed, but only if the data pre-processing needs to be repeated.

Calculation of MACCS counting fingerprints

The rcdkTools package allows the computation of several counting fingerprints through the Chemistry Development Kit (CDK).

Usage

All experiments of the paper can be reproduced by using the evaluation_scenarios_main.py script with the proper parameters:

usage: evaluation_scenarios_main.py <ESTIMATOR> <SCENARIO> <SYSSET> <TSYSIDX> <PATH/TO/CONFIG.JSON> <NJOBS> <DEBUG>
  ESTIMATOR:           {'ranksvm', 'svr'}, which order predictor to use.
  SCENARIO:            {'baseline', 'baseline_single', 'baseline_single_perc', 'all_on_one', 'all_on_one_perc', 'met_ident_perf_GS_BS'}, which experiment to run.
  SYSSET:              {10, imp, 10_imp}, which set of systems to train on.
  TSYSIDX:             {-1, 0, ..., |sysset| - 1}, which target system to use for evaluation.
  PATH/TO/CONFIG.JSON: configuration file, e.g. PredRet/v2/config.json
  NJOBS:               How many jobs should run in parallel for hyper-parameter estimation?
  DEBUG:               {True, False}, should we run a smoke test.
| SCENARIO | Description | Reference in the Paper |
| baseline_single | Single system used as training and target. | Table 3, Table 4 (first two columns) |
| baseline_single_perc | Single system used as training and target. Different percentages of the data used for training. | Figure 4 (stroked lines) |
| all_on_one | All systems used for training. Single system used as target. Target system in training (LTSO): True & False. | Table 4: LTSO = True in columns 3 and 4, LTSO = False in columns 5 and 6 |
| all_on_one_perc | All systems used for training. Single system used as target. Varying percentage of target system data used for training. | Figure 4 (solid lines) |
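The positional arguments described in the usage text can be assembled and validated in a small wrapper before a run is launched. The following is a sketch only: `build_command` and `launch` are hypothetical helpers that do not exist in this repository, and the script location `src/evaluation_scenarios_main.py` is taken from the example calls in this README.

```python
import subprocess
import sys

# Allowed values copied from the usage text.
ESTIMATORS = ("ranksvm", "svr")
SCENARIOS = (
    "baseline", "baseline_single", "baseline_single_perc",
    "all_on_one", "all_on_one_perc", "met_ident_perf_GS_BS",
)
SYSSETS = ("10", "imp", "10_imp")
SCRIPT = "src/evaluation_scenarios_main.py"


def build_command(estimator, scenario, sysset, tsysidx, config, njobs, debug=False):
    """Return the argument list for one run, rejecting undocumented values."""
    if estimator not in ESTIMATORS:
        raise ValueError("unknown ESTIMATOR: {!r}".format(estimator))
    if scenario not in SCENARIOS:
        raise ValueError("unknown SCENARIO: {!r}".format(scenario))
    if str(sysset) not in SYSSETS:
        raise ValueError("unknown SYSSET: {!r}".format(sysset))
    if int(tsysidx) < -1:
        raise ValueError("TSYSIDX must be -1 or a non-negative index")
    return [
        sys.executable, SCRIPT, estimator, scenario, str(sysset),
        str(int(tsysidx)), str(config), str(int(njobs)), str(bool(debug)),
    ]


def launch(command):
    """Run the command and raise if the script exits with an error."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    cmd = build_command("ranksvm", "baseline_single", "10", -1,
                        "results/raw/PredRet/v2/config.json", 2, False)
    print(" ".join(cmd))  # pass cmd to launch() to actually start the run
```

Validating the arguments up front avoids starting a long hyper-parameter search with a misspelled scenario name.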

Example: Reproducing results shown in Table 3:

The following calls are needed:

MACCS counting fingerprints:

python src/evaluation_scenarios_main.py ranksvm baseline_single 10 -1 results/raw/PredRet/v2/config.json 2 False

The results will be stored into:

results/PredRet/v2
                └── final
                    └── ranksvm_slacktype=on_pairs
                        └── allow_overlap=True_d_lower=0_d_upper=16_ireverse=False_type=order_graph
                            └── difference
                                └── maccsCount_f2dcf0b3
                                    └── minmax
                                        └── baseline_single

MACCS binary fingerprints:

Modify the results/raw/PredRet/v2/config.json configuration file:

"molecule_representation": {
  "kernel": "minmax",
  "predictor": ["maccsCount_f2dcf0b3"],
  "feature_scaler": "noscaling",
  "poly_feature_exp": false
}

becomes

"molecule_representation": {
  "kernel": "tanimoto",
  "predictor": ["maccs"],
  "feature_scaler": "noscaling",
  "poly_feature_exp": false
}
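The same edit can be applied programmatically instead of by hand. This is a minimal sketch that assumes `molecule_representation` is a top-level key of the configuration file; the actual nesting in `config.json` may differ. The helper name `switch_to_binary_maccs` is illustrative and not part of this repository.

```python
import copy
import json


def switch_to_binary_maccs(config):
    """Return a copy of the config using binary MACCS and the Tanimoto kernel."""
    updated = copy.deepcopy(config)  # leave the caller's dictionary untouched
    representation = updated["molecule_representation"]
    representation["kernel"] = "tanimoto"
    representation["predictor"] = ["maccs"]
    return updated


def rewrite_config_file(path):
    """Load the JSON file at `path`, apply the switch and write it back."""
    with open(path) as handle:
        config = json.load(handle)
    with open(path, "w") as handle:
        json.dump(switch_to_binary_maccs(config), handle, indent=2)


if __name__ == "__main__":
    # In-memory demo using the counting-fingerprint settings shown above.
    counting = {
        "molecule_representation": {
            "kernel": "minmax",
            "predictor": ["maccsCount_f2dcf0b3"],
            "feature_scaler": "noscaling",
            "poly_feature_exp": False,
        }
    }
    print(json.dumps(switch_to_binary_maccs(counting), indent=2))
```

Calling `rewrite_config_file("results/raw/PredRet/v2/config.json")` overwrites the file in place, so keeping a backup of the counting-fingerprint configuration is advisable.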

Then run:

python src/evaluation_scenarios_main.py ranksvm baseline_single 10 -1 results/raw/PredRet/v2/config.json 2 False

The results will be stored into:

results/PredRet/v2
                └── final
                    └── ranksvm_slacktype=on_pairs
                        └── allow_overlap=True_d_lower=0_d_upper=16_ireverse=False_type=order_graph
                            └── difference
                                └── maccs
                                    └── tanimoto
                                        └── baseline_single
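The two directory trees in this example differ only in the fingerprint and kernel levels, so the output location can be composed from those parts. The sketch below copies the intermediate directory names verbatim from the trees; how the scripts derive them from the configuration is not described here, so they are treated as fixed strings. `result_dir` is a hypothetical helper, not a function of this repository.

```python
from pathlib import PurePosixPath

# Directory levels copied verbatim from the trees in this README.
BASE = PurePosixPath("results/PredRet/v2")
FIXED_LEVELS = (
    "final",
    "ranksvm_slacktype=on_pairs",
    "allow_overlap=True_d_lower=0_d_upper=16_ireverse=False_type=order_graph",
    "difference",
)


def result_dir(predictor, kernel, scenario):
    """Compose the output directory for one predictor, kernel and scenario."""
    return BASE.joinpath(*FIXED_LEVELS, predictor, kernel, scenario)


if __name__ == "__main__":
    print(result_dir("maccsCount_f2dcf0b3", "minmax", "baseline_single"))
    print(result_dir("maccs", "tanimoto", "baseline_single"))
```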

How the results can be loaded and visualized is described here.

Citation

To cite the original publication, please use:

@article{doi:10.1093/bioinformatics/bty590,
    author  = {Bach, Eric and Szedmak, Sandor and Brouard, Céline and Böcker, Sebastian and Rousu, Juho},
    title   = {Liquid-chromatography retention order prediction for metabolite identification},
    journal = {Bioinformatics},
    volume  = {34},
    number  = {17},
    pages   = {i875-i883},
    year    = {2018},
    doi     = {10.1093/bioinformatics/bty590},
    URL     = {http://dx.doi.org/10.1093/bioinformatics/bty590},
    eprint  = {/oup/backfile/content_public/journal/bioinformatics/34/17/10.1093_bioinformatics_bty590/2/bty590.pdf}
}