Metabolic Allele Classifiers

Metabolic Allele Classifiers (MACs) are flux balance analysis (FBA)-based genome classifiers that achieve prediction accuracy on par with state-of-the-art machine learning approaches while enabling a mechanistic interpretation of the genotype-phenotype map. MACs thus provide an FBA framework for modeling the datasets used in microbial genome-wide association studies (GWAS). Inference and computation of MACs use cobrascape, which wraps the COBRApy package.

Installation

$ git clone https://github.com/erolkavvas/metabolic-allele-classifiers.git  
$ cd metabolic-allele-classifiers  
$ pip install -r requirements.txt

Test run

$ cd metabolic-allele-classifiers/  
$ python 01_sample_macs.py -f input_data -s 2 -o mnc_ensemble_0 --testsize 0.9   
$ python 02_eval_macs.py -f mnc_ensemble_0

Estimating a Metabolic Allele Classifier

Run $ python 01_sample_macs.py -f INPUT_DIR -s NUM_SAMPLES -o MAC_DIR [optional parameters...]
- INPUT_DIR: See Input Data section below.
- NUM_SAMPLES: Number of samples to generate (recommend 2 for test run, but >1000 samples for meaningful results)
- MAC_DIR: Path to MAC ensemble directory.
Run $ python 02_eval_macs.py -f MAC_DIR [--testset --bicthresh]
- MAC_DIR: Path to MAC ensemble directory. Same as one used in step 1.
- testset: Set True to evaluate MACs on test set. Default is False, which evaluates MACs over training set.
- bicthresh: Threshold for selecting MACs based on the Bayesian Information Criterion (BIC). Default is 10, as recommended in the literature.

01_sample_macs.py generates MAC samples, which requires performing the computationally expensive flux variability analysis (popFVA). 02_eval_macs.py solves the MAC for the training set or the test set and saves the MAC solutions. An ensemble of sampled MACs can be built up by running steps 1 and 2 multiple times with the same output directory. Apart from NUM_SAMPLES, the input parameters must stay the same across all runs that share an output directory. Building the ensemble in several runs is useful because sampling is slow, so a single long run is likely to be interrupted at some point. See the commented arguments in the Python files for details on the optional parameters.
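The repeated runs described above can be scripted. The following is a minimal sketch, not part of the repository: the helper names build_commands and run_rounds are hypothetical, and only the flags shown above (-f, -s, -o) are used. It assumes the scripts are in the current working directory and that each round adds its samples to the same MAC_DIR.

```python
import subprocess
import sys


def build_commands(input_dir, mac_dir, num_samples):
    """Return the argv lists for one sampling step and one evaluation step."""
    sample_cmd = [sys.executable, "01_sample_macs.py",
                  "-f", input_dir, "-s", str(num_samples), "-o", mac_dir]
    eval_cmd = [sys.executable, "02_eval_macs.py", "-f", mac_dir]
    return sample_cmd, eval_cmd


def run_rounds(input_dir, mac_dir, samples_per_round, rounds):
    """Run steps 1 and 2 `rounds` times against the same MAC_DIR (hypothetical helper)."""
    for _ in range(rounds):
        for cmd in build_commands(input_dir, mac_dir, samples_per_round):
            # check=True raises CalledProcessError at the first failing step
            subprocess.run(cmd, check=True)
```

For example, run_rounds("input_data", "mnc_ensemble_0", samples_per_round=50, rounds=20) would request 1000 samples in total across 20 short runs, so an interruption costs at most one round.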

Interpreting Metabolic Allele Classifiers

Once a large number of MACs have been generated (~1000+), run through analyze_macs.ipynb to generate an Excel sheet containing a summary of the MAC results. The output file is located at ENSEMBLE_DIR/supplementary/Supplementary File XYZ.xlsx

Input Data

The following files are to be placed in INPUT_DIR. The files should be named exactly as described below.
- MODEL_FILE.json (REQUIRED). A genome-scale model in json format
- X_ALLELES_FILE.csv (REQUIRED). A strain allele matrix with shape (strains, alleles). Alleles (columns) should have the same name as the gene ids in the genome-scale model (with _num appended to specify the allele id)
- Y_PHENOTYPES_FILE.csv (REQUIRED). A strain phenotypes matrix with shape (strains, phenotypes)
- GENE_LIST_FILE.csv (OPTIONAL). A pandas series or dataframe whose index lists the genes to which the modeled alleles are limited. Fewer than 200 genes is highly recommended; otherwise, generate many more samples.
- cobra_model/gene_to_pathways.json. A JSON object mapping genes to pathways for the organism. Required for pathway enrichments.
- gene_to_name.json (OPTIONAL). A JSON object mapping gene ids to names.
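The naming rule for allele columns can be checked before sampling. The following is a minimal sketch, not part of the repository: the function check_allele_columns is hypothetical, the file names are taken literally from the list above, and it assumes that the first CSV column holds the strain ids and that the model JSON has a top-level "genes" list whose entries carry an "id" field, as in COBRApy's JSON format.

```python
import csv
import json
import os


def check_allele_columns(input_dir):
    """Return allele columns whose gene prefix is missing from the model."""
    with open(os.path.join(input_dir, "MODEL_FILE.json")) as fh:
        model = json.load(fh)
    gene_ids = {gene["id"] for gene in model["genes"]}

    with open(os.path.join(input_dir, "X_ALLELES_FILE.csv"), newline="") as fh:
        header = next(csv.reader(fh))

    unmatched = []
    for column in header[1:]:  # assumption: column 0 holds the strain ids
        # Split "GENEID_num" at the last underscore into gene id and allele number
        gene_id, sep, _allele_num = column.rpartition("_")
        if not sep or gene_id not in gene_ids:
            unmatched.append(column)
    return unmatched
```

An empty return value means every allele column maps to a gene in the model; any names returned are columns to rename or drop.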