Metabolic Allele Classifiers (MACs) are flux balance analysis (FBA)-based genome classifiers that achieve prediction accuracy on par with state-of-the-art machine learning approaches while enabling a mechanistic interpretation of the genotype-phenotype map. MACs thus provide an FBA framework for modeling datasets used in microbial genome-wide association studies (GWAS). Inference and computation of MACs use cobrascape, which wraps the COBRApy package.
$ git clone https://github.com/erolkavvas/metabolic-allele-classifiers.git
$ cd metabolic-allele-classifiers
$ pip install -r requirements.txt
$ cd metabolic-allele-classifiers/
$ python 01_sample_macs.py -f input_data -s 2 -o mnc_ensemble_0 --testsize 0.9
$ python 02_eval_macs.py -f mnc_ensemble_0
- Run
$ python 01_sample_macs.py -f INPUT_DIR -s NUM_SAMPLES -o MAC_DIR [optional parameters...]
INPUT_DIR: See Input Data section below.
NUM_SAMPLES: Number of samples to generate (2 is recommended for a test run, but >1000 samples are needed for meaningful results).
MAC_DIR: Path to MAC ensemble directory.
- Run
$ python 02_eval_macs.py -f MAC_DIR [--testset --bicthresh]
MAC_DIR: Path to MAC ensemble directory. Same as the one used in step 1.
--testset: Set to True to evaluate MACs on the test set. Default is False, which evaluates MACs on the training set.
--bicthresh: Threshold for selecting MACs based on the Bayesian Information Criterion (BIC). Default is 10, as recommended in the literature.
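As a rough illustration of how a BIC threshold can drive model selection, the sketch below computes BIC = k ln(n) - 2 ln(L) for a few hypothetical models and keeps those whose BIC lies within the threshold of the best (lowest) one. Whether 02_eval_macs.py applies --bicthresh in exactly this way is an assumption; the function names and all numbers are made up for illustration.

```python
import math


def bic(log_likelihood, num_params, num_obs):
    """Bayesian Information Criterion: k*ln(n) - 2*ln(L). Lower is better."""
    return num_params * math.log(num_obs) - 2.0 * log_likelihood


def select_by_bic(bic_by_model, bic_thresh=10.0):
    """Keep models whose BIC is within bic_thresh of the lowest BIC.

    This delta-BIC rule is an assumed convention, not taken from the scripts.
    """
    best = min(bic_by_model.values())
    return sorted(
        name for name, value in bic_by_model.items() if value - best <= bic_thresh
    )


# Hypothetical log-likelihoods for three sampled models, each with
# 5 parameters fit to 200 strains.
scores = {
    "sample_0": bic(-120.0, 5, 200),
    "sample_1": bic(-123.0, 5, 200),
    "sample_2": bic(-140.0, 5, 200),
}
selected = select_by_bic(scores)  # sample_2 is 40 BIC units worse than the best
```

With equal parameter counts, the BIC difference between two models reduces to twice the difference in log-likelihood, so only the fit quality separates the samples here.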
01_sample_macs.py generates MAC samples, which requires performing the computationally expensive flux variability analysis (popFVA). 02_eval_macs.py solves the MAC for the training set or test set and saves the MAC solutions. An ensemble of sampled MACs can be generated by running steps 1 and 2 multiple times for the same output directory. The input parameters must be consistent for the same output directory (except for NUM_SAMPLES). Building the ensemble incrementally is helpful because generating samples takes a long time, so simulations are likely to be interrupted at some point. See the commented arguments in the Python files for details on optional parameters.
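The repeated runs described above can be scripted. The following is a minimal sketch that only uses the flags shown in this README (-f, -s, -o, --testsize); the helper names build_round and run_rounds are hypothetical, and the script is assumed to be started from the repository directory.

```python
import subprocess
import sys


def build_round(input_dir, mac_dir, num_samples, testsize=None):
    """Return the two commands (sampling, then evaluation) for one round."""
    sample_cmd = [
        sys.executable, "01_sample_macs.py",
        "-f", input_dir, "-s", str(num_samples), "-o", mac_dir,
    ]
    if testsize is not None:
        # Keep this value identical across rounds for the same output directory.
        sample_cmd += ["--testsize", str(testsize)]
    eval_cmd = [sys.executable, "02_eval_macs.py", "-f", mac_dir]
    return [sample_cmd, eval_cmd]


def run_rounds(input_dir, mac_dir, num_samples, rounds, testsize=None):
    """Run steps 1 and 2 `rounds` times against the same output directory."""
    for _ in range(rounds):
        for cmd in build_round(input_dir, mac_dir, num_samples, testsize):
            subprocess.run(cmd, check=True)  # raises if a step exits with an error
```

For example, run_rounds("input_data", "mnc_ensemble_0", 100, rounds=10, testsize=0.9) would attempt ten rounds of 100 samples each. If a round is interrupted, calling it again with the same arguments continues adding to the same directory.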
Once a large number of MACs have been generated (~1000+), run through analyze_macs.ipynb to generate an Excel sheet containing a summary of MAC results. The output file is located at ENSEMBLE_DIR/supplementary/Supplementary File XYZ.xlsx
- The following files are to be placed in INPUT_DIR. The files should be named exactly as described below.
MODEL_FILE.json (REQUIRED). A genome-scale model in JSON format.
X_ALLELES_FILE.csv (REQUIRED). A strain allele matrix with shape (strains, alleles). Alleles (columns) should have the same name as the gene ids in the genome-scale model (with _num appended to specify the allele id).
Y_PHENOTYPES_FILE.csv (REQUIRED). A strain phenotypes matrix with shape (strains, phenotypes).
GENE_LIST_FILE.csv (OPTIONAL). A pandas Series or DataFrame whose index specifies the list of genes to which the modeled alleles are limited. Fewer than 200 genes is highly recommended; otherwise, sample deeply.
cobra_model/gene_to_pathways.json. JSON object describing the mapping of genes to pathways for the organism. Required for pathway enrichments.
gene_to_name.json. JSON object describing the mapping of gene ids to names. OPTIONAL.
The strain allele matrix and strain phenotypes matrix are from (https://rdcu.be/9rHj), and the COBRA model is from (https://rdcu.be/bG6JO).
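Because allele column names must match the gene ids in the model, a quick check before sampling can save a long failed run. The sketch below uses only the standard library and makes several assumptions: the model JSON has a top-level "genes" list of objects with an "id" field (as in COBRApy's JSON format), the first CSV column holds strain ids, and "_num" means an underscore followed by an integer. The function name and the toy data are hypothetical.

```python
import csv
import io
import json
import re


def unmatched_allele_columns(model_json_text, alleles_csv_text):
    """Return allele columns whose gene id (text before the final _num) is not in the model."""
    model = json.loads(model_json_text)
    gene_ids = {gene["id"] for gene in model["genes"]}
    header = next(csv.reader(io.StringIO(alleles_csv_text)))
    allele_columns = header[1:]  # assumes the first column holds strain ids
    unmatched = []
    for column in allele_columns:
        # Greedy match, so gene ids that themselves contain underscores still work.
        match = re.fullmatch(r"(.+)_(\d+)", column)
        if match is None or match.group(1) not in gene_ids:
            unmatched.append(column)
    return unmatched


# Tiny made-up model and allele matrix, for illustration only.
model_text = json.dumps({"genes": [{"id": "geneA"}, {"id": "geneB"}]})
alleles_text = "strain,geneA_1,geneA_2,geneB_1,geneX_1\ns1,1,0,1,0\n"
problems = unmatched_allele_columns(model_text, alleles_text)
```

On real inputs, the two text arguments would be read from the model JSON and allele CSV files in INPUT_DIR. An empty result means every allele column maps to a gene in the model.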