microbial_AMR_ML

Code repository for machine learning and computational analysis of large-scale microbial genomics data (https://rdcu.be/9rHj)

Primary language: Jupyter Notebook

Machine learning of microbial pan-genomes

A computational platform applied to a large-scale M. tuberculosis antimicrobial resistance (AMR) dataset, as described in:

ES. Kavvas, E. Catoiu, N. Mih, JT. Yurkovich, Y. Seif, N. Dillon, D. Heckmann, A. Anand, L. Yang, V. Nizet, JM. Monk, BO. Palsson. Machine learning and structural analysis of Mycobacterium tuberculosis pan-genome identifies genetic signatures of antibiotic resistance. Nature Communications (2018) 9:4306.


Installation

git clone https://github.com/erolkavvas/microbial_AMR_ML.git

Primary scripts

  • 01_pairwise_tests.ipynb
    • Determines pairwise associations between pan-genome alleles and labeled phenotypes.
    • Generates Supplementary Data File 1
  • 02_ML_ensemble_SVM.ipynb
    • Performs machine learning (ensemble support vector machine) for selecting groups of alleles that are predictive of the labeled phenotypes.
    • Generates Supplementary Data File 2, Supplementary Data File 3, and svm_ensemble_data
  • 03_epistatic_analysis.ipynb
    • Uses the output of 02_ML_ensemble_SVM.ipynb to select an initial set of gene-gene pairs, then fits a logistic regression model to each pair to identify statistically significant genetic interactions.
    • Generates cooccurence_table_excel, cooccurence_table_figures, and Supplementary Data File 4
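
The kind of pairwise allele-phenotype association described for 01_pairwise_tests.ipynb can be sketched as follows. This is a minimal illustration and not the notebook's code: it assumes a chi-squared test on a 2x2 contingency table (the notebook may use a different statistic), and the function names `chi2_2x2` and `pairwise_tests`, as well as the toy genomes `g1` to `g8`, are hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Chi-squared test (1 degree of freedom, no continuity correction).

    a: allele present, resistant     b: allele present, susceptible
    c: allele absent, resistant      d: allele absent, susceptible
    Returns (statistic, p_value).
    """
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        # A row or column is empty, so no association can be tested.
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / margins
    # Survival function of the chi-squared distribution with 1 dof.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def pairwise_tests(allele_rows, phenotype):
    """Test every allele against one R/S phenotype.

    allele_rows: {genome_id: {allele_name: 0 or 1}}
    phenotype:   {genome_id: "R" or "S"}
    Returns a list of (allele, chi2, p_value), most significant first.
    """
    alleles = sorted({name for row in allele_rows.values() for name in row})
    results = []
    for allele in alleles:
        counts = [[0, 0], [0, 0]]  # counts[present][resistant]
        for genome, row in allele_rows.items():
            label = phenotype.get(genome)
            if label not in ("R", "S"):
                continue  # skip genomes without a usable label
            counts[row.get(allele, 0)][int(label == "R")] += 1
        chi2, p_value = chi2_2x2(
            counts[1][1], counts[1][0], counts[0][1], counts[0][0]
        )
        results.append((allele, chi2, p_value))
    results.sort(key=lambda r: r[2])
    return results


# Toy data (made up): Cluster0_16 tracks resistance, Cluster0_17 does not.
toy_alleles = {
    "g1": {"Cluster0_16": 1, "Cluster0_17": 1},
    "g2": {"Cluster0_16": 1, "Cluster0_17": 1},
    "g3": {"Cluster0_16": 1, "Cluster0_17": 0},
    "g4": {"Cluster0_16": 1, "Cluster0_17": 0},
    "g5": {"Cluster0_16": 0, "Cluster0_17": 1},
    "g6": {"Cluster0_16": 0, "Cluster0_17": 1},
    "g7": {"Cluster0_16": 0, "Cluster0_17": 0},
    "g8": {"Cluster0_16": 0, "Cluster0_17": 0},
}
toy_phenotype = {g: ("R" if g in ("g1", "g2", "g3", "g4") else "S")
                 for g in toy_alleles}

results = pairwise_tests(toy_alleles, toy_phenotype)
for allele, chi2, p_value in results:
    print(f"{allele}\tchi2={chi2:.3f}\tp={p_value:.4g}")
```

With thousands of alleles tested against each drug, the resulting p-values would normally need a multiple-testing correction before interpretation.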

Primary data structures

The following dataframes are required inputs for the computational platform.

  • cluster_info.csv
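|  | clust_to_rv | gene_name | ortho | cog | product | refseq | count | score | name_to_rv | pan |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cluster 0 | Rv2048c | pks12 | 653045.Strvi_4160 | Q | Polyketide synthase | AN47_01827 | 1590 | 7958.6 | 0 | Core |
| Cluster 1 | Rv3344c | PE_PGRS49 | 0 | 0 | PE-PGRS family protein | X171_03503 | 794 | 0.0 | 0 | Acces |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |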
  • pangen_allele_df.csv
Genome ID ... Cluster0_16 Cluster0_17 ...
1010834_3 ... 1 ...
1010835_3 ... 1 ...
1010836_3 ... 1 ...
... ... ... ... ...
  • pangen_cluster_df.csv
| Genome ID | Cluster 0 | Cluster 1 | Cluster 2 | ... |
| --- | --- | --- | --- | --- |
| 1438838_3 | 1 | 1 | 0 | ... |
| 1408941_4 | 1 | 1 | 0 | ... |
| 1422035_3 | 1 | 0 | 0 | ... |
| ... | ... | ... | ... | ... |
  • resistance_data.csv
| genome_id | isoniazid | rifampicin | ethambutol | ... |
| --- | --- | --- | --- | --- |
| 1295764_3 | R | R | R | ... |
| 1423468_3 | R | R | S | ... |
| ... | ... | ... | ... | ... |
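
As a rough illustration of how these inputs fit together, the sketch below aligns an allele matrix with the resistance labels for one drug using pandas. It rests on several assumptions that are not confirmed by the repository: that the files are comma-separated, that blank cells in the allele matrix mean the allele is absent, and that labels are encoded only as R or S. The helper `build_xy` and the in-memory toy data are hypothetical; in practice the real CSV paths would be passed to `pd.read_csv`.

```python
import io

import pandas as pd

# Toy stand-ins for pangen_allele_df.csv and resistance_data.csv (made up).
allele_csv = io.StringIO(
    "Genome ID,Cluster0_16,Cluster0_17\n"
    "g1,1,\n"
    "g2,1,\n"
    "g3,,1\n"
    "g5,,1\n"
)
resistance_csv = io.StringIO(
    "genome_id,isoniazid,rifampicin\n"
    "g1,R,R\n"
    "g2,R,S\n"
    "g3,S,S\n"
    "g4,,S\n"
)

# Rows are genomes, columns are alleles. Blank cells are ASSUMED to mean
# "allele absent" and are filled with 0.
alleles = pd.read_csv(allele_csv, index_col="Genome ID").fillna(0).astype(int)
labels = pd.read_csv(resistance_csv, index_col="genome_id")


def build_xy(alleles, labels, drug):
    """Return (X, y) restricted to genomes that have an R/S label for `drug`."""
    y = labels[drug].map({"R": 1, "S": 0}).dropna().astype(int)
    shared = alleles.index.intersection(y.index)
    return alleles.loc[shared], y.loc[shared]


X_inh, y_inh = build_xy(alleles, labels, "isoniazid")
print(X_inh)
print(y_inh)
```

Genomes missing from either table, or lacking a label for the chosen drug, are dropped, so each drug ends up with its own subset of genomes.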

External packages of note