AMRDiscover

Analyzing Antimicrobial Resistance Genes in NCBI Sequence Read Archive

BCM hackathon, August 2024.

Antimicrobial resistance (AMR) is a growing global health concern, driven by the overuse and misuse of antibiotics. Detecting and monitoring the presence of AMR genes in various environments is crucial for understanding the spread of resistance and informing public health strategies. The Logan database of assembled contigs and unitigs, derived from a massive freeze of the NCBI Sequence Read Archive (SRA), offers a unique and comprehensive resource for studying genetic material across a wide array of samples (Chikhi et al., 2024).

In this project, we align the genes of CARD (Comprehensive Antibiotic Resistance Database) (Alcock et al., 2023) to the Logan database to identify and catalog the AMR genes present in the dataset. By working with the assembled unitigs and contigs of Logan rather than raw reads, we aim to detect AMR genes with high accuracy and sensitivity, despite the inherent complexities and challenges of working with such a large-scale dataset (Bradley et al., 2019; see section “Antibiotic resistance genes in the ENA”). This work will provide valuable insights into the distribution and prevalence of AMR genes across a vast range of environments and host organisms.

Advantages of Using Logan over Raw Read-based Approaches for AMR Gene Detection

  • Improved Contamination Control: By aligning contigs instead of raw reads, Logan minimizes contamination issues, leading to more accurate and reliable detection of AMR genes.
  • Increased Completeness: Logan offers a more complete dataset compared to previous subsets, providing a more comprehensive representation of the genetic material in the SRA.
  • Enhanced Sensitivity, Specificity, and Efficiency: Logan provides assembled sequences (unitigs or contigs) that reduce sequencing errors and improve sequence contiguity. This allows more accurate AMR gene detection, and the assembled sequences can be analyzed more efficiently than raw reads, which is particularly beneficial for large-scale AMR gene detection projects.
  • Streamlined Analysis: The organization and accessibility of the Logan database allow for more efficient and scalable bioinformatics workflows, reducing computational overhead and improving overall analysis accuracy.
  • Data Accessibility: Logan is publicly available on AWS, making it easy to access and analyze. This promotes reproducibility and facilitates collaboration among researchers.

Flowchart

[Figure: flowchart]

Software

Run the AMRDiscover pipeline with the following command:

./AMRdiscover.sh --input alignment_SAM_filenames.txt --download /path/to/download --upload /path/to/upload

Inputs

  • Alignment of contigs on CARD genes in SAM format
  • Location and collection date of samples

Results

CARD Database Analysis

To get an overview of the composition of the CARD database we performed some general analyses of the overall dataset.

Main Takeaways

  • Most organisms contribute few AMR genes to the database, while a few organisms contribute the bulk of AMR genes
  • The top 4 organisms contributing AMR genes are the "usual suspects":
    • Pseudomonas aeruginosa
    • Acinetobacter baumannii
    • Klebsiella pneumoniae
    • Escherichia coli
  • The most prevalent resistance mechanism is antibiotic inactivation, but several other mechanisms are also common

Summary of Procedures

  1. Data Preparation
    1. Download (a subset of) the unitigs/contigs from the Logan database.
    2. Obtain the CARD database containing curated sequences of known AMR genes (link, file) [Daniel to github]
    3. Download the metadata (date/location) of SRA accessions [Kristen]
    4. Parse the metadata of SRA accessions
  2. Alignment and Detection
    1. Align the sequences from the CARD database to the Logan unitigs/contigs using minimap2 (Li, 2018) and Diamond (Buchfink et al., 2015) [work in progress].
    2. Identify and annotate matches, focusing on high-confidence alignments that suggest the presence of AMR genes.
  3. Post-Processing
    1. Filter and curate the results to remove low-confidence hits (by alignment length and by alignment identity computed from the NM tag)
    2. Find literature on AMR genes in the SRA
      1. Specific biological question [Hassan]
    3. Summarize the findings in terms of the presence, distribution, and frequency of different AMR genes across the samples, including metadata.
    4. Interpret the results considering the limitations of the approach.
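
The filtering step in Post-Processing can be sketched in a few lines of Python. This is a minimal illustration, not the code in AMRdiscover.sh: the function names and the thresholds (`min_len=100`, `min_identity=0.9`) are hypothetical, and identity is assumed to be 1 - NM / (number of aligned columns), where the aligned columns are the M, I, D, = and X operations of the CIGAR string.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_stats(sam_line):
    """Return (reference_name, alignment_length, identity), or None.

    None is returned for header lines, unmapped records, and records
    without an NM tag.
    """
    if sam_line.startswith("@"):
        return None
    fields = sam_line.rstrip("\n").split("\t")
    flag, rname, cigar = int(fields[1]), fields[2], fields[5]
    if flag & 0x4 or cigar == "*":  # 0x4 = segment unmapped
        return None
    # Aligned columns: matches/mismatches (M, =, X), insertions, deletions.
    aln_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MID=X")
    nm = None
    for tag in fields[11:]:  # optional fields start at column 12
        if tag.startswith("NM:i:"):
            nm = int(tag[5:])
    if nm is None or aln_len == 0:
        return None
    return rname, aln_len, 1.0 - nm / aln_len


def keep_hit(sam_line, min_len=100, min_identity=0.9):
    """Hypothetical filter: keep long, high-identity alignments only."""
    stats = alignment_stats(sam_line)
    return stats is not None and stats[1] >= min_len and stats[2] >= min_identity
```

A record with CIGAR `90M2I8M` and `NM:i:5` has 100 aligned columns and an identity of 0.95, so it passes the example thresholds. Suitable cutoffs depend on the question being asked and should be chosen per analysis.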

Pseudomonas aeruginosa

Paeruginosa_plot

Acinetobacter baumannii

Abaumannii_plot

Klebsiella pneumoniae

Kpneumoniae_plot

Escherichia coli

Ecoli_plot

E. coli: antibiotic resistance mechanisms

E. coli tends to resist antibiotics through efflux and antibiotic target alteration.

Spatial analyses

Possible Future Directions

  1. Annotation and Visualization:
    1. Develop scripts or pipelines to annotate AMR genes in the Logan dataset.
    2. Create visualizations (e.g., heatmaps, phylogenetic trees, geographic plots) to represent the distribution of AMR genes across samples.
  2. Statistical Analysis:
    1. Perform statistical tests to compare the prevalence of AMR genes across different environments or hosts.
    2. Investigate correlations between the presence of AMR genes and metadata (e.g., sample origin, sequencing platform).
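
As one possible starting point for the statistical comparison, the presence of a single AMR gene in two environments can be compared with Fisher's exact test on a 2x2 table. The sketch below is an assumption about how such a test might be set up, not something the project has implemented: the function name is hypothetical, and the two-sided p-value sums the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are two environments; columns are samples with and without
    the gene. Returns the p-value.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x gene-positive samples in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over tables at most as likely as the observed one
    # (small tolerance for floating-point ties).
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)
```

For example, `fisher_exact_two_sided(3, 0, 0, 3)` compares an environment where all three samples carry the gene with one where none do. With many genes tested at once, the resulting p-values would also need a multiple-testing correction.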

This project will not only contribute to the understanding of AMR gene distribution but also provide participants with hands-on experience in handling large-scale genomic datasets and applying bioinformatics tools in a real-world context.

Extra information

The Logan database:

  • https://github.com/IndexThePlanet/Logan
  • The Logan database is a comprehensive collection of DNA and RNA sequences assembled from the entire NCBI Sequence Read Archive, offering an efficient and condensed representation of vast genomic data through unitigs and contigs.

The CARD database:

  • https://card.mcmaster.ca/

  • The CARD (Comprehensive Antibiotic Resistance Database) is a curated repository of sequences and associated data for known antimicrobial resistance genes, providing a critical resource for the identification and study of resistance mechanisms in various organisms.

  • For parsing CARD files, this code from this paper might be helpful.

    • aro_metadata.tsv & nucleotide_fasta_protein_homolog_model.fasta
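
The per-organism gene counts discussed under Main Takeaways could be tallied directly from the FASTA headers. The sketch below assumes that each header in nucleotide_fasta_protein_homolog_model.fasta has the form `>gb|ACCESSION|STRAND|START-END|ARO:ID|GENE [Organism]`; check this against the downloaded file before relying on it. The function names are hypothetical.

```python
from collections import Counter


def parse_card_header(header):
    """Split an assumed CARD FASTA header into its fields.

    Assumed layout: >gb|ACCESSION|STRAND|START-END|ARO:ID|GENE [Organism]
    """
    parts = header.lstrip(">").strip().split("|", 5)
    if len(parts) != 6:
        raise ValueError(f"unexpected header layout: {header!r}")
    _, accession, strand, coords, aro, rest = parts
    gene, sep, organism = rest.rpartition(" [")
    if not sep:  # no organism in brackets
        gene, organism = rest, ""
    organism = organism.strip()
    if organism.endswith("]"):
        organism = organism[:-1]
    return {
        "accession": accession,
        "strand": strand,
        "coords": coords,
        "aro": aro,
        "gene": gene.strip(),
        "organism": organism,
    }


def count_genes_per_organism(fasta_lines):
    """Count header lines per organism in an iterable of FASTA lines."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            counts[parse_card_header(line)["organism"]] += 1
    return counts
```

Passing an open file handle to `count_genes_per_organism` and calling `.most_common(4)` on the result would list the four organisms contributing the most genes.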

Alignment results
Instructions from Rayan here

References

Alcock, B. P., Huynh, W., Chalil, R., Smith, K. W., Raphenya, A. R., Wlodarski, M. A., Edalatmand, A., Petkau, A., Syed, S. A., Tsang, K. K., Baker, S. J. C., Dave, M., McCarthy, M. C., Mukiri, K. M., Nasir, J. A., Golbon, B., Imtiaz, H., Jiang, X., Kaur, K., … McArthur, A. G. (2023). CARD 2023: expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database. Nucleic Acids Research, 51(D1), D690–D699.
Bradley, P., den Bakker, H. C., Rocha, E. P. C., McVean, G., & Iqbal, Z. (2019). Ultrafast search of all deposited bacterial and viral genomic data. Nature Biotechnology, 37(2), 152–159.
Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1), 59–60.
Chikhi, R., Raffestin, B., Korobeynikov, A., Edgar, R., & Babaian, A. (2024). Logan: Planetary-Scale Genome Assembly Surveys Life’s Diversity. In bioRxiv (p. 2024.07.30.605881).
Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), 3094–3100.

Team members:

Daniel, Sina, Abohassan, Kristen, Aanuoluwa, Christian, Jen-Yu, Narges, Francesco, Rayan.