AMRFinder result analytic workflow
AMRFinder plus is a tool that takes a set of gene sequence data and returns an output table of high confidence matches to reference sequences or hidden markov models (HMM) for antimicrobial resistance and virulence determinants. This workflow documents input parameters for upstream steps to produce AMRFinder results for reproducibility as well as custom python scripts written to process output tables for downstream analysis and visualization.
-
Prepare reads for assembly
-
Assembly
-
Predict genes
-
Annotate genes with AMRFinder
-
Estimate in-situ gene coverage with CoverageMagic workflow
-
Post-processing
- filter for complete genes (00_amrfinder_filter.py)
- create binary presence/absence matrix (01_amrfinder_binary_matrix.py)
- estimate relative abundance of identified AMR genes (02_amrfinder_validate_and_summarize_RPKM.py)
-
Analysis in R
-
Produce figures in R
Since I work with human-associated metagenomes, in addition to trimming with trimmomatic, I usually attempt to remove reads aligning to reference human genomes. This can be efficiently done in one step with the Huttenhower lab tool kneaddata.
If not working with human-derived samples (e.g. isolate genomes), read quality control and trimming with trimmomatic is recommended (see assembly optimization). Kneaddata should also work fine here, it just will be unnecessarily slow/resource intensive to run the decontamination step.
# test
kneaddata --input $R1 --input $R2 --run-bmtagger --remove-intermediate-output -db $db_path --output $outdir
# for loop
for ID in `cat acc_list.txt`; do R1=${indir}/${ID}_1.fastq; R2=${indir}/${ID}_2.fastq; kneaddata --input $R1 --input $R2 --run-bmtagger --remove-intermediate-output -t 10 --reference-db $db_path -o $outdir; echo $ID complete.; done
I primarily assemble genomes and metagenomes with SPAdes.
- consider renaming scaffolds for streamlined labeling downstream (and MagicBlast compatibility)
# test
python 01a_Fasta_rename_sequences.py -i ${ID}_scaffolds.fasta -p ${ID}
# for loop
for ID in `cat acc_list.txt`; do python 01a_Fasta_rename_sequences.py -i ${ID}_scaffolds.fasta -p ${ID}; done
Gene prediction at the metagenome, metagenome-assembled genome (MAG), and genome level are performed using Prodigal or PROKKA.
########## (Prodigal example)
# test (use -p flag for metagenome mode)
prodigal -a ${ID}.faa -d ${ID}.fna -f gff -i ${scaffold} -o ${ID}.gff -p meta
# for loop
for ID in `cat acc_list.txt`; do scaffold=${indir}/${ID}/scaffolds.fasta; prodigal -a ${ID}.faa -d ${ID}.fna -f gff -i ${scaffold} -o ${ID}.gff -p meta; echo $ID complete.; done
Genes predicted in step 3 are then analyzed with AMRFinder plus.
AMRFinder installation instructions can be found here.
#test
amrfinder -p ${ID}.faa --plus -o ${outdir}/${ID}_amrfinder.tsv
#loop
for ID in `cat acc_list.txt`; do amrfinder -p ${indir}/${ID}.faa --plus -o ${outdir}/${ID}_amrfinder.tsv; echo $ID complete.; done
Genes predicted in step 3 from a metagenome/genome as well as metagenome/genome reads file and filtered tabular magicblast output file are used to estimate AMRFilter-detected gene RPKM using the CoverageMagic in situ gene coverage workflow.
1. 00_amrfinder_filter.py - this python script was written because AMRFinder produces some hits that are incomplete genes that may reduce confidence of your results. This may be fine in an exploratory analysis, but for my purposes, I prefer to filter only hits that AMRFinder classifies as "ALLELE", "EXACT", "BLASTX", "HMM", which is the default usage.
-
add_partial_end: This script allows users to also include AMRFinder hits that were partial but located at the end of a contig sequence, which could be consistent with a sequencing/assembly issue of a gene that may be complete in host. This option is flagged with the -m add_partial_end option.
-
just_amr: The -j/--just_AMR flag filters and writes a tsv file with just AMR results.
-
virulence_stress: The -v/--virulence_stress flag filters and writes a tsv file with non-AMR results.
usage: 00_amrfinder_filter.py [-h] -i -o [-m] [-j] [-v]
Filter AMRFinder Plus results for high confidence matches.
This script filters AMRFinder output tables for matches, with
default criteria focused on high quality & complete matches.
e.g. >90% identity, >90% match length.
Script options also allow filtering for just AMR determinants,
or conversely, only non-AMR results (e.g. virulence/stress).
optional arguments:
-h, --help show this help message and exit
-i , --input Please specify AMRFinder input tsv file name & path.
-o , --output Please specify AMRFinder filtered prefix & path for
output tsv.
-m , --method Please specify filtered AMRFinder output tsv file name
& path. Select from: complete -or- add_partial_end
-j, --just_AMR Flag to write tsv with just AMR results
-v, --virulence_stress
Flag to write tsv without AMR results (e.g. filter
only virulence, stress)
# Example usage
for ID in `cat ${list_pth}/IDlist.txt`; do echo $ID; python ${script_path}/00_amrfinder_filter.py -i ${indir}/${ID}_amrfinder.tsv -o ${outdir}/${ID} -m add_partial_end -j -v; done
2. 01_amrfinder_binary_matrix.py - this python script searches for .tsv files in an input directory and produces a binary presence/absence matrix for all genes across all samples coded as 0 for absent and 1 as present.
This step is useful for genomes or metagenomes if relative abundance data aren't of primary interest/available.
Note: using relative paths for writing the output tsv can throw an error. To avoid this, use absolute paths or environmental variables instead.
usage: 01_amrfinder_binary_matrix.py [-h] -i INPUT -o OUTPUT [-v]
Create summary matrix of AMRFinder Plus results for plots & analysis.
This script sumamrizes filtered tables from 00_amrfinder_filter.
optional arguments:
-h, --help show this help message and exit
-i , --input Please specify input directory path.
-o , --output Please specify output filename & path.
-v, --verbose Increase output messaging detail, print results.
# Example usage
python ${script_path}/01_amrfinder_binary_matrix.py -i ${00_filtered_path} -o ${01_output_path}/01_binary_matrix.tsv
3. 02_amrfinder_validate_and_summarize_RPKM.py - this python script performs two main tasks for validating and estimating in-situ coverage of (AR) genes of interest in metagenomes. This step is not likely to be useful for isolate genomes, could potentially be useful for isolate transcriptomics data to summarize transcript counts/differential expression.
-
First, a validation step is completed to confirm the presence of all AMRFinder-detected genes in the in situ gene coverage workflow ${unique_id}gene_RPKM.tsv output file.
-
Second, RPKM values are tallied for duplicate gene names within a sample (which is printed to STDOUT if the verbose option is selected) and then used to construct a matrix of RPKM values with rows ov unique genes by columns of metagenomes.
usage: 02_amrfinder_validate_and_summarize_RPKM.py [-h] -a -m -o [-v] [-V]
Validate and summarize RPKM of AMRFinder-detected genes
for plots & analysis.
This script takes the following inputs:
- directory containing filtered AMRFinder tsv files
(output from step 00)
- directory containing ${uniqueID}_gene_RPKM.tsv files
(https://github.com/rotheconrad/00_in-situ_GeneCoverage)
With intermediate validation steps (option -V):
- all genes input in AMRFinder tables are tested against all genes
in the coverage_magic tsv files. If there are any genes that are
not in the submitted coverage_magic tsv files, these are optionally
output as: genes_to_validate.tsv
- all duplicated gene RPKM values are summed by sample.
Input contigs/scaffolds hosting the detected genes are listed with
the summed (deduplicated) RPKM values and gene sequence name and
output as: deduplicated_RPKM.tsv
Genes that have RPKM values are returned with following output:
- specified output file & path containting a single tsv file with
length / effort normalized AMR gene abundance (RPKM)
optional arguments:
-h, --help show this help message and exit
-a , --amrfinder_tsv_path
Please specify directory path containing filtered
AMRFinder tsv files.
-m , --coverage_magic_path
Please specify directory path containing coverage
magic path.
-o , --output Please specify output file path (& optional prefix).
-v, --verbose Toggle volume of printed output.
-V, --validate Write genes_to_validate.tsv and deduplicated.tsv.
# Example usage
python ${script_path}/02_amrfinder_validate_and_summarize_RPKM.py -a ${00_filtered_path} -m ${coverage_magic_path} -o ${02_output_path}/${prefix} -v -V
For the following steps in R, I usually download the tsv files from earlier steps on my personal machine for vizualization/analysis in RStudio.
These steps walk through production of a gene RPKM lineplot and heatmap. Of note, the heatmap step is also a useful way to present presence/absence of gene binary matrix tables from step 6.2 (01_amrfinder_binary_matrix.py) if working with genomes or if gene coverage is not of primary interest.
- Data needs to be loaded in R with package dependencies (after installation).
- The MEP package is available here, and can be installed with devtools.
# load packages
library(tidyverse)
library(pheatmap)
library(viridis)
library(MEP)
# specify default color scales
scale_colour_discrete <- function(...){
scale_colour_viridis(discrete=TRUE, ...)
}
scale_fill_discrete <- function(...){
scale_fill_viridis(discrete=TRUE, ...)
}
# load file paths
RPKM_tsv_path <- "<replace string between quotes with your path>"
metadata_path <- "<replace string between quotes with your path>"
# import tsvs
RPKM_matrix <- read_delim(paste(RPKM_tsv_path, "RPKM_matrix.tsv", sep=""), "\t", escape_double = FALSE, trim_ws = TRUE)
# add matrix rownames for pheatmap
RPKM_m <- RPKM_matrix %>% select(-1)
rownames(RPKM_m) <- RPKM_matrix$X1
# load metadata
metadata <- read_delim(paste(metadata_path, "metadata.tsv", sep=""), "\t", escape_double = FALSE, trim_ws = TRUE)
#############################
# you will need separate dataframes for row & column metadata annotation
# these can be produced by subsetting data from an input metadata file
# or by importing two separate annotation files.
# pheatmap requires these annotation objects to be dataframes with
# rownames that match the names of the columns/rows as appropriate.
#############################
# convert to dataframe with rownames for pheatmap annotation
metadata_m <- metadata %>%
select(-c(1,2)) %>% # select metadata columns of interest
as.data.frame()
# change ${Sample} to variable/name that you will link metadata to by column or row.
rownames(metadata_m) <- metadata$Sample
# get sample prefixes using MEP package function
sample_prefixes <- get_prefix(colnames(RPKM_m), sep="_", HMP=TRUE)
# pivot_longer, without log-transformation
RPKM_mL <- RPKM_m %>%
rownames_to_column(var="Gene.name") %>%
pivot_longer(cols = c(-Gene.name),
names_to="Sample",
values_to="RPKM")
# pivot_longer, log-transformed
RPKM_ml <- log(RPKM_m +1) %>%
rownames_to_column(var="Gene.name") %>%
pivot_longer(cols = c(-Gene.name),
names_to="Sample",
values_to="RPKM")
- Facet-wrapped line plots of gene RPKM values by sample can be a helpful way to show distributions or explore your data.
# facet wrapped lineplots of RPKM by gene, sample, patient
RPKM_lineplot <- RPKM_mL %>% ggplot(aes(x=Gene.name, y=RPKM,
group=Sample, color=Sample)) +
geom_line() +
geom_hline(yintercept=400, linetype="dashed", alpha=0.5, color = "Red") +
facet_wrap(~Sample, ncol=9) +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
legend.position = "none"
)
RPKM_lineplot
# facet wrapped lineplots of log(RPKM) by gene, sample, patient
log_RPKM_lineplot <- RPKM_ml %>% ggplot(aes(x=Gene.name, y=RPKM,
group=Sample, color=Sample)) +
geom_line() +
facet_wrap(~Sample, ncol=9) +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
legend.position = "none"
)
log_RPKM_lineplot
Example output from re-analysis of data from a randomized trial of fecal microbiota transplantation for eradication of carbapenem-resistant (CR) bacterial colonization.
- Pheatmap is an R package that supports clustering, value scaling, and annotation of rows and columns with metadata.
# heatmaps with pheatmap
# plot RPKM heatmap
pheatmap(RPKM_m,
color = magma(5))
# plot log(RPKM) heatmap
pheatmap(log(RPKM_m + 1),
color = magma(5))
#############################
# add column annotation & turn off column clustering
# plot RPKM heatmap
pheatmap(RPKM_m,
color = magma(5),
annotation_col = metadata_m,
cluster_cols = FALSE
)
# plot log(RPKM) heatmap
pheatmap(log(RPKM_m + 1),
color = magma(5),
annotation_col = metadata_m,
cluster_cols = FALSE
)
Example log-transformed heatmap from re-analysis of data from a randomized trial of fecal microbiota transplantation for eradication of carbapenem-resistant (CR) bacterial colonization: .