viral_snake : a parallelized, system-agnostic snakemake pipeline (and singularity image) for viral calling of metagenomic data
Conclusions: Parallelizing a proven viral calling pipeline and systematizing with snakemake makes for faster, reproducible metagenomic research....
Authors: Daniel Morgan; Haobin Yao; Joshua Ho
Clone the current version and run it (see the build file and here):
ml singularity
singularity pull --arch amd64 library://dcolinmorgan/vc/viral_calling:v0.1
ml snakemake
cd <working dir>
snakemake --profile profile/ --jobs 8
This analysis was performed on the BIO-ML metagenomic time-series dataset from the 2019 Nature Medicine paper (reference below). The data are spread across several locations, and collecting and collating them was messy, so this repo includes files for downloading the data.
Poyet, M., Groussin, M., Gibbons, S.M. et al. A library of human gut bacterial isolates paired with longitudinal multiomics data enables mechanistic microbiome research. Nat Med 25, 1442–1452 (2019).
- Metadata
- Raw FASTQ data, in a format suitable for FTP download here, via this helper file
ml parallel
# download one file; the argument is a host/path entry without the ftp:// prefix
function ftpall {
  wget "ftp://$1"
}
export -f ftpall
parallel -j 20 ftpall :::: ftp_PRJNA544527.txt
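For environments without GNU parallel, the same download can be sketched in Python. This is a minimal sketch, not part of the repo: it assumes each line of `ftp_PRJNA544527.txt` is a host/path entry without the `ftp://` prefix (as the shell helper implies), and the function names (`to_ftp_url`, `parse_url_list`, `target_path`, `download_all`) are hypothetical. It uses threads rather than separate processes, which is usually adequate for network-bound downloads.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def to_ftp_url(entry):
    """Add the ftp:// prefix to a bare host/path entry (assumed list format)."""
    entry = entry.strip()
    return entry if entry.startswith("ftp://") else "ftp://" + entry


def parse_url_list(lines):
    """Turn the lines of the list file into URLs, skipping blank lines."""
    return [to_ftp_url(line) for line in lines if line.strip()]


def target_path(url, out_dir="."):
    """Local file path for a URL: the last path component, inside out_dir."""
    return Path(out_dir) / Path(urlsplit(url).path).name


def download(url, out_dir="."):
    """Fetch one file over FTP and return the local path."""
    target = target_path(url, out_dir)
    urlretrieve(url, target)
    return target


def download_all(list_file, out_dir=".", workers=20):
    """Download every entry in list_file with up to `workers` threads."""
    with open(list_file) as handle:
        urls = parse_url_list(handle)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda url: download(url, out_dir), urls))
```

Calling `download_all("ftp_PRJNA544527.txt")` would mirror the `parallel -j 20` call above; unlike `wget`, this sketch does not retry or resume interrupted transfers.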
IMPORTANT: the singularity image provides installed versions of the required packages, which can also be installed in any pyenv, as follows:
- conda: bioconda pandas megahit numpy prodigal bowtie bbmap hmmer
- conda3.6: dvf python=3.6 numpy theano=1.0.3 keras=2.2.4 scikit-learn Biopython h5py
- pip: MetaPhlAn
- github: DeepVirFinder, viralrecall
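When installing the packages by hand rather than using the image, a quick check that the command-line tools are on `PATH` can save a failed pipeline run. The sketch below is illustrative only: the executable names in `REQUIRED_TOOLS` are assumptions derived from the package list and the pipeline steps (for example, `hmmsearch` from hmmer), and `missing_tools` is a hypothetical helper, not part of the repo.

```python
import shutil

# Assumed executable names for the tools this pipeline relies on;
# adjust to match the actual install (for example, bowtie vs. bowtie2).
REQUIRED_TOOLS = [
    "megahit",
    "prodigal",
    "bowtie",
    "samtools",
    "blastp",
    "hmmsearch",
    "metaphlan",
]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Running `print(missing_tools())` lists whatever is still missing from the current environment.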
- Assemble contigs with megahit
- input: raw paired-end fastq.gz files from the ftp link ({sample}_1.fastq.gz, {sample}_2.fastq.gz)
- output: NGS assembled contigs ({sample}.fa)
- Identify bacterial contigs with MetaPhlAn 4: confirm data quality by comparison with the BIO-ML publication, and remove them from downstream analysis
- input: raw paired-end fastq.gz files from the ftp link ({sample}_1.fastq.gz, {sample}_2.fastq.gz)
- output: metaphlan table ({sample}.txt)
- Identify viral contigs with DeepVirFinder
- input: assembled contigs ({sample}.fa)
- output: viral contig predictions (final.contigs.fa_gt{params}bp_dvfpred.txt, where params is bp cutoff)
- Predict viral genes with Prodigal, run script here
- input: viral contig predictions (final.contigs.fa_gt{params}bp_dvfpred.txt, where params is bp cutoff)
- output: viral genes from predicted contigs (final_contigs.fna)
- Confirm that contigs are viral with blastp, filtering against the viral RefSeq database
- input: viral genes from predicted contigs (final_contigs.fna) + RefSeq viral database
- output: BLAST hits (viral_hits.blast), filtered down to (blast_contigs.fa)
- Use bowtie and samtools to produce a feature table per sample, then merge into a single table
- input: (blast_contigs.fa)
- output: (idxstats.txt) per sample, then merged
- Remove bacterial contigs and identify viral taxonomy with viralrecall
- input: (cat_blast_contigs.faa), after formatting
- output: {sample}/VR_out, merged to get the count table (contig_counts.tsv)
- Plot MetaPhlAn abundance
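Two of the table-handling steps in the list above, selecting viral contigs from the DeepVirFinder predictions and merging per-sample `idxstats.txt` files into one feature table, can be sketched as follows. This is a hedged illustration rather than the pipeline's own code. It assumes the DeepVirFinder output is tab-separated with a `name`, `len`, `score`, `pvalue` header, and that `samtools idxstats` emits four tab-separated columns (reference name, length, mapped reads, unmapped reads). The cutoffs `min_score=0.9` and `max_pvalue=0.05` are illustrative defaults, not values taken from this repo, and all function names are hypothetical.

```python
import csv


def filter_dvf(lines, min_score=0.9, max_pvalue=0.05):
    """Return contig names whose score and p-value pass the (assumed) cutoffs."""
    reader = csv.DictReader(lines, delimiter="\t")
    return [
        row["name"]
        for row in reader
        if float(row["score"]) >= min_score and float(row["pvalue"]) <= max_pvalue
    ]


def parse_idxstats(lines):
    """Map reference name -> mapped read count from `samtools idxstats` output."""
    counts = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[0] == "*":
            continue  # skip blank lines and the unmapped-reads summary row
        counts[fields[0]] = int(fields[2])
    return counts


def merge_counts(per_sample):
    """Combine {sample: {contig: count}} into a contig-by-sample table."""
    samples = sorted(per_sample)
    contigs = sorted({c for counts in per_sample.values() for c in counts})
    header = ["contig"] + samples
    # contigs absent from a sample get a count of 0
    rows = [[c] + [per_sample[s].get(c, 0) for s in samples] for c in contigs]
    return header, rows
```

Each function takes an iterable of lines, so an open file handle can be passed directly, for example `parse_idxstats(open("idxstats.txt"))`.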
Plotting requirements: cudf, cuml, graphistry, cu-cat, seaborn
pip install --extra-index-url=https://pypi.nvidia.com cuml-cu11 cudf-cu11 cugraph-cu11 pylibraft_cu11 raft_dask_cu11 dask_cudf_cu11 pylibcugraph_cu11
pip install -U --force git+https://github.com/graphistry/pygraphistry.git@cudf
pip install -U git+https://github.com/graphistry/cu-cat.git@DT3
nvidia-smi
Follow this Jupyter notebook for species abundance, UMAP, and time-series analysis.
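As a rough idea of the species-abundance input to such an analysis, the per-sample MetaPhlAn tables ({sample}.txt) can be reduced to species-level profiles and averaged across samples. This sketch is not taken from the notebook. It assumes the default MetaPhlAn profile layout: comment and header lines starting with `#`, the clade name in the first column, and relative abundance in the third. The function names are hypothetical.

```python
def species_abundance(lines):
    """Extract species-level relative abundances from one MetaPhlAn profile."""
    profile = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        clade, rel_abundance = fields[0], float(fields[2])
        leaf = clade.split("|")[-1]
        if leaf.startswith("s__"):  # keep rows whose deepest rank is species
            profile[leaf[3:]] = rel_abundance
    return profile


def mean_abundance(per_sample):
    """Average each species over all samples, counting absences as zero."""
    totals = {}
    for profile in per_sample.values():
        for species, value in profile.items():
            totals[species] = totals.get(species, 0.0) + value
    n_samples = len(per_sample)
    return {species: total / n_samples for species, total in totals.items()}
```

Sorting the result of `mean_abundance` by value gives a simple ranking of the most abundant species to plot.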
Among other things, these checks are performed in the notebook:
Workflow figure from the previous manuscript (Nature Communications paper):
Figure 1. The workflow of the raw data processing, viral contigs identification, and viral taxonomy annotation. The main workflow of this study was composed of three parts: raw data preprocessing, viral contigs identification, and viral taxonomy annotation. Viral contigs identification involves viral contigs identification and 2 rounds bacterial genome removal. The viral taxonomy annotation also includes taxonomy annotation and NCLDV contigs verification. Wang, L., Yao, H., Morgan, D.C. et al. Altered human gut virome in patients undergoing antibiotics therapy for Helicobacter pylori. Nat Commun 14, 2196 (2023). https://doi.org/10.1038/s41467-023-37975-y