TYMEFLIES_Viral

This repository stores the scripts for the "TYMEFLIES_Viral" project: a study of the viral population based on 20 years of time-series metagenome data from Lake Mendota, Madison, WI, USA (the metagenomes were obtained from lake water sampled from the pelagic integrated epilimnion zone).

Scripts (including some of the inputs/outputs) are placed in the following folders:

1 Process the datasets: Copy fastq files, calculate fastq statistics, and get the cov state (or depth) files for all metagenome assemblies

2 Identify phages and find active prophages: Identify phages with VIBRANT, find active prophages with PropagAtE, and run CheckV to assess phage scaffold quality

(This part relies mainly on the software VIBRANT, PropagAtE, and CheckV)

3 Reconstruct vMAGs: Reconstruct vMAGs using vRhyme, get the best set of phage bins using stringent criteria, run CheckV to assess vMAG quality, and summarize AMGs for all metagenomes

(This part relies mainly on the software vRhyme; we also wrote a custom script to get the best set of phage bins)

4 Cluster phage genomes: Cluster vMAGs into vOTUs at family-, genus-, and species-level

[The original phage vOTU clustering methods were adopted from two previously published papers: 1) Nat Microbiol. 2021 Jul;6(7):960-970. 2) Nucleic Acids Res. 2021 Jan 8;49(D1):D764-D775. However, because of the large number of viral genomes in this study (~1.3 million), we first clustered the genomes into family- and genus-level vOTUs using an MCL-based method (we also modified the original Python script used in that method to reduce the RAM demand for our case), and then used dRep to obtain species-level vOTUs within each genus.]

5 Taxonomic_classification: Classify phage genomes using two methods: NCBI RefSeq viral protein searching and VOG HMM marker searching

(This part is mainly based on the method in Nucleic Acids Res. 2021 Jan 8;49(D1):D764-D775.)

6 Host prediction: Predict hosts using three approaches: 1) iPHoP-based prediction; 2) prophage scaffold search; 3) match to AMG (auxiliary metabolic gene)

7 Rscript for visualization: R scripts for a variety of visualizations

8 Mapping metagenomic assemblies: Map reads to metagenomic assemblies (the original scaffolds including both microbial and viral ones) to get MAG/virus abundance

9 Time series analysis - Part 1: Conduct AMG ratio and viral genome coverage analysis

10 Time series analysis - Part 2: Get four important AMG-containing viral genome coverage statistics

11 Time series analysis - Part 3: Get microdiversity analysis results

12 Time series analysis - Part 4: Conduct virus and MAG taxa association analysis

13 Metatranscriptome analysis: Conduct metatranscriptome analysis using different mapping references to examine gene expression patterns

14 [Miscellaneous scripts](https://github.com/AnantharamanLab/TYMEFLIES_Viral/tree/main/Miscellaneous%20scripts): Contains various auxiliary scripts used throughout the project

15 Environmental parameter: Contains the organized tables, the original dataset sources, and the scripts that parse the original datasets
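
The two-stage clustering described under folder 4 (genus-level clusters first, then species-level vOTUs within each genus) can be illustrated with a small sketch. This is not the code used in this repository: the input mapping, the function names, and the genome/genus labels are all hypothetical, and the pair count is only a rough proxy for workload, assuming species-level clustering scales with the number of genome pairs compared.

```python
from collections import defaultdict


def group_genomes_by_genus(genus_of):
    """Group genome IDs by genus-level cluster label.

    genus_of: dict of genome ID -> genus label (hypothetical input,
    e.g. parsed from the result of the first clustering stage).
    """
    groups = defaultdict(list)
    for genome, genus in genus_of.items():
        groups[genus].append(genome)
    # Sort labels and members so the output is deterministic.
    return {genus: sorted(members) for genus, members in sorted(groups.items())}


def plan_species_level_jobs(groups):
    """Separate single-member genera from genera that need a second stage.

    Returns (singletons, jobs). Each entry in jobs would be passed to a
    species-level tool (dRep in this project) as its own small run.
    """
    singletons = [members[0] for members in groups.values() if len(members) == 1]
    jobs = {genus: members for genus, members in groups.items() if len(members) > 1}
    return singletons, jobs


def pairwise_comparisons(n):
    """Number of unordered genome pairs among n genomes."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Toy genus assignments (made up for illustration only).
    genus_of = {
        "vMAG_1": "G1", "vMAG_2": "G1", "vMAG_3": "G2",
        "vMAG_4": "G1", "vMAG_5": "G3", "vMAG_6": "G3",
    }
    groups = group_genomes_by_genus(genus_of)
    singletons, jobs = plan_species_level_jobs(groups)

    all_pairs = pairwise_comparisons(len(genus_of))
    within_pairs = sum(pairwise_comparisons(len(m)) for m in jobs.values())
    print("genera needing a species-level run:", sorted(jobs))
    print("single-member genera:", singletons)
    print("pairs if compared all-vs-all:", all_pairs)
    print("pairs if compared within genus only:", within_pairs)
```

In this toy example, restricting comparisons to genomes that share a genus reduces the number of pairs from 15 to 4; with ~1.3 million genomes the same idea keeps each species-level run small.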

Database processing scripts are placed in the following folders:

1 Database IMGVR: IMG/VR database v4.1, Dec. 2022 release (for Cluster phage genomes)

2 Database NCBI RefSeq viral: NCBI RefSeq viral, 2023-01-13 release (for Taxonomic classification)

3 Database VOG97: VOG97 HMMs, released Apr 19, 2021 (for Taxonomic classification)

4 Database TYMEFLIES MAGs: MAGs on the IMG platform (for Host prediction)