VirTect is a computational tool for detecting viruses in RNA-Seq data from human samples.
VirTect is an efficient software tool for virus detection. It takes NGS data in FASTQ format as input and maps the reads to the human reference genome using tophat. After separating the non-human reads from the human reads, VirTect uses the bwa mem command to align the non-human reads to our database of 757 different viruses. It then applies filters to discriminate true viral sequences from noise and artifacts, and finally reports the viruses that pass.
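The flow described above can be sketched as a list of external commands. This is only an illustration under stated assumptions, not the code in VirTect.py: the helper name build_pipeline_commands and the intermediate file names are hypothetical, and the samtools and bedtools steps are one common way to turn TopHat's unmapped reads back into FASTQ.

```python
from pathlib import Path


def build_pipeline_commands(fq1, fq2, out_dir, gtf, human_index, virus_fasta, threads=8):
    """Return the commands for a TopHat -> unmapped reads -> bwa mem flow.

    Illustrative only: this helper and the intermediate file names are
    hypothetical and are not taken from VirTect.py.
    """
    out = Path(out_dir)
    unmapped_bam = out / "unmapped.bam"        # written by TopHat in its output directory
    sorted_bam = out / "unmapped_sorted.bam"   # hypothetical intermediate name
    nh1 = out / "nonhuman_1.fq"                # hypothetical intermediate name
    nh2 = out / "nonhuman_2.fq"                # hypothetical intermediate name
    return [
        # 1. Align all reads to the human reference.
        ["tophat", "-o", str(out), "-p", str(threads), "-G", str(gtf),
         str(human_index), str(fq1), str(fq2)],
        # 2. Name-sort the unmapped reads so that mates sit next to each other.
        ["samtools", "sort", "-n", "-o", str(sorted_bam), str(unmapped_bam)],
        # 3. Convert the unmapped (non-human) reads back to paired FASTQ.
        ["bedtools", "bamtofastq", "-i", str(sorted_bam),
         "-fq", str(nh1), "-fq2", str(nh2)],
        # 4. Align the non-human reads to the virus genomes (SAM goes to stdout).
        ["bwa", "mem", "-t", str(threads), str(virus_fasta), str(nh1), str(nh2)],
    ]


if __name__ == "__main__":
    for cmd in build_pipeline_commands(
        "Reads_1.fq", "Reads_2.fq", "Test",
        "human_reference/gencode.v25.chr_patch_hapl_scaff.annotation.gtf",
        "human_reference/GRCh38.p12.genome",
        "viruses_reference/viruses_757.fasta",
    ):
        print(" ".join(cmd))
```

The function only builds the commands, so they can be inspected before anything is executed.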
Here is an example of how VirTect works. For the HCC sample, we have about 53 million paired reads. VirTect mapped about 51 million of them (about 96.7%) to the human reference and set aside the remaining roughly 2 million non-human reads. VirTect then mapped the non-human reads to the virus genomes. Before filtering, thousands of reads mapped to different viruses in our database, such as tick-borne encephalitis, hepatitis C, cutthroat trout virus, and hepatitis B; however, only hepatitis B passed VirTect's filters. We also examined some of the viruses that did not pass the filters even though a significant number of non-human reads mapped to them. We found that these reads were not real viral sequence; instead, they mapped to the poly(A) sequence of hepatitis C genotype 1.
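To see how many reads land on each virus before any filtering, as in the example above, one can tally mapped reads per reference name in the SAM output of the virus alignment. The following is a minimal sketch, not part of VirTect; the function name and the reference names in the usage example are hypothetical.

```python
from collections import Counter


def count_reads_per_reference(sam_lines):
    """Count primary mapped alignments per reference name in SAM-formatted lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag, ref = int(fields[1]), fields[2]
        if flag & 0x4 or ref == "*":
            continue  # read is unmapped
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        counts[ref] += 1
    return counts


if __name__ == "__main__":
    example = [
        "@SQ\tSN:virus_A\tLN:3000",
        "r1\t0\tvirus_A\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    for ref, n in count_reads_per_reference(example).most_common():
        print(ref, n)
```

A high raw count alone does not mean the virus is present, which is why the filtering step matters.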
This is a GitHub repository for the documentation of the VirTect software, described in the paper listed below. If you like this repository, please click on the "Star" button on top of this page, to show appreciation to the repository maintainer. If you want to receive notifications on changes to this repository, please click the "Watch" button on top of this page.
First we need to install the following publicly available tools to run VirTect:
cutadapt (http://cutadapt.readthedocs.io/en/stable/guide.html)
tophat (https://ccb.jhu.edu/software/tophat/index.shtml)
bwa (http://bio-bwa.sourceforge.net/bwa.shtml)
bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
samtools (http://samtools.sourceforge.net/)
bedtools (http://bedtools.readthedocs.io/en/latest/)
FASTQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
IGV (http://software.broadinstitute.org/software/igv/)
DAVID (https://david.ncifcrf.gov/)
TCGA (https://cancergenome.nih.gov/)
Human Papillomavirus (HPV) (https://pave.niaid.nih.gov/#home)
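Before running anything, it can help to confirm that the command-line tools from the list above are on your PATH. This is a small sketch, not part of VirTect; it assumes the usual executable names (cutadapt, tophat, bwa, bowtie2, samtools, bedtools, fastqc), which may differ on your system.

```python
import shutil

# Assumed executable names for the command-line tools listed above.
REQUIRED_TOOLS = ["cutadapt", "tophat", "bwa", "bowtie2", "samtools", "bedtools", "fastqc"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```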
Please clone the repository onto your computer:
git clone https://github.com/WGLab/VirTect
Then enter VirTect directory:
cd VirTect
Before using VirTect, we highly recommend trimming the data to ensure read quality. Use the following commands to trim the data; if your data are already trimmed, you can skip this step.
python VerTect_cutadapt.py --help
python VerTect_cutadapt.py -1 Reads_1.fq -2 Reads_2.fq -F "AGATCGGAAGAG" -R "AGATCGGAAGAG"
Here -F and -R are the forward and reverse adapters; change them if your adapters differ from those above.
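For reference, a direct paired-end cutadapt call with the same adapters might look like the sketch below. How the wrapper script calls cutadapt internally is not documented here, so treat this as an assumption: it uses cutadapt's standard -a/-A (3' adapter for read 1 / read 2) and -o/-p (output for read 1 / read 2) options, and the helper name and output file names are hypothetical.

```python
def build_cutadapt_command(fq1, fq2, forward_adapter, reverse_adapter,
                           out1="trimmed_1.fq", out2="trimmed_2.fq"):
    """Build a paired-end cutadapt command as an argument list.

    Assumes 3' adapters; out1 and out2 are hypothetical output names.
    """
    return [
        "cutadapt",
        "-a", forward_adapter,   # adapter to remove from read 1
        "-A", reverse_adapter,   # adapter to remove from read 2
        "-o", out1,              # trimmed read 1
        "-p", out2,              # trimmed read 2
        fq1, fq2,
    ]


if __name__ == "__main__":
    cmd = build_cutadapt_command("Reads_1.fq", "Reads_2.fq",
                                 "AGATCGGAAGAG", "AGATCGGAAGAG")
    print(" ".join(cmd))
```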
Next, download the human FASTA file if you don't already have it. You only need to run this step once, to download the FASTA file and generate its index.
python download_fasta_index_v0.0.2 --help
usage: download_fasta_index_v0.0.2 [-h] -buildver The genome version
[-index created index file]
Downloaded the fasta file and generate the index file
optional arguments:
-h, --help show this help message and exit
-buildver The genome version, --hg The genome version
The version of the genome
-index created index file, --index created index file
Create index file
python download_fasta_index_v02.py -buildver hg38/hg19 -index NO
python download_fasta_index_v02.py -buildver hg38/hg19 -index YES
This downloads the FASTA and GENCODE GTF files, generates the index, and saves everything in the human_reference directory. It only needs to be run the first time. The hg38 or hg19 FASTA file is downloaded according to the -buildver value; for example, to download the hg38 FASTA file the command is: python download_fasta_index_v02.py -buildver hg38, and likewise for hg19. We also added an extra flag for cases where you only need to download the FASTA or GTF file.
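To check whether the human index already exists before re-running the download step, a sketch like the following could be used. It assumes the index is a Bowtie2 index (the default for TopHat 2), whose files are named prefix.1.bt2 through prefix.rev.2.bt2, or with the .bt2l extension for large indexes. The function name is hypothetical, and the index files actually produced by the download script may differ.

```python
from pathlib import Path

# File name parts of a Bowtie2 index (assumed index type).
BT2_PARTS = ["1", "2", "3", "4", "rev.1", "rev.2"]


def missing_bowtie2_index_files(prefix):
    """Return the missing index files for `prefix`, or [] if an index is complete.

    Both the small (.bt2) and large (.bt2l) layouts are checked; when neither
    is complete, the layout with fewer missing files is reported.
    """
    prefix = str(prefix)
    best = None
    for ext in ("bt2", "bt2l"):
        names = [f"{prefix}.{part}.{ext}" for part in BT2_PARTS]
        missing = [name for name in names if not Path(name).is_file()]
        if not missing:
            return []
        if best is None or len(missing) < len(best):
            best = missing
    return best


if __name__ == "__main__":
    missing = missing_bowtie2_index_files("human_reference/GRCh38.p12.genome")
    print("Index complete" if not missing else "Missing: " + ", ".join(missing))
```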
We have already downloaded and indexed each of the viruses in our virus database. The files are saved in the viruses_reference directory and can be used directly from there.
Finally, run VirTect to detect viruses in human RNA-seq samples.
-h, --help show this help message and exit
--version show program's version number and exit
-t Number of threads, default: 8, --n_thread Number of threads, default: 8 Number of threads
-1 read1.fastq, --fq1 read1.fastq The read 1 of the paired end RNA-seq
-2 read2.fastq, --fq2 read2.fastq The read 2 of the paired end RNA-seq
-o The output name for alignement, --out The output name for alignement Define the output directory to be stored the alignement results
-ucsc_gene gtf, --gtf gtf The input gtf file
-index index files, --index_dir index files The directory of index files with hg38 prefix of the fasta file i.e,. index_files_directory/hg38
-index_vir virus fasta, --index_vir virus fasta The fasta file of the virus genomes
-d continuous_distance, --distance continuous_distance Define the continuous mapping distance of mapping reads to virus genome
python VirTect.py --help
python VirTect.py -t 12 -1 Reads_1.fq -2 Reads_2.fq -o Test -ucsc_gene human_reference/gencode.v25.chr_patch_hapl_scaff.annotation.gtf -index human_reference/GRCh38.p12.genome -index_vir viruses_reference/viruses_757.fasta -d 200
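The invocation above can be wrapped to process several samples one after another. The sketch below only builds the command lines from the options documented above; the helper name, the sample names, and the FASTQ file names are hypothetical, and the actual execution is left commented out.

```python
def build_virtect_command(sample, fq1, fq2, gtf, index_prefix, virus_fasta,
                          distance, threads=8):
    """Build a VirTect.py command using the options listed in its help text."""
    return [
        "python", "VirTect.py",
        "-t", str(threads),
        "-1", fq1,
        "-2", fq2,
        "-o", sample,
        "-ucsc_gene", gtf,
        "-index", index_prefix,
        "-index_vir", virus_fasta,
        "-d", str(distance),
    ]


if __name__ == "__main__":
    # Hypothetical sample sheet: output name -> (read 1, read 2).
    samples = {
        "SampleA": ("SampleA_1.fq", "SampleA_2.fq"),
        "SampleB": ("SampleB_1.fq", "SampleB_2.fq"),
    }
    for name, (fq1, fq2) in samples.items():
        cmd = build_virtect_command(
            name, fq1, fq2,
            "human_reference/gencode.v25.chr_patch_hapl_scaff.annotation.gtf",
            "human_reference/GRCh38.p12.genome",
            "viruses_reference/viruses_757.fasta",
            distance=200, threads=12,
        )
        print(" ".join(cmd))
        # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```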
After running VirTect, the final virus results are written to the file Final_continous_region.txt, if the sample contains any viruses. The continuous mapping distance on the virus genome (-d) depends on the read length of the input data. Please follow VirTect for updated versions, since we are working on parallelizing it to process multiple samples at the same time.
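One way to picture a continuous mapping distance is as the longest stretch of the virus genome covered by overlapping or adjacent reads, compared against the -d threshold. The sketch below illustrates that idea only; it is an assumption about what the filter measures, not the logic implemented in VirTect.py, and both function names are hypothetical.

```python
def longest_continuous_region(intervals):
    """Length of the longest region covered without gaps.

    `intervals` holds (start, end) half-open alignment positions on a single
    virus genome. Overlapping or directly adjacent intervals are merged.
    """
    longest = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            cur_start, cur_end = start, end   # gap: start a new region
        else:
            cur_end = max(cur_end, end)       # overlap or adjacency: extend
        longest = max(longest, cur_end - cur_start)
    return longest


def passes_distance_filter(intervals, distance):
    """True if some continuously covered region is at least `distance` long."""
    return longest_continuous_region(intervals) >= distance


if __name__ == "__main__":
    tiled = [(0, 100), (50, 180), (170, 260)]          # reads tiling one region
    scattered = [(0, 100), (500, 600), (1000, 1100)]   # isolated pile-ups
    print(passes_distance_filter(tiled, 200))
    print(passes_distance_filter(scattered, 200))
```

Under this reading, reads that pile up on a short repetitive stretch, such as a poly(A) tract, would not form a long continuous region even when there are many of them.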
If you have already generated the index, you only need the following command to run VirTect:
bash Run_test_VirTect.sh
After detecting viruses in the samples, we may want to know which genes/transcripts of a specific virus are expressed, which requires a virus expression count. Run the following code to generate the count file. Currently we only have annotations for some HPV types.
python VerTect_count_expression.py --help
We are still working on providing annotations for each of the viruses in our virus database.
By using the software, you acknowledge that you agree to the terms below:
For academic and non-profit use, you are free to fork, download, modify, distribute and use the software without restriction.
Atlas Khan (ak4046@cumc.columbia.edu)
Kai Wang (kaichop@gmail.com)
Khan A, Liu Q, Chen X, Stucky A, Sedghizadeh PP, Adelpour D, Zhang X, Wang K, Zhong JF. Detection of human papillomavirus in cases of head and neck squamous cell carcinoma by RNA-seq and VirTect. Mol Oncol. 2018 Dec 30. doi: 10.1002/1878-0261.12435.
Wang Genomics Lab Homepage (http://wglab.org/)