ERVcaller is a tool designed to accurately detect and genotype non-reference unfixed endogenous retroviruses (ERVs) and other transposable elements (TEs) in the human genome using next-generation sequencing (NGS) data. We evaluated the tool using both simulated and benchmark whole-genome sequencing (WGS) datasets. ERVcaller can accurately detect TE insertions of any length, particularly ERVs. It can be applied to both paired-end and single-end WGS, whole-exome sequencing (WES), or targeted DNA sequencing data. It accepts FASTQ or BAM file(s) generated by different aligners (only BWA and Bowtie2 were tested). In addition, ERVcaller can detect insertion breakpoints at single-nucleotide resolution. As the TE reference, it accepts either consensus TE sequences or a TE library containing abundant TE sequences, such as the entire Repbase database. Thus, ERVcaller can be used to detect insertions from highly mutated or novel TE sequences. It is easy to install and is run from the command line. Complementary to ERVcaller, other bioinformatics tools designed to detect large deletions may be used to detect TEs that are present in the human reference genome but absent from the samples being tested.
$ tar vxzf ERVcaller_v.1.4.tar.gz
Users need to successfully install the following software separately and make them available in the default search path (such as by using the Linux command “export” or adding them to your .bashrc).
• BWA-0.7.10: http://bio-bwa.sourceforge.net/bwa.shtml
• Samtools-1.6 (or later than 1.2): http://www.htslib.org/doc/samtools.html
• R-3.3.2 (or higher): https://www.r-project.org/
• SE_MEI (Modified version included in the Scripts folder of the ERVcaller installer)
Human reference genome (hg38 by default). If BAM file(s) are used as input, use the same reference build that was used for the alignment.
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
$ gunzip hg38.fa.gz
$ bwa index hg38.fa
TE reference genome. Use either a TE reference provided with the ERVcaller installer (the TE consensus sequences, consisting of one consensus sequence each for Alu, LINE1, SVA, and HERV-K; the human TE library containing 23 TE sequences; or the ERV library extracted from the Repbase database) or a user-defined TE reference library.
$ cd user_installed_full_path/Database/
$ bwa index TE_consensus.fa
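Before running ERVcaller, it can help to confirm that both references were actually indexed. The sketch below assumes BWA 0.7.x, whose "bwa index" writes five companion files next to the FASTA file (.amb, .ann, .bwt, .pac, and .sa). The function name has_bwa_index is hypothetical and is not part of ERVcaller.

```shell
# has_bwa_index FASTA
# Succeeds only if all five BWA index files exist next to FASTA and are non-empty.
# (Hypothetical helper; the list of extensions assumes BWA 0.7.x.)
has_bwa_index() {
    local fasta="$1"
    local ext
    for ext in amb ann bwt pac sa; do
        if [ ! -s "${fasta}.${ext}" ]; then
            echo "missing index file: ${fasta}.${ext}" >&2
            return 1
        fi
    done
    return 0
}
```

For example, "has_bwa_index hg38.fa || bwa index hg38.fa" would rebuild the index only when a file is missing.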
$ export PATH=$PATH:$HOME/bwa-master/
$ export PATH=$PATH:$HOME/samtools-1.6/
$ export PATH=$PATH:$HOME/SE-MEI/
$ export PATH=$PATH:$HOME/R/
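After setting the search path, a quick check can confirm that the shell finds each dependency. This is a minimal sketch using the standard "command -v" lookup. The function name check_tools is hypothetical, and the executable names you pass to it (for example bwa, samtools, and R) are assumptions about how the tools are installed on your system.

```shell
# check_tools TOOL...
# Prints where each command was found and reports any that are missing.
# Returns 0 only if every command is available on the search path.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "MISSING: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, "check_tools bwa samtools R perl" lists the resolved location of each tool and returns a non-zero status if any of them is not found.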
$ perl user_installed_full_path/ERVcaller_v.1.4.pl
$ perl user_installed_path/ERVcaller_v.1.4.pl -i sample_ID -f .bam -H hg38.fa -T TE_consensus.fa -S 20 -BWA_MEM -t No._threads
$ perl user_installed_path/ERVcaller_v.1.4.pl -i TE_seq -f .bam -H hg38.fa -T TE_consensus.fa -I folder_of_input_data -O folder_for_output_files -t 12 -S 20 -BWA_MEM
$ perl user_installed_path/ERVcaller_v.1.4.pl -i TE_seq -f .fq.gz -H hg38.fa -T TE_consensus.fa -I folder_of_input_data -O folder_for_output_files -t 12 -S 20 -BWA_MEM
$ perl user_installed_path/ERVcaller_v.1.4.pl -i TE_seq -f .list -H hg38.fa -T TE_consensus.fa -I folder_of_input_data -O folder_for_output_files -t 12 -S 20 -BWA_MEM -m
$ perl user_installed_path/ERVcaller_v.1.4.pl -i TE_seq -f .bam -H hg38.fa -T TE_consensus.fa -I folder_of_input_data -O folder_for_output_files -t 12 -S 20 -BWA_MEM -G
ERVcaller writes its results to a VCF file (VCFv4.2). The header annotations and an example record are shown below:
##fileformat=VCFv4.2
##fileDate=2019121
##source=ERVcaller_v.1.4
##reference=file:hg38.fa
##ALT=<ID=INS:MEI:HERVK,Description="HERVK insertion">
##INFO=<ID=TSD,Number=2,Type=String,Description="NUCLEOTIDE,LEN, Nucleotides and length of the Target Site Duplication (NULL for unknown)">
##INFO=<ID=INFOR,Number=6,Type=String,Description="NAME,START,END,LEN,DIRECTION,STATUS; NULL for unknown values. Status of detected TE: 0 = Inconsistent direction for the supporting reads; 1 = One breakpoint detected by only chimeric and/or improper reads without split reads; 2 = Only one breakpoint is detected and covered by split reads; 3 = Two breakpoints detected, and both of them are not covered by split reads; 4 = Two breakpoints detected, and one of them are not covered by split reads; 5 = Two breakpoints detected, and both of them are covered by split reads;">
##INFO=<ID=CR,Number=1,Type=Integer,Description="Number of chimeric and improper reads support the TE insertion">
##INFO=<ID=SR,Number=1,Type=String,Description="Number of split reads support TE insertion and the breakpoint">
##INFO=<ID=GTF,Number=1,Type=String,Description="If the detected TE insertions genotyped">
##INFO=<ID=GR,Number=1,Type=Float,Description="The ratio of the number of reads support TE insertions versus the total number of reads at this TE insertion location">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype quality (Phred transformed)">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype likelihood">
##FORMAT=<ID=DPI,Number=1,Type=Integer,Description="The number of reads support TE insertions">
##FORMAT=<ID=DPN,Number=1,Type=Integer,Description="The number of reads support non-TE insertions">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TE_seq
chr1 5617379 . T <INS_MEI:HERV> . . TSD=NULL,NULL;INFOR=HERVK,1,7831,7831,+,4;CR=64;SR=3;GTF=YES;GR=1.000 GT:GQ:GL:DPN:DPI 1/1:40:0,0,1:0:67
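The example record above can be unpacked with standard shell tools. The following is a minimal sketch, not part of ERVcaller: the function names info_value, sample_value, and infor_status are hypothetical, and the status labels are shortened paraphrases of the INFOR description in the header.

```shell
# info_value KEY INFO: print the value of KEY from a ';'-separated INFO column.
info_value() {
    printf '%s\n' "$2" | tr ';' '\n' | awk -F= -v k="$1" '$1 == k { print $2 }'
}

# sample_value KEY FORMAT SAMPLE: print the sample value whose FORMAT key is KEY.
sample_value() {
    awk -v k="$1" -v f="$2" -v s="$3" 'BEGIN {
        n = split(f, keys, ":"); split(s, vals, ":")
        for (i = 1; i <= n; i++) if (keys[i] == k) print vals[i]
    }'
}

# infor_status CODE: short label for the sixth INFOR sub-field (STATUS).
infor_status() {
    case "$1" in
        0) echo "inconsistent read direction" ;;
        1) echo "one breakpoint, no split reads" ;;
        2) echo "one breakpoint, covered by split reads" ;;
        3) echo "two breakpoints, neither covered by split reads" ;;
        4) echo "two breakpoints, one covered by split reads" ;;
        5) echo "two breakpoints, both covered by split reads" ;;
        *) echo "unknown" ;;
    esac
}

# Columns 8 to 10 of the example record shown above.
info='TSD=NULL,NULL;INFOR=HERVK,1,7831,7831,+,4;CR=64;SR=3;GTF=YES;GR=1.000'
format='GT:GQ:GL:DPN:DPI'
sample='1/1:40:0,0,1:0:67'

status=$(info_value INFOR "$info" | cut -d, -f6)
echo "status $status: $(infor_status "$status")"
echo "GT=$(sample_value GT "$format" "$sample") DPI=$(sample_value DPI "$format" "$sample") DPN=$(sample_value DPN "$format" "$sample")"
```

Looking up sample values by FORMAT key rather than by position matters here, because the example record lists DPN before DPI. In this record DPI=67 and DPN=0, which is consistent with GR=1.000 if the total read count is taken to be DPI + DPN; that reading of GR is an assumption.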
$ perl user_installed_path/Scripts/Combine_VCF_files.pl -l sample_list -c 1KGP.TE.sites.vcf -o Output_merged.vcf
$ perl user_installed_path/Scripts/Combine_VCF_files.pl -l sample_list -o Output_merged.vcf
Calculate the number of reads supporting non-insertions at the consensus TE loci for each sample. (It is recommended to filter out low-quality TE loci from the combined VCF file before running this script.)
$ perl user_installed_path/Scripts/Calculate_reads_among_nonTE_locations.pl -i Output_merged.vcf -S sampleID -o output.nonTE -b bamFile.bam -s paired-end -l length_insertsize -L std_insertsize -r read_length -t threads
Distinguish missing genotypes from non-insertion genotypes at the consensus TE loci to obtain the final genotypes for all samples
$ cat *.nonTE >nonTE_allsamples
$ perl user_installed_path/Scripts/Distinguish_nonTE_from_missing_genotype.pl -n nonTE_allsamples -v Output_merged.vcf -o Output_merged-final.vcf
You can follow the links listed below for information about downloading and/or installing all the dependencies except the modified SE_MEI, which is already included with ERVcaller:
• BWA-0.7.10: http://bio-bwa.sourceforge.net/bwa.shtml
• Bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
• Samtools-1.6 (or later than 1.2): http://www.htslib.org/doc/samtools.html
• R: https://www.r-project.org/
You can set the search path temporarily by running the Linux “export” command each time before you run ERVcaller, or you can add the command to your shell profile file (e.g., .bashrc) for long-term use. Do this for every tool listed above except R, whose path is usually set during installation. For example:
$ export PATH=$PATH:/home/Tools/samtools/
You can download hg38 here: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/. It is recommended that the file hg38.fa.gz is downloaded and unzipped for reference indexing.
Yes, you can. You should be able to use any TE reference sequences specific to your research.
You can find the test input data under the ERVcaller_v.1.4/test/ folder. Example input data are provided in both BAM and FASTQ formats for testing.
There is also an example VCF output file in the folder: ERVcaller_v.1.4/test/example_output/
You can find the full information here: https://samtools.github.io/hts-specs/VCFv4.2.pdf.
The following command line was used to produce the example file:
$ perl ERVcaller_v.1.4.pl -i TE_seq -f .bam -H hg38.fa -T TE_consensus.fa -G
You can use “-t” to enable multi-threaded computing. You can also skip the genotyping function, which significantly speeds up ERVcaller. In addition, you may increase the length of split reads (-S) to reduce the number of split reads that are potentially caused by sequencing errors.
Do we need to provide the full path to the human reference genome and ERV reference genome in the command line, even if they’re in the executable’s directory?
Yes.
Yes.
Yes, we include TE insertions even when they fall within reference TE sequences of the same type. However, the accuracy will be significantly increased by removing potential nested TEs.
To retain high-quality TE insertions, it is important to filter out TE insertions located within reference TEs of the same type using BEDtools and to filter out TE loci with a low genotype quality (e.g., GQ < 10).
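The genotype-quality part of this filtering can be sketched with awk. This is an illustration rather than an ERVcaller script: the function name filter_gq is hypothetical, it assumes a tab-delimited single-sample VCF (sample values in column 10), and it drops records whose GQ is missing as well as those below the threshold. The BEDtools step for nested TEs is not covered here.

```shell
# filter_gq MIN_GQ < in.vcf > out.vcf
# Passes header lines through and keeps a record only if the GQ value
# of the sample in column 10 is at least MIN_GQ.
# A missing GQ value (".") evaluates to 0 and is therefore dropped.
filter_gq() {
    awk -F'\t' -v min="$1" '
        /^#/ { print; next }
        {
            n = split($9, keys, ":")     # FORMAT keys, e.g. GT:GQ:GL:DPN:DPI
            split($10, vals, ":")        # matching values for the sample
            for (i = 1; i <= n; i++)
                if (keys[i] == "GQ" && vals[i] + 0 < min + 0) next
            print
        }'
}
```

For example, "filter_gq 10 < sample.vcf > sample.GQ10.vcf" applies the GQ < 10 cutoff mentioned above; both file names are placeholders.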
ERVcaller is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license. It may be used for non-commercial use only. For inquiries about a commercial license, please contact the first or the corresponding author or The University of Vermont Innovations.
Download: www.uvm.edu/genomics/software/ERVcaller.html
Chen X and Li D. ERVcaller: Identifying and genotyping non-reference unfixed endogenous retroviruses (ERVs) and other transposable elements (TEs) using next-generation sequencing data. Bioinformatics, Volume 35, Issue 20, Pages 3913–3922. https://doi.org/10.1093/bioinformatics/btz205.