/RetroSeq

RetroSeq is a bioinformatics tool that searches for mobile element insertions from aligned reads in a BAM file and a library of reference transposable elements. Please read the wiki page (link below) for usage instructions. Also, there is a page on the wiki describing how the 1000 genomes CEU trio was carried out with the files and parameters used for the various steps.

Primary LanguagePerl

RetroSeq

What is RetroSeq?

RetroSeq is a tool for discovery and genotyping of transposable element variants (TEVs) (also known as mobile element insertions) from next-gen sequencing reads aligned to a reference genome in BAM format. The goal is to call TEVs that are not present in the reference genome but present in the sample that has been sequenced. It should be noted that RetroSeq can be used to locate any class of viral insertion in any species where whole-genome sequencing data with a suitable reference genome is available.

If you want to find if a transposable element that is present in the reference genome and not present in your sample (a deletion in structural variation terminology), then you should use a structural variation deletion caller such as Breakdancer.

RetroSeq is a two phase process, the first being the read pair discovery phase where discorandant mate pairs are detected and assigned to a TE class (Alu, SINE, LINE, etc.) by using either the annotated TE elements in the reference and/or aligned with Exonerate to the supplied library of viral sequences.

Dependencies

RetroSeq makes use of the bedtools package, samtools, [Exonerate] (http://www.ebi.ac.uk/~guy/exonerate/), and various common unix tools such as sort and grep. RetroSeq will check if these are installed and available in the default path.

NOTE: RetroSeq has been primarily tested on BAM files produced by MAQ and BWA. There is no guarantee it will work on BAM files from other aligners.

Using RetroSeq

1 Discovery Phase

The goal here is to pass through the BAM and identify discordant read pairs that might support a TE insertion. You can either supply a tab delimited file specifying a set of TE types (e.g. Alu, LINE etc.) and the corresponding BED file of locations where these are in the reference genome (-refTEs parameter). Alternatively, you can provide a tab delimited file specifying a set of viral/TE types and the corresponding fasta file with a set of consensus sequences for these (-eref parameter).

Usage: retroseq.pl -discover -bam -eref -output [-srmode] [-q ] [-id ] [-len -noclean]

-bam        BAM file of paired reads mapped to reference genome
-output     Output file to store candidate supporting reads (required for calling step)
[-eref      Tab file with list of transposon types and the corresponding fasta file of reference sequences (e.g. SINE   /home/me/refs/SINE.fasta). Required when the -align option is used.]
[-refTEs    Tab file with TE type and BED file of reference elements. These will be used to quickly assign discordant reads the TE types and avoid alignment. Using this will speed up discovery dramatically.]
[-noclean   Do not remove intermediate output files. Default is to cleanup.]
[-q         Minimum mapping quality for a read mate that anchors the insertion call. Default is 30. Parameter is optional.]
[-id        Minimum percent ID for a match of a read to the transposon references. Default is 90.]
[-len       Minimum length of a hit to the transposon references. Default is 36bp.]
[-rgs       Comma separated list of readgroups to operate on. Default is all.]
[-exd       Fofn of BED files of regions where discordant mates falling into will be excluded e.g. simple repeats, centromeres, telomeres]
[-align     Do the computational expensive exonerate PE discordant mate alignment step]

The discovery stage will produce an output file (specified by the -output parameter) with lists of read pair names per TE type. Multiple discovery stage outputs can be inputted to the calling phase (i.e. to allow parallelisation of the discovery phase).

2 Calling

The calling phase takes one or more outputs from the discovery phase, clusters the reads, and carries out various checks on the breakpoints to decide if a TEV is present. You can provide a list of locations to ignore per TE type - this would typically be the list of locations of the reference elements of that type (-filter option).

Usage: retroseq.pl -call -bam -input -ref -output [-srinput -filter -cleanup -reads -depth -hets]

-bam            BAM file OR BAM fofn
-input          Either a single output file from the PE discover stage OR a prefix of a set of files from discovery to be combined for calling OR a fofn of discovery stage output files
-ref            Fasta of reference genome
-output         Output file name (VCF)
[-hets          Call heterozygous insertions. Default is homozygous.]
[-filter        Tab file with TE type and BED file of reference elements. These will be filtered out from the calling.]
[-region        Call a particular chromosome only (chr) OR region (chr:start-end) only]
[-depth         Max average depth of a region to be considered for calling. Default is 200.]
[-reads         It is the minimum number of reads required to make a call. Default is 5. Parameter is optional.]
[-q             Minimum mapping quality for a read mate that anchors the insertion call. Default is 30. Parameter is optional.]
[-ignoreRGs     Single read group name OR a file of readgroups that should be ignored. Default is none.]
[-noclean       Do not remove intermediate output files. Default is to cleanup.]

Output

The final TE calls from RetroSeq are in VCF format. The calls are annotated with information on number of supporting reads (GQ tag). The FL tag ranges from 1-8 and gives information on the breakpoint with 8 being the most confident calls and lower values indicating calls that don’t meet the breakpoint criteria for reasons such as lack of 5’ or 3’ reads.

More information on the VCF format and tags is on the tutorial page

Human CEU Trio Example

There is a page on the wiki describing how the CEU human trio was analysed in our paper

Citing RetroSeq

If you find RetroSeq useful, please cite our paper.

Questions or help

For questions or support contact Thomas Keane (tk2 --at-- sanger.ac.uk)