ngs_te_mapper2: A program to identify transposable element insertions using next generation sequencing data
ngs_te_mapper2 is a re-implementation of the method for detecting transposable element (TE) insertions from next-generation sequencing (NGS) data originally described in Linheiro and Bergman (2012) PLoS ONE 7(2): e30008. ngs_te_mapper2 uses a three-stage procedure to annotate non-reference TEs as the span of target site duplication (TSD), following the framework described in Bergman (2012) Mob Genet Elements. 2:51-54.
In the first stage, WGS reads are queried against a library of TE sequences to identify 'junction reads' that span the start/end of TE and genomic flanking sequences are retained. Such reads are often referred as 'split reads', although in reality these reads are not split in the resequenced genome.
In the second stage, junction reads from each side of TE insertion identified in the first stage are separately aligned to a reference genome that is hard-masked with RepeatMasker (http://www.repeatmasker.org/) using the same TE library from stage one. Genome-wide coverage profiles are computed using samtools v1.9 and genomic intervals with enriched coverage from junction read clusters on the 5' and 3' side of TEs are annotated in bed format. Regions of overlap between intervals of junction read clusters on the 5' and 3' side of TEs define the locations of the TSDs for candidate non-reference TE insertions. The orientation of the TE is determined from the relative orientation of alignments of the junction reads to the reference genome and TE library.
In the third stage, all reads from the original whole genome shotgun sequence data are used to query against the same hard-masked reference genome as in stage two. This additional mapping step is necessary to obtain all reads that span the TE-flank junction, as well as identify if any reads are present for the alternative ``reference" haplotype that does not carry the TE insertion. For each candidate non-reference TE insertion site, number of junction reads covering 5' and 3' side of each candidate TE insertion are estimated as the number of soft-clipped reads overlapping a 10bp window on the 5' and 3' side of the TSD, respectively (Count_junction5' and Count_junction3'). Number of non-reference reads (Count_non_ref) were estimated as max(Count_junction5', Count_junction3'). Number of reference reads (Count_ref) were estimated as number of non-soft-clipped reads spanning the TSD with at least 3bp extension on both side. The allele frequency for non-reference TEs is heuristically estimated as Count_non-ref/(Count_non_ref + Count_ref).
Reference TE insertions are detected using a similar strategy to non-reference insertions, independently of any reference TE annotation. The first stage in detecting reference TE insertions is identical to the first stage of detecting non-reference TE insertions described above. The second stage in identifying reference TE insertions involves alignment of the renamed, but otherwise unmodified, junction reads to the reference genome. Alignments of the complete junction read (i.e. non-TE and TE components) are clustered to identify the two ends of the reference TE insertion. The orientation of the reference TE is then determined from the relative orientation of alignments of the junction reads to the reference genome and TE library.
ngs_te_mapper2 is written in python3 and is designed to run on a Linux operating system. Installation of software dependencies for ngs_te_mapper2 is automated by Conda, thus a working installation of Conda is required to install ngs_te_mapper2. Conda can be installed via the Miniconda installer.
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O $HOME//miniconda.sh
bash ~/miniconda.sh -b -p $HOME/miniconda # silent mode
echo "export PATH=\$PATH:\$HOME/miniconda/bin" >> $HOME/.bashrc # add to .bashrc
source $HOME/.bashrc
conda init
conda init
requires you to close and open a new terminal before it take effect
conda update conda
After installing and updating Conda, you can now use conda to install dependencies and create running environment for ngs_te_mapper2.
git clone git@github.com:bergmanlab/ngs_te_mapper2.git
cd ngs_te_mapper2
conda env create -f envs/ngs_te_mapper2.yml
- This installs all the software dependencies needed to run ngs_te_mapper2 into the ngs_te_mapper2 Conda environment.
conda activate ngs_te_mapper2
- This adds the dependencies installed in the ngs_te_mapper2 conda environment to the environment PATH so that they can be used by the ngs_te_mapper2 scripts.
- This environment must always be activated prior to running any of the ngs_te_mapper2 scripts
- NOTE: Sometimes activating conda environments does not work via conda activate myenv when run through a script submitted to a queueing system, this can be fixed by activating the environment in the script as shown below
CONDA_BASE=$(conda info --base)
source ${CONDA_BASE}/etc/profile.d/conda.sh
conda activate ngs_te_mapper2
- For more on Conda: see the Conda User Guide.
- A test dataset is provided in the
test/
directory, you can test whether your ngs_te_mapper2 installation is successful by running ngs_te_mapper2 on this dataset, which should take less than one minute to finish on a single thread machine.
conda activate ngs_te_mapper2
cd test
python3 ../sourceCode/ngs_te_mapper2.py -o test_output -f reads.fastq -r ref_1kb.fasta -l library.fasta
- FASTQ File (
-f/--reads
)- Raw reads from paired-end or single-end sequencing run in fastq or fastq.gz format.
- Multiple fastq/fastq.gz files can be provided separated by comma (ep.
-f R1.fasta,R2.fasta
).
- TE library FASTA (
-l/--library
)- A FASTA file containing a consensus sequence for each TE family. Note: Each family should only be represented in one sequence in this file.
- Example consensus FASTA file
- Reference FASTA (
-r/--reference
)- The genome sequence of the reference genome in FASTA format.
usage: ngs_te_mapper2.py [-h] -f READS -l LIBRARY -r REFERENCE [-a ANNOTATION]
[-n REGION] [-w WINDOW] [--min_mapq MIN_MAPQ]
[--min_af MIN_AF] [--tsd_max TSD_MAX]
[--gap_max GAP_MAX] [-m MAPPER] [-t THREAD] [-o OUT]
[-p PREFIX] [-k]
Script to detect non-reference TEs from single end short read data
required arguments:
-f READS, --reads READS
raw reads in fastq or fastq.gz format, separated by
comma
-l LIBRARY, --library LIBRARY
TE concensus sequence
-r REFERENCE, --reference REFERENCE
reference genome
optional arguments:
-h, --help show this help message and exit
-a ANNOTATION, --annotation ANNOTATION
reference TE annotation in GFF3 format (must have
'Target' attribute in the 9th column)
-w WINDOW, --window WINDOW
merge window for identifying TE clusters (default =
10)
--min_mapq MIN_MAPQ minimum mapping quality of alignment (default = 20)
--min_af MIN_AF minimum allele frequency (default = 0.1)
--tsd_max TSD_MAX maximum TSD size (default = 25)
--gap_max GAP_MAX maximum gap size (default = 5)
-t THREAD, --thread THREAD
thread (default = 1)
-o OUT, --out OUT output dir (default = '.')
-p PREFIX, --prefix PREFIX
output prefix
-k, --keep_files If provided then all intermediate files will be kept
(default: remove intermediate files)
Note: The optional reference TE annotation input should in theory speed up the program. ngs_te_mapper2 expects the TE annotation to be in GFF3 format and Target
attribute must be included in the 9th column that represents TE family name. If you have *.out
annotation generated by RepeatMasker, you can use this utility script to convert from *.out
to GFF3 format.
ngs_te_mapper2 outputs reference and non-referece TE insertion predictions in BED format (0-based).
<sample>.nonref.bed
: non-reference TE insertion annotation predicted by ngs_te_mapper2 pipeline in BED format (0-based).<sample>.ref.bed
: reference TE insertion annotation predicted by ngs_te_mapper2 pipeline in BED format (0-based).
ngs_te_mapper2 generates standard BED file <sample>.nonref.bed
and <sample>.ref.bed
that have detailed information for each reference and non-reference TE insertion.
Column | Description |
---|---|
chromosome | The chromosome name where the TE insertion occurred |
position | Starting breakpoint position of the TE insertions. |
end | Ending breakpoint position of the TE insertions. |
info | Includes TE family, TSD, Allele Frequency, 3' support, 5' support and reference reads. Separated by '|'. |
score | '.' |
strand | Strand that TE insertion occurs |
For each ngs_te_mapper2 run, a log file called <sample>.log
is generated that records all the major steps in the program and error messages.
Please use the Github Issue page if you have questions.
To cite ngs_te_mapper2 in publications, please use:
Linheiro, R.S. and C.M. Bergman (2012) Whole Genome Resequencing Reveals Natural Target Site Preferences of Transposable Elements in Drosophila melanogaster. PLoS ONE 7(2): e30008 http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0030008
Copyright (c) 2020 Shunhua Han and Casey M. Bergman
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.