TIP_finder: An HPC software to detect transposable element insertion polymorphism in large genomic datasets

TIP_finder is a pipeline that aims to detect transposable element insertion polymorphisms (TIPs) arising from TE dynamics. It follows the methodology proposed by TRACKPOSON [1] and reduces execution time by up to 55 times on very large genomic datasets. TIP_finder uses a parallel strategy based on HPC techniques and scales across many computational nodes (or servers) and multi-core architectures, which makes it especially useful for the massive sequencing projects of the current (post-)genomic era.

Prerequisites

TIP_finder uses the following bioinformatics software: Bowtie2 (v. 2.3.4.1) to map the paired reads of the genomic data onto the indexed consensus sequence of each TE/HERV family; Samtools (v. 1.9) to process the Bowtie2 output and keep only unmapped reads; bedtools (v. 2.26.0) to split the reference genome into 10 kb windows and count the reads in these windows; and NCBI blastn (v. 2.10.0) or Magic-BLAST to align the unmapped reads. See the Installation section for how to install them in an Anaconda environment.

TIP_finder was developed in Python 3 (3.8) and uses the following libraries: sys, time, os, subprocess, argparse and mpi4py. In addition, TIP_finder_utils.py requires math, pandas, Matplotlib, seaborn and SciPy.

Installation:

We highly recommend installing the Python packages within an Anaconda environment. To create one, run the command below:

conda create -n tip_finder python=3.8

Then activate it:

conda activate tip_finder

Then install the required Python packages:

conda install -c anaconda mpi4py
conda install -c anaconda psutil

For TIP_finder_utils:

conda install -c anaconda pandas 
conda install -c conda-forge matplotlib
conda install -c anaconda seaborn

Finally, install the prerequisites:

conda install -c bioconda bowtie2
conda install -c bioconda samtools
conda install -c bioconda bedtools
conda install -c bioconda blast
conda install -c bioconda magicblast

Usage:

Preliminary steps:

bowtie2-build TYPE_ref_retrotes.fa TYPE_ref_retrotes
makeblastdb -in reference_genome.fasta -dbtype nucl (if you are using Magic-BLAST, the -parse_seqids option is required)
bedtools makewindows -g chr_list.txt -w 10000 > reference_genome_10kbwindows.bed
Create the comma-separated read_files.txt file, which contains three columns: 1) dataset name, 2) path to the forward-reads file, 3) path to the reverse-reads file.
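
To make the windowing step concrete, the sketch below reproduces in Python what the bedtools makewindows command above is expected to write: fixed-size, BED-style (0-based, half-open) windows, with the last window of each chromosome truncated at the chromosome end. The function name and the chromosome size are hypothetical and serve only as illustration; the actual pipeline relies on bedtools.

```python
def make_windows(chrom_sizes, window=10_000):
    """Yield (chrom, start, end) windows in BED-style 0-based, half-open coordinates."""
    for chrom, length in chrom_sizes:
        for start in range(0, length, window):
            # The final window is cut at the chromosome end.
            yield chrom, start, min(start + window, length)


# Hypothetical chromosome of 25,000 bp, for illustration only.
for chrom, start, end in make_windows([("Chr1", 25_000)]):
    print(f"{chrom}\t{start}\t{end}")
# Chr1    0       10000
# Chr1    10000   20000
# Chr1    20000   25000
```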

NOTE:

The chr_list.txt file must have the following structure (tab-separated):

Ch1Name<TAB>length
Ch2Name<TAB>length
Ch3Name<TAB>length

The names of the sequences (or chromosomes) must be the same as those in the reference_genome.fasta file, otherwise TIP_finder will not be able to locate the positions of the TIPs, producing empty results.
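
One way to keep the names consistent is to derive chr_list.txt from the reference FASTA itself. The following is a minimal sketch, not part of TIP_finder: the function names are hypothetical, and it assumes that the sequence name is the header text up to the first whitespace, which should be checked against the names your BLAST database actually reports.

```python
def fasta_lengths(lines):
    """Return [(name, length)] for each FASTA record found in an iterable of lines."""
    records = []
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, length))
            header = line[1:].split()
            # Assumption: the sequence name ends at the first whitespace.
            name, length = (header[0] if header else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        records.append((name, length))
    return records


def write_chr_list(fasta_path, out_path):
    """Write a tab-separated 'name<TAB>length' file from a FASTA file."""
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for name, length in fasta_lengths(fasta):
            out.write(f"{name}\t{length}\n")
```

For example, write_chr_list("reference_genome.fasta", "chr_list.txt") would produce a file in the layout shown above.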
The read_files.txt file must have the following structure (comma-separated):

dataset1Name,path_to_forward_reads.fastq,path_to_reverse_reads.fastq
dataset2Name,path_to_forward_reads.fastq,path_to_reverse_reads.fastq
dataset3Name,path_to_forward_reads.fastq,path_to_reverse_reads.fastq
dataset4Name,path_to_forward_reads.fastq,path_to_reverse_reads.fastq
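
Before launching a long run, it can be useful to check that read_files.txt follows this layout. The sketch below is a hypothetical helper (not shipped with TIP_finder) that reports lines without exactly three comma-separated fields and, optionally, read files that do not exist.

```python
import csv
import os


def check_read_files(lines, require_existing=True):
    """Return a list of human-readable problems found in a read_files.txt listing."""
    problems = []
    for number, row in enumerate(csv.reader(lines), start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != 3:
            problems.append(f"line {number}: expected 3 columns, found {len(row)}")
            continue
        _name, forward, reverse = (field.strip() for field in row)
        if require_existing:
            for path in (forward, reverse):
                if not os.path.isfile(path):
                    problems.append(f"line {number}: file not found: {path}")
    return problems
```

A typical call would be check_read_files(open("read_files.txt")); an empty list means that no problems were detected by these checks.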

Sequences in TYPE_ref_retrotes.fa file must have unique IDs, otherwise bowtie2 will fail.
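
Duplicated IDs can be spotted before running bowtie2-build. This is a minimal sketch with a hypothetical function name, assuming that the ID is the header text up to the first whitespace:

```python
from collections import Counter


def duplicated_ids(lines):
    """Return the sorted sequence IDs that occur more than once in a FASTA file."""
    ids = Counter(
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].split()
    )
    return sorted(seq_id for seq_id, count in ids.items() if count > 1)
```

For instance, duplicated_ids(open("TYPE_ref_retrotes.fa")) should return an empty list for a file that satisfies the requirement above.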

TIP_finder execution

mpirun -np num_processes -hosts=server_name python3.8 TIP_finder.py -f read_files.txt -o folder_results -t TE_family_name -b TYPE_ref_retrotes -l reference_genome.fasta -w reference_genome_10kbwindows.bed

where num_processes is the number of processors available on your system and server_name is the name of the server where TIP_finder will run.

NOTE

If you want to run TIP_finder using a SLURM job, you can use the following script:

#!/bin/bash

#SBATCH --job-name=TIP_finder
#SBATCH -D /path/to/your/working/directory
#SBATCH --output=output_file.out
#SBATCH -e error_file.err
#SBATCH -n number_of_processors
#SBATCH -N 1

# Remember to load the prerequisites such as bowtie2, samtools, etc (for example using module load if it exists in your system).

conda activate tip_finder

echo "Running in $SLURM_JOB_NODELIST"
mpirun -np number_of_processors -hosts=$SLURM_JOB_NODELIST python3.8 TIP_finder.py -f read_files.txt -o folder_results -t TE_family_name -b TYPE_ref_retrotes -l reference_genome.fasta -w reference_genome_10kbwindows.bed

TIP_finder_utils execution

TIP_finder_utils provides utilities to automatically process the results generated by TIP_finder:

finalMatrix: Creates a presence/absence matrix that joins the results from all datasets in a folder that were processed with the same TE family name. To run it, provide the following parameters:

python3 TIP_finder_utils.py -u finalMatrix -t TE_family_name -o /path/to/output/directory -d /path/to/directory/containing/TIP_finder/results -m min_number_of_maps

If the minimum number of maps required is not specified, TIP_finder uses a default of 5.
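
The idea of a thresholded presence/absence matrix can be illustrated with a small sketch. It is not the TIP_finder_utils implementation: the input layout (a dictionary of per-window counts for each dataset) and the function name are hypothetical, and it assumes that min_number_of_maps is the minimum number of reads mapped to a window for a TIP to be called present.

```python
def presence_absence(counts_by_dataset, min_maps=5):
    """Build a presence/absence matrix from per-dataset window counts.

    counts_by_dataset: {dataset: {window: mapped_read_count}} (hypothetical layout).
    Returns (datasets, {window: [0 or 1 per dataset]}).
    """
    datasets = sorted(counts_by_dataset)
    # Keep only windows that reach the threshold in at least one dataset.
    windows = sorted(
        {
            window
            for counts in counts_by_dataset.values()
            for window, n in counts.items()
            if n >= min_maps
        }
    )
    matrix = {
        window: [
            1 if counts_by_dataset[d].get(window, 0) >= min_maps else 0
            for d in datasets
        ]
        for window in windows
    }
    return datasets, matrix


# Toy counts, for illustration only.
counts = {
    "d1": {("Chr1", 0): 7, ("Chr1", 10000): 2},
    "d2": {("Chr1", 0): 5, ("Chr1", 20000): 9},
}
datasets, matrix = presence_absence(counts)
print(datasets)  # ['d1', 'd2']
print(matrix)    # {('Chr1', 0): [1, 1], ('Chr1', 20000): [0, 1]}
```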

histograms: Generates histograms of the TIPs found in the results. This utility needs the final matrix file generated by the "finalMatrix" utility of TIP_finder_utils. To run it, provide the following parameters:

python3 TIP_finder_utils.py -u histograms -o /path/to/output/directory -f path_to_final_matrix_file

peaks: Creates one peak chart per chromosome showing the cumulative TIP frequencies of both conditions. Chromosomes are divided into windows to keep the charts human-readable. To run it, provide the following parameters:

python3 TIP_finder_utils.py -u peaks -o /path/to/output/directory -1 path_to_final_matrix_file_condition1 -2 path_to_final_matrix_file_condition2 -w windows_length

If the -w parameter is not specified, TIP_finder uses a default of 1000000.
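
As an illustration of how a window length turns individual TIP positions into chartable values, the sketch below sums frequencies per window. It is only a sketch under stated assumptions: the (chromosome, position, frequency) input layout and the function name are hypothetical, and the actual aggregation in TIP_finder_utils may differ.

```python
from collections import Counter


def bin_tips(positions, window_length=1_000_000):
    """Sum TIP frequencies per window.

    positions: iterable of (chrom, position, frequency) tuples (hypothetical layout).
    Returns {(chrom, window_start): summed frequency}.
    """
    totals = Counter()
    for chrom, pos, freq in positions:
        window_start = (pos // window_length) * window_length
        totals[(chrom, window_start)] += freq
    return dict(totals)


# Toy data, for illustration only.
tips = [("Chr1", 10_000, 3), ("Chr1", 990_000, 2), ("Chr1", 1_000_000, 4)]
print(bin_tips(tips))  # {('Chr1', 0): 5, ('Chr1', 1000000): 4}
```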

association: When datasets from two different conditions are studied, one interesting analysis is to find which TIPs are statistically associated with a given condition. This utility creates a list of TIPs that are statistically associated with a condition of interest at a given confidence level. To run it, provide the following parameters:

python3 TIP_finder_utils.py -u association -o /path/to/output/directory -1 path_to_final_matrix_file_condition1 -2 path_to_final_matrix_file_condition2 -n confidence_level

If the confidence level is not specified, TIP_finder uses a default of 0.95.
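
To show what such an association analysis can look like, the sketch below applies a two-sided Fisher's exact test to the presence/absence counts of each TIP in the two conditions. This choice of test is an assumption made for illustration: the statistical test actually used by TIP_finder_utils may be different, no multiple-testing correction is applied here, and all function names and the input layout are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x presences in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, p)


def associated_tips(presence1, presence2, n1, n2, confidence=0.95):
    """Return [(tip, p_value)] for TIPs with p < 1 - confidence.

    presence1 / presence2: {tip: number of datasets carrying the TIP} per condition.
    n1 / n2: number of datasets in each condition.
    """
    alpha = 1 - confidence
    hits = []
    for tip in sorted(set(presence1) | set(presence2)):
        a = presence1.get(tip, 0)
        c = presence2.get(tip, 0)
        p = fisher_exact_two_sided(a, n1 - a, c, n2 - c)
        if p < alpha:
            hits.append((tip, p))
    return hits


# Toy example: t1 is present only in condition 1, t2 is evenly distributed.
hits = associated_tips({"t1": 10, "t2": 5}, {"t1": 0, "t2": 5}, n1=10, n2=10)
print([tip for tip, _ in hits])  # ['t1']
```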

Help

If you need more information about how to run TIP_finder, please execute:

python3 TIP_finder.py -h
python3 TIP_finder_utils.py -h

References:

[1] Carpentier, M. C., Manfroi, E., Wei, F. J., Wu, H. P., Lasserre, E., Llauro, C., ... & Panaud, O. (2019). Retrotranspositional landscape of Asian rice revealed by 3000 genomes. Nature communications, 10(1), 1-12.

Citation:

If you use TIP_finder in your research, please cite our paper:

Orozco-Arias, S.; Tobon-Orozco, N.; Piña, J.S.; Jiménez-Varón, C.F.; Tabares-Soto, R.; Guyot, R. TIP_finder: An HPC Software to Detect Transposable Element Insertion Polymorphisms in Large Genomic Datasets. Biology 2020, 9, 281.

simonorozcoarias/TIP_finder