clemgoub/TE-Aid

Issue when running TE-Aid in parallel

manighanipoor opened this issue · 2 comments

Hi,

I need to run TE-Aid in parallel, but the processes end up using shared resources and that causes errors.
I tried the command below on an HPC cluster (it copies TE-Aid into a temporary directory for each process so that the processes don't use the same database), but it does not work for all processes:

GENOME="../aipysurus_laevis.polished.fa"
TEAID="/hpcfs/users/a1177955/local/TE-Aid/"
parallel --bar --jobs 3 -a fasta_list.txt "mkdir -p ./tmp/{#}/TE-Aid && mkdir -p ./tmp/{#}/output && cp -ar $TEAID/* ./tmp/{#}/TE-Aid/ && ln -sf $(realpath $GENOME) ./tmp/{#}/genome_file && ./tmp/{#}/TE-Aid/TE-Aid -q {} -g ./tmp/{#}/genome_file -o ./tmp/{#}/output && mv ./tmp/{#}/output/* ./" && rm -r ./tmp/

This is what I got (it only worked for process 1 and gave an error for processes 2 and 3):

0% 0:3=0s fasta_3.fa query: fasta_2.fa
ref genome: ./tmp/2/genome_file
TE -> genome blastn e-value: 10e-8
full length min ratio: 0.9
hits transparency: 0.3
full length hits transparency: 0.9
no ORF detected, skipping blastp...
[1] "R: ploting genome blastn results and computing coverage..."
[1] "consensus length: 360 bp"
[1] "R: ploting self dot-plot and orf/protein hits..."
[1] "no orf to plot..."
null device
1
Done! The graph (.pdf) can be found in the output folder: ./tmp/2/output
Warning message:
In file(file, "rt") :
cannot open file './tmp/2/output/orftetable': No such file or directory
33% 1:2=31s fasta_3.fa query: fasta_1.fa
ref genome: ./tmp/1/genome_file
TE -> genome blastn e-value: 10e-8
full length min ratio: 0.9
hits transparency: 0.3
full length hits transparency: 0.9
RepeatPeps is downloaded and formatted, blastp-ing...
[1] "R: ploting genome blastn results and computing coverage..."
[1] "consensus length: 1582 bp"
[1] "R: ploting self dot-plot and orf/protein hits..."
null device
1
Done! The graph (.pdf) can be found in the output folder: ./tmp/1/output
66% 2:1=11s fasta_3.fa query: fasta_3.fa
ref genome: ./tmp/3/genome_file
TE -> genome blastn e-value: 10e-8
full length min ratio: 0.9
hits transparency: 0.3
full length hits transparency: 0.9
no ORF detected, skipping blastp...
[1] "R: ploting genome blastn results and computing coverage..."
[1] "consensus length: 541 bp"
[1] "R: ploting self dot-plot and orf/protein hits..."
[1] "no orf to plot..."
null device
1
Done! The graph (.pdf) can be found in the output folder: ./tmp/3/output
Warning message:
In file(file, "rt") :
cannot open file './tmp/3/output/orftetable': No such file or directory
100% 3:0=0s fasta_3.fa

Would you please let me know what the solution is?

Cheers,
Mani

Hi Mani,

First of all, as far as I know, TE-Aid wasn't made for running in parallel. The basic output of this tool is a PDF plot that you have to inspect manually, which is not feasible for a multitude of TEs. In other words, TE-Aid was designed to work with a specific consensus to get an overview of its structure and genome representation.
Second, to maximize speed without running TE-Aid in parallel, and to avoid potential collisions, you could just loop over your fastas with a bash script while using the same output folder. If your files and the corresponding fasta headers have different names, that should work fine, and you won't download/generate the BLAST databases for each fasta. I haven't worked with X. laevis, but for danio, whose genome is about half the size, it takes ~15 seconds to run TE-Aid once the databases are prepared, so it shouldn't be too bad for your clawed friend either. Anyhoo, I would just submit a bash script to your cluster that loops over your fastas:

#!/usr/bin/env bash
#SBATCH parameters or whatever HPC control system you have
GENOME=/path/to/genome

# Run TE-Aid on each fasta in turn, reusing the same output folder
for fa in ./*.fasta
do
    TE-Aid -q "${fa}" -g "${GENOME}" -o output_folder
done

And thirdly, the formatting of the parallel command in your question is broken, which makes it harder to read and understand.
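For what it's worth, a one-liner like that is easier to follow when the per-job steps live in a small bash function. What follows is only a sketch of the same idea from the question (a private copy of TE-Aid and a private output folder for each job), not a tested fix for the warnings in the log. The names `run_one`, `TEAID_SRC`, `WORK_ROOT` and `DEST` are made up for illustration; the `-q`, `-g` and `-o` options are the ones already used in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: run one TE-Aid job inside its own working directory.
# TEAID_SRC, GENOME, WORK_ROOT and DEST are placeholder settings
# (hypothetical names), not TE-Aid options.
TEAID_SRC="${TEAID_SRC:-/path/to/TE-Aid}"
GENOME="${GENOME:-/path/to/genome}"
WORK_ROOT="${WORK_ROOT:-./tmp}"
DEST="${DEST:-.}"

run_one() {
    local query="$1"   # fasta file with the consensus to inspect
    local job="$2"     # unique job id, e.g. {#} from GNU parallel
    local work="${WORK_ROOT}/${job}"

    mkdir -p "${work}/TE-Aid" "${work}/output" || return 1

    # Private copy of TE-Aid for this job
    cp -a "${TEAID_SRC}/." "${work}/TE-Aid/" || return 1

    # Per-job symlink to the genome, as in the original command
    ln -sf "$(realpath "${GENOME}")" "${work}/genome_file" || return 1

    "${work}/TE-Aid/TE-Aid" -q "${query}" -g "${work}/genome_file" \
        -o "${work}/output" || return 1

    # Collect the results, then drop this job's working directory
    mv "${work}/output/"* "${DEST}/" && rm -r "${work}"
}

# Make the function and its settings visible to child bash shells,
# which is what GNU parallel needs in order to call it
export -f run_one
export TEAID_SRC GENOME WORK_ROOT DEST

# Example call (not run here):
#   parallel --jobs 3 -a fasta_list.txt run_one {} {#}
```

Since every step checks its exit status, a job that fails leaves its `./tmp/<job>` folder behind, so you can see afterwards which query broke and at what stage.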

Cheers,
Artem

Hi,
Thanks, I was able to resolve the issue.