Issue when running TE-Aid in parallel
manighanipoor opened this issue · 2 comments
Hi,
I need to run TE-Aid in parallel, but doing so causes errors because the processes use shared resources.
I tried the following command on an HPC cluster (it copies TE-Aid to a temporary directory for each process so that the processes don't use the same database), but it does not work for all processes:
GENOME="../aipysurus_laevis.polished.fa"
TEAID="/hpcfs/users/a1177955/local/TE-Aid/"
parallel --bar --jobs 3 -a fasta_list.txt "mkdir -p ./tmp/{#}/TE-Aid && mkdir -p ./tmp/{#}/output && cp -ar
This is what I got (it worked for process 1 but gave errors for processes 2 and 3):
0% 0:3=0s fasta_3.fa query: fasta_2.fa
ref genome: ./tmp/2/genome_file
TE -> genome blastn e-value: 10e-8
full length min ratio: 0.9
hits transparency: 0.3
full length hits transparency: 0.9
no ORF detected, skipping blastp...
[1] "R: ploting genome blastn results and computing coverage..."
[1] "consensus length: 360 bp"
[1] "R: ploting self dot-plot and orf/protein hits..."
[1] "no orf to plot..."
null device
1
Done! The graph (.pdf) can be found in the output folder: ./tmp/2/output
Warning message:
In file(file, "rt") :
cannot open file './tmp/2/output/orftetable': No such file or directory
33% 1:2=31s fasta_3.fa query: fasta_1.fa
ref genome: ./tmp/1/genome_file
TE -> genome blastn e-value: 10e-8
full length min ratio: 0.9
hits transparency: 0.3
full length hits transparency: 0.9
RepeatPeps is downloaded and formatted, blastp-ing...
[1] "R: ploting genome blastn results and computing coverage..."
[1] "consensus length: 1582 bp"
[1] "R: ploting self dot-plot and orf/protein hits..."
null device
1
Done! The graph (.pdf) can be found in the output folder: ./tmp/1/output
66% 2:1=11s fasta_3.fa query: fasta_3.fa
ref genome: ./tmp/3/genome_file
TE -> genome blastn e-value: 10e-8
full length min ratio: 0.9
hits transparency: 0.3
full length hits transparency: 0.9
no ORF detected, skipping blastp...
[1] "R: ploting genome blastn results and computing coverage..."
[1] "consensus length: 541 bp"
[1] "R: ploting self dot-plot and orf/protein hits..."
[1] "no orf to plot..."
null device
1
Done! The graph (.pdf) can be found in the output folder: ./tmp/3/output
Warning message:
In file(file, "rt") :
cannot open file './tmp/3/output/orftetable': No such file or directory
100% 3:0=0s fasta_3.fa
Would you please let me know what the solution is?
Cheers,
Mani
Hi Mani,
First of all, as far as I know, TE-Aid wasn't made to run in parallel. The basic output of this tool is a PDF plot that you have to inspect manually, which is not feasible for a multitude of TEs. In other words, TE-Aid was designed to work on a specific consensus, to give an overview of its structure and its representation in the genome.
Second, to maximize speed without running TE-Aid in parallel, and to avoid potential collisions, you could just loop over your fastas in a bash script while using the same output folder. If your files and the corresponding fasta headers have different names, that should work fine, and you won't download/generate BLAST databases for each fasta. I haven't worked with X. laevis, but for Danio, whose genome is about half the size, TE-Aid takes ~15 seconds per run once the databases are prepared, so it shouldn't be too bad for your clawed friend either. Anyhoo, I would just submit a bash script to your cluster that loops over your fastas:
#!/usr/bin/env bash
#SBATCH parameters, or the equivalent for whatever HPC scheduler you have
GENOME=/path/to/genome
# Run TE-Aid on each fasta in turn, reusing the same output folder
for fa in ./*.fasta
do
    TE-Aid -q "${fa}" -g "${GENOME}" -o output_folder
done
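If it helps to see which queries failed in a long sequential run, the loop above could be wrapped along these lines. This is only a sketch: `TEAID_BIN`, `run_teaid_batch` and the `<output>_logs` folder are hypothetical names, and it assumes TE-Aid returns a non-zero exit status when a run fails, which I have not verified.

```shell
#!/usr/bin/env bash
# Sketch: sequential TE-Aid runs with one log file per query.
# TEAID_BIN, run_teaid_batch and the "<output>_logs" folder are illustrative names.

TEAID_BIN="${TEAID_BIN:-TE-Aid}"   # TE-Aid launcher, assumed to be on PATH

# usage: run_teaid_batch <genome> <output_dir> <query.fasta>...
run_teaid_batch() {
    local genome=$1 outdir=$2
    shift 2
    local logdir="${outdir%/}_logs"
    local failed=0 fa name
    mkdir -p "$outdir" "$logdir"
    for fa in "$@"; do
        name=$(basename "$fa")
        # Same flags as in the loop above: -q query, -g genome, -o output folder.
        if "$TEAID_BIN" -q "$fa" -g "$genome" -o "$outdir" > "$logdir/$name.log" 2>&1; then
            echo "ok      $name"
        else
            # Assumption: a failed TE-Aid run exits with a non-zero status.
            echo "FAILED  $name (see $logdir/$name.log)"
            failed=$((failed + 1))
        fi
    done
    # The function itself fails if any query failed.
    [ "$failed" -eq 0 ]
}
```

It would be called as, for example, `run_teaid_batch "$GENOME" output_folder ./*.fasta`. Keeping the logs outside the output folder is just a precaution so they cannot clash with files TE-Aid writes there.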
Thirdly, the formatting of the parallel command in your question is broken, which makes it harder to read and understand.
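One way to keep such a command readable is to move the per-job steps into a bash function and have parallel call it by name. The sketch below is not a reconstruction of the truncated command in the question: `run_isolated`, `TEAID_SRC` and the `./tmp/<job>` layout are hypothetical, and it assumes the launcher script inside the TE-Aid folder is itself named `TE-Aid`. If the genome's BLAST database is also built on the fly, it may need the same per-job treatment, which is not handled here.

```shell
#!/usr/bin/env bash
# Sketch only: run_isolated, TEAID_SRC and the ./tmp/<job> layout are
# illustrative assumptions, not the command from the question.

TEAID_SRC="${TEAID_SRC:-/path/to/TE-Aid}"   # folder holding a TE-Aid checkout (placeholder)

# usage: run_isolated <job_number> <query.fa> <genome.fa>
run_isolated() {
    local job=$1 query=$2 genome=$3
    local work="./tmp/$job"
    mkdir -p "$work/output" "$work/TE-Aid"
    # Give each job a private copy so jobs do not share TE-Aid's files.
    cp -a "$TEAID_SRC/." "$work/TE-Aid/"
    # Assumes the launcher script is named TE-Aid inside that folder.
    "$work/TE-Aid/TE-Aid" -q "$query" -g "$genome" -o "$work/output"
}

# With GNU parallel running under bash, the function can be called by name:
#   export -f run_isolated; export TEAID_SRC
#   parallel --jobs 3 -a fasta_list.txt run_isolated {#} {} "$GENOME"
```

Each step then sits on its own line, so a failing `mkdir`, `cp` or TE-Aid call is easier to spot than in a single quoted one-liner.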
Cheers,
Artem
Hi,
Thanks, I was able to resolve the issue.