Increase in processing time
anagaga27 opened this issue · 4 comments
Hello,
I'm trying to run panaroo (without alignment) on 22,416 genomes.
Although I'm using 20 threads, it seems like the process is only parallelised at the beginning, and I'm afraid it is going to take too long.
Do you think it is feasible to use panaroo with such a big dataset (22,416 GFFs)?
Thank you in advance.
Hi, thanks for asking!
This depends a bit on the size and diversity of the genomes, but in general >10,000 genomes is pushing the limits of what can be run under a time constraint. As you have identified, there is unfortunately a single-threaded bottleneck during panaroo's pangenome graph construction which is not easily parallelised (see #126).
What we recommend in these situations is to split the dataset into subsets that will run within the time constraint, and then run panaroo-merge
on the resulting outputs to combine them. Depending on how big/diverse your dataset is, these subsets could be quite large (>5,000 isolates).
Does this make sense? Let us know if you have any questions about this or run into any difficulty!
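A minimal sketch of that split-and-merge workflow, for anyone reading along. The chunk size, directory names, and thread count are placeholders to adapt; the flags follow the panaroo documentation but are worth double-checking against your installed version:

```bash
#!/usr/bin/env bash
set -euo pipefail

# List all annotation files and split them into chunks of 5000
# (pick a chunk size that fits your time budget).
# Assumes the paths contain no spaces.
ls gff_dir/*.gff > all_gffs.txt
split -l 5000 all_gffs.txt chunk_

# Run panaroo independently on each chunk.
i=0
for chunk in chunk_*; do
    out="pangenome_${i}"
    mkdir -p "$out"
    panaroo -i $(cat "$chunk") -o "$out" --clean-mode strict -t 20
    i=$((i + 1))
done

# Combine the per-chunk pangenomes into one.
panaroo-merge -d pangenome_* -o merged_output -t 20
```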
Thank you so much for your quick reply!!
I think I will do that and split the dataset, but I've seen there is the parameter "--merge_paralogs".
Maybe I don't understand exactly what it does, but wouldn't it also be an option to reduce the processing time, since that step is the bottleneck?
It's not been benchmarked to my knowledge, but I wouldn't expect --merge_paralogs
to have a substantial impact on the runtime. Most of the graph operations (not just those dealing with paralogs) need to be done serially, because the outcome of one can affect subsequent operations.
Thank you for your answer.
I have split the dataset and run panaroo on 2,239 genomes, but this error came up:
```
Traceback (most recent call last):
  File "/home/agargam/anaconda3/bin/panaroo", line 10, in <module>
    sys.exit(main())
  File "/home/agargam/anaconda3/lib/python3.9/site-packages/panaroo/__main__.py", line 371, in main
    G = collapse_families(G,
  File "/home/agargam/anaconda3/lib/python3.9/site-packages/panaroo/clean_network.py", line 113, in collapse_families
    cdhit_clusters = iterative_cdhit(G,
  File "/home/agargam/anaconda3/lib/python3.9/site-packages/panaroo/cdhit.py", line 411, in iterative_cdhit
    run_cdhit_est(input_file=temp_input_file.name,
  File "/home/agargam/anaconda3/lib/python3.9/site-packages/panaroo/cdhit.py", line 147, in run_cdhit_est
    subprocess.run(cmd, shell=True, check=True)
  File "/home/agargam/anaconda3/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'cd-hit-est -T 10 -i /home/agargam/TFM/Archivos gff/Archivos_GFF/output_moderate_CV_Schmerer_Salmeron_Carlos_Green/tmpl_2byl3b/tmp7shrgnbs -o /home/agargam/TFM/Archivos gff/Archivos_GFF/output_moderate_CV_Schmerer_Salmeron_Carlos_Green/tmpl_2byl3b/tmp7od3jxef -c 0.99 -s 0.0 -aL 0.0 -AL 99999999 -aS 99999999 -AS 99999999 -r 1 -M 0 -d 999 -mask NX -n 7 > /dev/null' returned non-zero exit status 1.
```
I think it is related to the temporary folder "tmpl_2byl3b", but I don't understand where the problem is.
Have you dealt with this type of problem before?
Thank you.
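One observation worth checking here (a hypothesis from the log above, not a confirmed diagnosis): panaroo passes the cd-hit-est command to a shell as a single string, and the working path contains a space ("Archivos gff"), so the shell would split the -i and -o arguments in two. A minimal sketch to test that behaviour in isolation, using a throwaway directory and FASTA file (all names below are hypothetical):

```bash
# Reproduce the suspected failure mode: an unquoted space splits the path.
mkdir -p "dir with space"
printf '>seq1\nATGCATGCATGCATGC\n' > "dir with space/in.fa"

# Likely fails: the shell passes '-i dir', and 'with' and 'space/in.fa'
# become stray arguments, so cd-hit-est exits non-zero.
cd-hit-est -i dir with space/in.fa -o clustered -c 0.99 -n 7 \
    || echo "failed as expected"

# Works: quoting keeps the path intact.
cd-hit-est -i "dir with space/in.fa" -o clustered -c 0.99 -n 7
```

If the space is indeed the cause, renaming the directory to avoid it (e.g. Archivos_gff) may be a simpler workaround than patching the call.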