Can't use panaroo on reference genomes annotated with prokka
AnnaLew opened this issue · 4 comments
Hi, I want to use panaroo on reference genomes annotated with prokka, but I am facing an error.
These are the commands I tried:
panaroo -i /panaroo/*.gff -o results --clean-mode strict --remove-invalid-genes
panaroo -i /panaroo/*.gff -o results --clean-mode strict
This is the output:
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.07it/s]
running cmd: cd-hit -T 1 -i results/combined_protein_CDS.fasta -o results/combined_protein_cdhit_out.txt -c 0.98 -s 0.98 -aL 0.0 -AL 99999999 -aS 0.0 -AS 99999999 -M 0 -d 999 -g 1 -n 2
================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), May 15 2023, 22:49:31
Command: cd-hit -T 1 -i results/combined_protein_CDS.fasta -o
results/combined_protein_cdhit_out.txt -c 0.98 -s 0.98
-aL 0.0 -AL 99999999 -aS 0.0 -AS 99999999 -M 0 -d 999
-g 1 -n 2
Started: Mon Feb 19 11:03:57 2024
================================================================
Output
----------------------------------------------------------------
Your word length is 2, using 5 may be faster!
total seq: 6721
longest and shortest : 5627 and 29
Total letters: 2227591
Sequences have been sorted
Approximated minimal memory consumption:
Sequence : 3M
Buffer : 1 X 17M = 17M
Table : 1 X 0M = 0M
Miscellaneous : 0M
Total : 20M
Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 255313500
comparing sequences from 0 to 6721
......
6721 finished 6707 clusters
Approximated maximum memory consumption: 29M
writing new database
writing clustering information
program completed !
Total CPU time 2.38
generating initial network...
Processing paralogs...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 122960.05it/s]
collapse mistranslations...
Processing depth: 1
Iteration: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6721/6721 [00:00<00:00, 531975.57it/s]
Processing depth: 2
Iteration: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6721/6721 [00:00<00:00, 560157.32it/s]
Processing depth: 3
Iteration: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6721/6721 [00:00<00:00, 587669.48it/s]
collapse gene families...
Processing depth: 1
Iteration: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6721/6721 [00:00<00:00, 501966.15it/s]
Processing depth: 2
Iteration: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6721/6721 [00:00<00:00, 572628.27it/s]
Processing depth: 3
Iteration: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6721/6721 [00:00<00:00, 571201.11it/s]
trimming contig ends...
refinding genes...
Number of searches to perform: 0
Searching...
2it [00:03, 1.96s/it]
translating hits...
removing by consensus...
Updating output...
Number of refound genes: 0
collapse gene families with refound genes...
Traceback (most recent call last):
File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/bin/panaroo", line 10, in <module>
sys.exit(main())
File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/panaroo/__main__.py", line 438, in main
centroid_to_index=centroid_to_index)[0]
File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/panaroo/clean_network.py", line 104, in collapse_families
node_count = max(list(G.nodes())) + 10
Additionally, when trying to run panaroo on the files from NCBI, I am facing an error:
pre-processing gff3 files...
0%| | 0/2 [00:00<?, ?it/s]Problem reading GFF3 file: /data/leuven/350/vsc35094/extremophiles-thesis/data-thesis/test/panaroo_2/GCF_000006765.1.gff
Error reading prokka input!
Traceback (most recent call last):
File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/panaroo/prokka.py", line 306, in process_prokka_input
for gff_no, gff in job)
File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/joblib/parallel.py", line 1085, in __call__
if self.dispatch_one_batch(iterator):
File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
self._dispatch(tasks)
File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/joblib/parallel.py", line 819, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 597, in __init__
self.results = batch()
File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/joblib/parallel.py", line 289, in __call__
for func, args, kwargs in self.items]
File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/joblib/parallel.py", line 289, in <listcomp>
for func, args, kwargs in self.items]
File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/panaroo/prokka.py", line 143, in get_gene_sequences
raise RuntimeError("Error reading prokka input!")
RuntimeError: Error reading prokka input!
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/bin/panaroo", line 10, in <module>
sys.exit(main())
File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/panaroo/__main__.py", line 327, in main
args.n_cpu, args.table)
File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/panaroo/prokka.py", line 316, in process_prokka_input
raise RuntimeError("Error reading prokka input!")
RuntimeError: Error reading prokka input!
(thesis) bash-4.4$
I saw others reporting similar errors, but their solution doesn't apply to my problem. I think the issue is the way the prokka-outputted gff files are formatted. You can find examples of the NCBI and prokka-outputted files here. I would be grateful if you could help me run panaroo on my data!
I forgot to add that I am working with data from different species. I am aware that panaroo is not designed to work with multi-species data, but I was nevertheless expecting to at least be able to run it successfully.
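For anyone hitting the same "Error reading prokka input!" message with NCBI files, a quick sanity check may help narrow things down. The sketch below assumes (based on my understanding of panaroo's documented input format, not on anything confirmed in this thread) that panaroo wants Prokka-style GFF3 with the genome sequence appended after a `##FASTA` line, which GFF files downloaded from NCBI usually lack. The function name `summarize_gff` is made up for illustration.

```python
def summarize_gff(gff_path):
    """Count CDS features and report whether a '##FASTA' section is present.

    Assumption: a Prokka-style GFF3 embeds the nucleotide sequence after a
    '##FASTA' directive; a file without it may not be readable by panaroo.
    """
    n_cds = 0
    has_fasta = False
    with open(gff_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                # Everything after this directive is sequence, not annotation.
                has_fasta = True
                break
            if line.startswith("#") or not line.strip():
                continue  # skip comments, directives and blank lines
            fields = line.rstrip("\n").split("\t")
            # GFF3 feature lines have 9 tab-separated columns; column 3 is the type.
            if len(fields) == 9 and fields[2] == "CDS":
                n_cds += 1
    return {"cds_features": n_cds, "has_fasta_section": has_fasta}
```

Calling `summarize_gff("GCF_000006765.1.gff")` on the NCBI file and on one of the prokka outputs, then comparing the two results, should show whether the missing sequence section is the difference.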
Hi,
It looks like your genomes are very divergent, which does not suit the default Panaroo parameters. I would also recommend running panaroo in sensitive mode when comparing species like this.
You could try much more relaxed clustering thresholds such as
panaroo -i GCF_00000*.gff -o results --clean-mode sensitive --remove-invalid-genes --threads 10 --len_dif_percent 0.5 -c 0.8 -f 0.5
However, you may be better off using a sequence clustering tool such as MMseqs2 rather than a pangenome tool, given the very large sequence diversity in your set of genomes.
I will also update the code to provide a more informative error message.
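The divergence is visible in the CD-HIT log posted above ("6721 finished 6707 clusters"). As a rough illustration, the arithmetic below uses only those two numbers from the log; the interpretation that almost nothing is shared between the genomes is a reading of those numbers, and some of the merged sequences could also be paralogs within a single genome.

```python
# Numbers taken from the CD-HIT output in this issue.
total_seqs = 6721   # "total seq: 6721"
clusters = 6707     # "6721 finished 6707 clusters"

# Sequences that joined an existing cluster instead of founding their own.
merged = total_seqs - clusters
fraction = merged / total_seqs

print(f"{merged} of {total_seqs} proteins ({fraction:.2%}) clustered with another sequence")
```

So at the 98% identity threshold almost every protein ends up in a cluster of its own, which is why more relaxed thresholds or a different tool are suggested.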
Panaroo v1.4.2 now includes a more informative error message.
Thank you for your response! I just want to mention that I ended up using MMseqs2 and it did provide better results, so thank you for your help :)
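For later readers, here is a minimal sketch of how an MMseqs2 clustering run might be set up from Python. The exact settings used above are not given in this thread, so everything here is an assumption: the input file name `all_proteins.faa`, the output prefix, the temporary directory, and the identity and coverage values (chosen only to echo the relaxed thresholds suggested earlier). `build_mmseqs_command` is a hypothetical helper, not part of MMseqs2 or panaroo.

```python
def build_mmseqs_command(fasta, out_prefix, tmp_dir, min_seq_id=0.5, coverage=0.8):
    """Build an 'mmseqs easy-cluster' command line as a list of arguments.

    The threshold defaults are illustrative assumptions, not recommendations.
    """
    return [
        "mmseqs", "easy-cluster",
        fasta,          # input protein FASTA (hypothetical file name below)
        out_prefix,     # prefix for the cluster result files
        tmp_dir,        # scratch directory used by MMseqs2
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity
        "-c", str(coverage),              # minimum alignment coverage
    ]


cmd = build_mmseqs_command("all_proteins.faa", "mmseqs_clusters", "tmp")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once MMseqs2 is installed and the input file exists.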