gtonkinhill/panaroo

Can't use panaroo on reference genomes annotated with prokka

AnnaLew opened this issue · 4 comments

Hi, I want to use panaroo on reference genomes annotated with prokka, but I am facing an error.

These are the commands I tried:

panaroo -i /panaroo/*.gff -o results --clean-mode strict --remove-invalid-genes

panaroo -i /panaroo/*.gff -o results --clean-mode strict

This is the output:

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.07it/s]
running cmd: cd-hit -T 1 -i results/combined_protein_CDS.fasta -o results/combined_protein_cdhit_out.txt -c 0.98 -s 0.98 -aL 0.0 -AL 99999999 -aS 0.0 -AS 99999999 -M 0 -d 999 -g 1 -n 2
================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), May 15 2023, 22:49:31
Command: cd-hit -T 1 -i results/combined_protein_CDS.fasta -o
         results/combined_protein_cdhit_out.txt -c 0.98 -s 0.98
         -aL 0.0 -AL 99999999 -aS 0.0 -AS 99999999 -M 0 -d 999
         -g 1 -n 2

Started: Mon Feb 19 11:03:57 2024
================================================================
                            Output                              
----------------------------------------------------------------
Your word length is 2, using 5 may be faster!
total seq: 6721
longest and shortest : 5627 and 29
Total letters: 2227591
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 3M
Buffer          : 1 X 17M = 17M
Table           : 1 X 0M = 0M
Miscellaneous   : 0M
Total           : 20M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 255313500

comparing sequences from          0  to       6721
......
     6721  finished       6707  clusters

Approximated maximum memory consumption: 29M
writing new database
writing clustering information
program completed !

Total CPU time 2.38
generating initial network...
Processing paralogs...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 122960.05it/s]
collapse mistranslations...
Processing depth:  1
Iteration:  1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6721/6721 [00:00<00:00, 531975.57it/s]
Processing depth:  2
Iteration:  1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6721/6721 [00:00<00:00, 560157.32it/s]
Processing depth:  3
Iteration:  1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6721/6721 [00:00<00:00, 587669.48it/s]
collapse gene families...
Processing depth:  1
Iteration:  1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6721/6721 [00:00<00:00, 501966.15it/s]
Processing depth:  2
Iteration:  1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6721/6721 [00:00<00:00, 572628.27it/s]
Processing depth:  3
Iteration:  1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6721/6721 [00:00<00:00, 571201.11it/s]
trimming contig ends...
refinding genes...
Number of searches to perform:  0
Searching...
2it [00:03,  1.96s/it]
translating hits...
removing by consensus...
Updating output...
Number of refound genes:  0
collapse gene families with refound genes...
Traceback (most recent call last):
  File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/bin/panaroo", line 10, in <module>
    sys.exit(main())
  File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/panaroo/__main__.py", line 438, in main
    centroid_to_index=centroid_to_index)[0]
  File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/panaroo/clean_network.py", line 104, in collapse_families
    node_count = max(list(G.nodes())) + 10

Additionally, when I try to run panaroo on the files from NCBI, I get a different error:

pre-processing gff3 files...
  0%|                                                                                                                                                                         | 0/2 [00:00<?, ?it/s]Problem reading GFF3 file:  /data/leuven/350/vsc35094/extremophiles-thesis/data-thesis/test/panaroo_2/GCF_000006765.1.gff

Error reading prokka input!
Traceback (most recent call last):
  File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/panaroo/prokka.py", line 306, in process_prokka_input
    for gff_no, gff in job)
  File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/joblib/parallel.py", line 1085, in __call__
    if self.dispatch_one_batch(iterator):
  File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
    self._dispatch(tasks)
  File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/joblib/parallel.py", line 819, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 597, in __init__
    self.results = batch()
  File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/joblib/parallel.py", line 289, in __call__
    for func, args, kwargs in self.items]
  File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/joblib/parallel.py", line 289, in <listcomp>
    for func, args, kwargs in self.items]
  File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/panaroo/prokka.py", line 143, in get_gene_sequences
    raise RuntimeError("Error reading prokka input!")
RuntimeError: Error reading prokka input!

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/bin/panaroo", line 10, in <module>
    sys.exit(main())
  File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/panaroo/__main__.py", line 327, in main
    args.n_cpu, args.table)
  File "/data/leuven/350/vsc35094/miniconda3/envs/thesis/lib/python3.6/site-packages/panaroo/prokka.py", line 316, in process_prokka_input
    raise RuntimeError("Error reading prokka input!")
RuntimeError: Error reading prokka input!

I saw others reporting similar errors, but the solution suggested there doesn't apply to my problem. I think the issue is the way in which the prokka-outputted gff file is formatted. You can find examples of the ncbi and prokka-outputted files here. I would be grateful if you could help me run panaroo on my data!
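One formatting difference that may be worth checking, stated here as an assumption rather than something confirmed in this thread: Prokka appends the genome sequence to its GFF3 output after a `##FASTA` line, while GFF files from NCBI usually contain only the annotation. A minimal sketch for checking which input files lack that section (the function names `has_embedded_fasta` and `files_missing_fasta` are hypothetical helpers, not part of panaroo):

```python
def has_embedded_fasta(gff_path):
    """Return True if the GFF3 file contains a '##FASTA' directive line."""
    with open(gff_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip() == "##FASTA":
                return True
    return False


def files_missing_fasta(gff_paths):
    """Return the paths (in input order) that have no '##FASTA' section."""
    return [path for path in gff_paths if not has_embedded_fasta(path)]
```

Calling `files_missing_fasta(glob.glob("/panaroo/*.gff"))` would list the files that have no embedded sequence. A file on that list is only a candidate cause of the "Error reading prokka input!" message; other formatting problems are possible.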

I forgot to add that I am working with data coming from different species. I am aware that panaroo is not designed to work with multi-species data, but I was nevertheless expecting to at least be able to run it successfully.

Hi,

It looks like your genomes are very divergent, which does not suit the default Panaroo parameters. I would also recommend running panaroo in sensitive mode when comparing species like this.

You could try much more relaxed clustering thresholds such as

panaroo -i GCF_00000*.gff -o results --clean-mode sensitive --remove-invalid-genes --threads 10 --len_dif_percent 0.5 -c 0.8 -f 0.5

However, you may be better off using a sequence clustering tool such as MMseqs2 rather than a pangenome tool, given the very large sequence diversity in your set of genomes.
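As a rough sketch of what that could look like, the snippet below assembles an `mmseqs easy-cluster` command for the protein FASTA that panaroo already wrote in the log above (`results/combined_protein_CDS.fasta`). The output prefix, temporary directory, and the threshold values are illustrative assumptions, not settings recommended in this thread, and `mmseqs_cluster_cmd` is a hypothetical helper name.

```python
import shlex


def mmseqs_cluster_cmd(fasta, out_prefix, tmp_dir,
                       min_seq_id=0.5, coverage=0.8, threads=1):
    """Build the argument list for an `mmseqs easy-cluster` run.

    The default thresholds are placeholders and should be tuned to the data.
    """
    return [
        "mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity
        "-c", str(coverage),              # minimum alignment coverage
        "--threads", str(threads),
    ]


cmd = mmseqs_cluster_cmd("results/combined_protein_CDS.fasta",
                         "mmseqs_clusters", "tmp")
# Print the command in a form that can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once MMseqs2 is installed and on the `PATH`.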

I will also update the code to provide a more informative error message.
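For illustration only, here is a sketch of the kind of guard that could give a clearer message. It assumes the first traceback failed because the graph had no nodes left when `max(list(G.nodes())) + 10` was evaluated; the traceback above is cut off before the exception line, so this is an assumption, and the function below is a hypothetical stand-in rather than the actual change made in Panaroo.

```python
def next_node_offset(node_ids):
    """Mirror `max(list(G.nodes())) + 10`, but fail with a readable message.

    `node_ids` stands in for the integer node ids of the gene graph.
    """
    node_ids = list(node_ids)
    if not node_ids:
        # Python's max() raises a bare ValueError on an empty sequence,
        # which says nothing about the likely cause.
        raise ValueError(
            "No gene clusters remain in the graph after cleaning. "
            "The input genomes may be too divergent for the chosen "
            "clean mode and clustering thresholds."
        )
    return max(node_ids) + 10
```

With a non-empty graph the behaviour is unchanged; with an empty one the user sees a hint about divergence and thresholds instead of a bare traceback.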

Panaroo v1.4.2 now includes a more informative error message.

Thank you for your response! I just want to mention that I ended up using MMseqs2 and it did provide better results, so thank you for your help :)