gtonkinhill/panaroo

Error after while processing paralogs?

bpeacock44 opened this issue · 7 comments

I am trying to run four E. coli assemblies through the pipeline. The annotation was done with PGAP, so I use the "--remove-invalid-genes" flag. It seems like all gff3 files get loaded properly but when Processing paralogs begins I run into an error. I'd appreciate some advice. Thank you! Here is the command/output below:

$ panaroo -i in.txt -o results_SeT --clean-mode strict -a core --aligner mafft --remove-invalid-genes
pre-processing gff3 files...
100%|████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:45<00:00, 11.43s/it]
running cmd: cd-hit -T 1 -i results_SeT/combined_protein_CDS.fasta -o results_SeT/combined_protein_cdhit_out.txt -c 0.98 -s 0.98 -aL 0.0 -AL 99999999 -aS 0.0 -AS 99999999 -M 0 -d 999 -g 1 -n 2
================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), Aug 20 2021, 08:39:56
Command: cd-hit -T 1 -i results_SeT/combined_protein_CDS.fasta
         -o results_SeT/combined_protein_cdhit_out.txt -c 0.98
         -s 0.98 -aL 0.0 -AL 99999999 -aS 0.0 -AS 99999999 -M 0
         -d 999 -g 1 -n 2

Started: Fri Jan 12 10:56:30 2024
================================================================
                            Output
----------------------------------------------------------------
Your word length is 2, using 5 may be faster!
total seq: 0
longest and shortest : 0 and 18446744073709551615
Total letters: 0
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 1 X 10M = 10M
Table           : 1 X 0M = 0M
Miscellaneous   : 0M
Total           : 10M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 50000000


        0  finished          0  clusters

Approximated maximum memory consumption: 10M
writing new database
writing clustering information
program completed !

Total CPU time 0.01
generating initial network...
Processing paralogs...
Traceback (most recent call last):
  File "/home/borneman/miniconda3/bin/panaroo", line 8, in <module>
    sys.exit(main())
  File "/home/borneman/miniconda3/lib/python3.9/site-packages/panaroo/__main__.py", line 340, in main
    G = collapse_paralogs(G, centroid_contexts, quiet=(not args.verbose))
  File "/home/borneman/miniconda3/lib/python3.9/site-packages/panaroo/clean_network.py", line 338, in collapse_paralogs
    node_count = max(list(G.nodes())) + 10
ValueError: max() arg is an empty sequence

Hi,

It looks like none of the gene sequences have made it to the clustering stage, as indicated by total seq: 0. This is leading to this error and the program crashing.

I would double check your annotation files and ensure they have CDS entries. These are the ones Panaroo makes use of.

Just to make sure - I am using this as my input file, which is the gff file followed by the assembly file on each line. This is the correct input? Thank you again.

K12.gff K12.fna
LF_82.gff LF_82.fna
V3_2.gff V3_2.fna
V3_5.gff V3_5.fna

Hi,

That looks correct to me. There has not been extensive testing on the output of PGAP so it's possible there is a bug in the code.
If possible, it would be very helpful if you could attach a couple of input files (gff/fna pairs) to this issue to help with diagnosing the problem.

Sure thing! I am new to genome annotation so I could very well have done something strange. I've put the files here.

Thanks very much for that. It looks like your fasta sequence names do not match those given in the gff file.
The sequence names in the fasta are prepended by >lcl|. To address this you can use sed. The following command can be used on your fasta file but note that it works inplace and thus will overwrite the originals.

sed -i 's/>lcl|/>/g' K12.fna

I will look at adding a more informative error message to Panaroo for the next release.

Many thanks again!