Error after while processing paralogs?

Question

Error after while processing paralogs?

bpeacock44 opened this issue 10 months ago · 7 comments

I am trying to run four E. coli assemblies through the pipeline. The annotation was done with PGAP, so I use the "--remove-invalid-genes" flag. It seems like all gff3 files get loaded properly but when Processing paralogs begins I run into an error. I'd appreciate some advice. Thank you! Here is the command/output below:

$ panaroo -i in.txt -o results_SeT --clean-mode strict -a core --aligner mafft --remove-invalid-genes
pre-processing gff3 files...
100%|████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:45<00:00, 11.43s/it]
running cmd: cd-hit -T 1 -i results_SeT/combined_protein_CDS.fasta -o results_SeT/combined_protein_cdhit_out.txt -c 0.98 -s 0.98 -aL 0.0 -AL 99999999 -aS 0.0 -AS 99999999 -M 0 -d 999 -g 1 -n 2
================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), Aug 20 2021, 08:39:56
Command: cd-hit -T 1 -i results_SeT/combined_protein_CDS.fasta
         -o results_SeT/combined_protein_cdhit_out.txt -c 0.98
         -s 0.98 -aL 0.0 -AL 99999999 -aS 0.0 -AS 99999999 -M 0
         -d 999 -g 1 -n 2

Started: Fri Jan 12 10:56:30 2024
================================================================
                            Output
----------------------------------------------------------------
Your word length is 2, using 5 may be faster!
total seq: 0
longest and shortest : 0 and 18446744073709551615
Total letters: 0
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 1 X 10M = 10M
Table           : 1 X 0M = 0M
Miscellaneous   : 0M
Total           : 10M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 50000000


        0  finished          0  clusters

Approximated maximum memory consumption: 10M
writing new database
writing clustering information
program completed !

Total CPU time 0.01
generating initial network...
Processing paralogs...
Traceback (most recent call last):
  File "/home/borneman/miniconda3/bin/panaroo", line 8, in <module>
    sys.exit(main())
  File "/home/borneman/miniconda3/lib/python3.9/site-packages/panaroo/__main__.py", line 340, in main
    G = collapse_paralogs(G, centroid_contexts, quiet=(not args.verbose))
  File "/home/borneman/miniconda3/lib/python3.9/site-packages/panaroo/clean_network.py", line 338, in collapse_paralogs
    node_count = max(list(G.nodes())) + 10
ValueError: max() arg is an empty sequence

Answer 1 · 2024-01-14T03:32:17.000Z

Hi,

It looks like none of the gene sequences have made it to the clustering stage, as indicated by total seq: 0. This is leading to this error and the program crashing.

I would double check your annotation files and ensure they have CDS entries. These are the ones Panaroo makes use of.

Answer 2 · 2024-01-15T21:55:38.000Z

I just checked one of the gff files - I assume you mean entries like this one? NODE_1_length_305759_cov_60.148568 Protein Homology CDS 174 1322 . - 0 ID=cds-pgaptmp_001246;Parent=gene-pgaptmp_001246;Name=extdb:pgaptmp_001246;Ontology_term=GO:0006012,GO:0046835,GO:0004335,GO:0005524;gbkey=CDS;gene=galK;go_function=galactokinase activity|0004335||IEA,ATP binding|0005524||IEA;go_process=galactose metabolic process|0006012||IEA,carbohydrate phosphorylation|0046835||IEA;inference=COORDINATES: similar to AA sequence:RefSeq:WP_001713229.1;locus_tag=pgaptmp_001246;product=galactokinase;protein_id=extdb:pgaptmp_001246;transl_table=11

…

On Sat, Jan 13, 2024 at 9:32 PM Gerry Tonkin-Hill ***@***.***> wrote: Hi, It looks like none of the gene sequences have made it to the clustering stage, as indicated by total seq: 0. This is leading to this error and the program crashing. I would double check your annotation files and ensure they have CDS entries. These are the ones Panaroo makes use of. — Reply to this email directly, view it on GitHub <#264 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AHWZYN5BESONU4V7C3CUQOTYONGUXAVCNFSM6AAAAABBYSQPFSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOJQHAZTKNBVGM> . You are receiving this because you authored the thread.Message ID: ***@***.***>

Answer 3 · 2024-01-15T21:57:10.000Z

Just to make sure - I am using this as my input file, which is the gff file followed by the assembly file on each line. This is the correct input? Thank you again.

K12.gff K12.fna
LF_82.gff LF_82.fna
V3_2.gff V3_2.fna
V3_5.gff V3_5.fna

Answer 4 · 2024-01-16T23:31:46.000Z

Hi,

That looks correct to me. There has not been extensive testing on the output of PGAP so it's possible there is a bug in the code.
If possible, it would be very helpful if you could attach a couple of input files (gff/fna pairs) to this issue to help with diagnosing the problem.

Answer 5 · 2024-01-17T00:17:43.000Z

Sure thing! I am new to genome annotation so I could very well have done something strange. I've put the files here.

Answer 6 · 2024-01-17T00:52:27.000Z

Thanks very much for that. It looks like your fasta sequence names do not match those given in the gff file.
The sequence names in the fasta are prepended by >lcl|. To address this you can use sed. The following command can be used on your fasta file but note that it works inplace and thus will overwrite the originals.

sed -i 's/>lcl|/>/g' K12.fna

I will look at adding a more informative error message to Panaroo for the next release.

Answer 7 · 2024-01-17T00:58:11.000Z

Many thanks again!