labgem/PPanGGOLiN

Writing gene-related data failed

Closed this issue · 3 comments

Hi, first of all, thanks for creating and supporting this amazing software; it has been very helpful so far.

I am building a pangenome from several E. coli strains we have sequenced in our lab. I annotated them with Bakta using the latest complete db (5.1), and then fed these annotations to the complete workflow:

ppanggolin all --anno genomes.gbff.txt --output ppanggolin_results -c 2 --verbose 2 -f

However, when it comes to writing the gene-related data to the HDF5 file (pangenome.h5), I get a TypeError for the "name" column:

Traceback (most recent call last):
  File "tables/tableextension.pyx", line 1676, in tables.tableextension.Row.__setitem__
TypeError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/bin/ppanggolin", line 10, in <module>
    sys.exit(main())
  File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/main.py", line 219, in main
    ppanggolin.workflow.all.launch(args)
  File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/workflow/all.py", line 288, in launch
    launch_workflow(args, panrgp=True, panmodule=True)
  File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/workflow/all.py", line 61, in launch_workflow
    write_pangenome(pangenome, filename, args.force, disable_bar=args.disable_prog_bar)
  File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/formats/writeBinaries.py", line 711, in write_pangenome
    write_annotations(pangenome, h5f, disable_bar=disable_bar)
  File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/formats/writeAnnotations.py", line 342, in write_annotations
    write_genedata(pangenome, h5f, annotation, genedata2gene, disable_bar)
  File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/formats/writeAnnotations.py", line 309, in write_genedata
    genedata_row["name"] = genedata.name
  File "tables/tableextension.pyx", line 1681, in tables.tableextension.Row.__setitem__
TypeError: invalid type (<class 'str'>) for column ``name``
/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/tables/file.py:113: UnclosedFileWarning:

Closing remaining open file: ppanggolin_results/pangenome.h5

Here is the complete output from the run.

2024-05-30 12:39:04 utils.py:l168 INFO	Command: /rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/bin/ppanggolin all --anno genomes.gbff.txt --output ppanggolin_results -c 2 --verbose 2 -f
2024-05-30 12:39:04 utils.py:l169 INFO	PPanGGOLiN version: 2.0.5
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--anno: genomes.gbff.txt" has been specified in the command line with a non-default value. Its value overwrites the default value (None).
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--force: True" has been specified in the command line with a non-default value. Its value overwrites the default value (False).
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--output: ppanggolin_results" has been specified in the command line with a non-default value. Its value overwrites the default value (ppanggolin_output_DATE2024-05-30_HOUR12.39.04_PID2061566).
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--verbose: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l668 DEBUG	4 all parameters have non-default value: cpu=2, force=True, output=ppanggolin_results, verbose=2
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing annotate arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG	1 annotate parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing cluster arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG	1 cluster parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing graph arguments in config file.
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing partition arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG	1 partition parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing rarefaction arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG	1 rarefaction parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing rgp arguments in config file.
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing spot arguments in config file.
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing module arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG	1 module parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing draw arguments in config file.
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing write_pangenome arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG	1 write_pangenome parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing write_genomes arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG	1 write_genomes parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l722 INFO	11 parameters have a non-default value.
2024-05-30 12:39:04 annotate.py:l503 INFO	Reading genomes.gbff.txt the list of genome files ...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  5.76file/s]
2024-05-30 12:39:06 annotate.py:l535 INFO	gene identifiers used in the provided annotation files were unique, PPanGGOLiN will use them.
2024-05-30 12:39:06 writeBinaries.py:l709 INFO	Writing genome annotations...
2024-05-30 12:39:06 writeAnnotations.py:l71 DEBUG	Writing 8 genomes
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 160547.52genome/s]
2024-05-30 12:39:06 writeAnnotations.py:l105 DEBUG	Writing 1600 contigs
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1600/1600 [00:00<00:00, 652365.74contigs/s]
2024-05-30 12:39:06 writeAnnotations.py:l148 DEBUG	Writing 36656 genes
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36656/36656 [00:00<00:00, 169713.94gene/s]
2024-05-30 12:39:06 writeAnnotations.py:l297 DEBUG	Writing 36509 gene-related data (can be lower than the number of genes)
 93%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌           | 33897/36509 [00:00<00:00, 687257.45genedata/s]

I can supply a few of the annotation files that I'm using as a test if necessary.

Thanks a lot.

Hi @dmartimarti,
A few of the annotation files would indeed be very helpful to check what's going on here.
Thanks!

The error looks similar to the one encountered in issues #95, #175 and #222. However, here the problem seems to be with the gene name rather than the product.

You might try to catch any problematic characters with this grep command on your gbff files:
LC_ALL=C grep -n -P '[\x80-\xFF]' *.g*ff
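
If grep with -P is not available on your system, the same check can be sketched in Python. This is only an illustration, not part of PPanGGOLiN: the function name find_non_ascii is made up for this example. It reads each file as raw bytes and reports the first byte at or above 0x80 on every line.

```python
import sys


def find_non_ascii(path):
    """Yield (line_number, column, text) for each line holding a byte >= 0x80.

    Hypothetical helper for illustration; it is not a PPanGGOLiN function.
    """
    with open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            for column, byte in enumerate(raw, start=1):
                if byte >= 0x80:
                    # Decode leniently so the offending line can still be printed.
                    text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
                    yield line_number, column, text
                    break  # one report per line is enough


if __name__ == "__main__":
    # Usage: python find_non_ascii.py file1.gbff file2.gff ...
    for name in sys.argv[1:]:
        for line_number, column, text in find_non_ascii(name):
            print(f"{name}:{line_number}:{column}: {text}")
```

The column is a byte offset (1-based), which makes it easy to jump to the exact position in an editor that shows raw bytes.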

Hi @JeanMainguy

That was it! For the record, the culprit was once again one of these double-wing motif proteins (gene mmcQ).
I removed the non-ASCII characters from the gff3 files, and this time it worked with a test subset I was playing with.

Thanks a lot for your prompt help!
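
For anyone landing here with the same error, the clean-up described above (removing non-ASCII characters from the annotation files) could be scripted roughly as follows. This is a sketch under assumptions, not the exact procedure used in this thread: to_ascii and sanitize_file are hypothetical names, the input is assumed to be UTF-8 text, and any character with no ASCII equivalent is replaced by a placeholder rather than silently dropped.

```python
import unicodedata


def to_ascii(text, placeholder="?"):
    """Return an ASCII-only copy of text (hypothetical helper, for illustration)."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        if ord(ch) < 128:
            out.append(ch)
        elif unicodedata.combining(ch):
            continue  # drop the accent, keep the base letter already emitted
        else:
            out.append(placeholder)  # no ASCII equivalent available
    return "".join(out)


def sanitize_file(src, dst, placeholder="?"):
    """Write an ASCII-only copy of src to dst, leaving src untouched."""
    with open(src, encoding="utf-8", errors="replace") as fin, \
            open(dst, "w", encoding="ascii") as fout:
        for line in fin:
            fout.write(to_ascii(line, placeholder))
```

Writing to a separate output file keeps the original annotations intact, so the changed lines can be reviewed with diff before rerunning the workflow.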