Writing gene-related data failed
Hi, first of all, thanks for creating and supporting this amazing software; it's been very helpful so far.
I am building a pangenome from several E. coli strains that we sequenced in our lab. I annotated them with bakta using the latest complete db (5.1), and then fed these annotations to the complete workflow:
ppanggolin all --anno genomes.gbff.txt --output ppanggolin_results -c 2 --verbose 2 -f
However, when it comes to writing the gene-related data to the HDF5 file (pangenome.h5), I get an error about the type of the value written to the name column:
Traceback (most recent call last):
File "tables/tableextension.pyx", line 1676, in tables.tableextension.Row.__setitem__
TypeError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/bin/ppanggolin", line 10, in <module>
sys.exit(main())
File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/main.py", line 219, in main
ppanggolin.workflow.all.launch(args)
File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/workflow/all.py", line 288, in launch
launch_workflow(args, panrgp=True, panmodule=True)
File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/workflow/all.py", line 61, in launch_workflow
write_pangenome(pangenome, filename, args.force, disable_bar=args.disable_prog_bar)
File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/formats/writeBinaries.py", line 711, in write_pangenome
write_annotations(pangenome, h5f, disable_bar=disable_bar)
File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/formats/writeAnnotations.py", line 342, in write_annotations
write_genedata(pangenome, h5f, annotation, genedata2gene, disable_bar)
File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/formats/writeAnnotations.py", line 309, in write_genedata
genedata_row["name"] = genedata.name
File "tables/tableextension.pyx", line 1681, in tables.tableextension.Row.__setitem__
TypeError: invalid type (<class 'str'>) for column ``name``
/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/tables/file.py:113: UnclosedFileWarning:
Closing remaining open file: ppanggolin_results/pangenome.h5
Here is the complete output from the run.
2024-05-30 12:39:04 utils.py:l168 INFO Command: /rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/bin/ppanggolin all --anno genomes.gbff.txt --output ppanggolin_results -c 2 --verbose 2 -f
2024-05-30 12:39:04 utils.py:l169 INFO PPanGGOLiN version: 2.0.5
2024-05-30 12:39:04 utils.py:l529 DEBUG The parameter "--anno: genomes.gbff.txt" has been specified in the command line with a non-default value. Its value overwrites the default value (None).
2024-05-30 12:39:04 utils.py:l529 DEBUG The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l529 DEBUG The parameter "--force: True" has been specified in the command line with a non-default value. Its value overwrites the default value (False).
2024-05-30 12:39:04 utils.py:l529 DEBUG The parameter "--output: ppanggolin_results" has been specified in the command line with a non-default value. Its value overwrites the default value (ppanggolin_output_DATE2024-05-30_HOUR12.39.04_PID2061566).
2024-05-30 12:39:04 utils.py:l529 DEBUG The parameter "--verbose: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l668 DEBUG 4 all parameters have non-default value: cpu=2, force=True, output=ppanggolin_results, verbose=2
2024-05-30 12:39:04 utils.py:l679 DEBUG Parsing annotate arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG 1 annotate parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG Parsing cluster arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG 1 cluster parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG Parsing graph arguments in config file.
2024-05-30 12:39:04 utils.py:l679 DEBUG Parsing partition arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG 1 partition parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG Parsing rarefaction arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG 1 rarefaction parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG Parsing rgp arguments in config file.
2024-05-30 12:39:04 utils.py:l679 DEBUG Parsing spot arguments in config file.
2024-05-30 12:39:04 utils.py:l679 DEBUG Parsing module arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG 1 module parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG Parsing draw arguments in config file.
2024-05-30 12:39:04 utils.py:l679 DEBUG Parsing write_pangenome arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG 1 write_pangenome parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG Parsing write_genomes arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG 1 write_genomes parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l722 INFO 11 parameters have a non-default value.
2024-05-30 12:39:04 annotate.py:l503 INFO Reading genomes.gbff.txt the list of genome files ...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 5.76file/s]
2024-05-30 12:39:06 annotate.py:l535 INFO gene identifiers used in the provided annotation files were unique, PPanGGOLiN will use them.
2024-05-30 12:39:06 writeBinaries.py:l709 INFO Writing genome annotations...
2024-05-30 12:39:06 writeAnnotations.py:l71 DEBUG Writing 8 genomes
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 160547.52genome/s]
2024-05-30 12:39:06 writeAnnotations.py:l105 DEBUG Writing 1600 contigs
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1600/1600 [00:00<00:00, 652365.74contigs/s]
2024-05-30 12:39:06 writeAnnotations.py:l148 DEBUG Writing 36656 genes
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36656/36656 [00:00<00:00, 169713.94gene/s]
2024-05-30 12:39:06 writeAnnotations.py:l297 DEBUG Writing 36509 gene-related data (can be lower than the number of genes)
93%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 33897/36509 [00:00<00:00, 687257.45genedata/s]
I can supply a few of the annotation files that I'm using as a test if necessary.
Thanks a lot.
Hi @dmartimarti,
A few of the annotation files would indeed be very helpful to check what's going on here.
Thanks!
The error seems a bit similar to the one encountered in these issues: #95, #175 and #222. However, here the problem seems to be with the gene name and not the product.
You might try to catch any problematic characters with this grep command on your gbff files:
LC_ALL=C grep -n -P [$'\x80'-$'\xFF'] *.g*ff
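If grep -P is not available on your system, the same check can be sketched in Python. This is only an illustration, not part of PPanGGOLiN: the function name find_non_ascii is hypothetical, and the sketch assumes the annotation files are UTF-8 text (undecodable bytes are reported as the replacement character U+FFFD).

```python
def find_non_ascii(path):
    """Return (line_number, column, character) for each non-ASCII character in a text file."""
    hits = []
    # errors="replace" turns undecodable bytes into U+FFFD, which is also reported.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line_number, line in enumerate(handle, start=1):
            for column, char in enumerate(line, start=1):
                if ord(char) > 127:  # anything outside the 7-bit ASCII range
                    hits.append((line_number, column, char))
    return hits
```

Calling find_non_ascii on each of your annotation files and printing the returned tuples gives the line and column of every character to inspect.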
Hi @JeanMainguy
That was it! For the record, the culprit was once again one of these double-wing motif proteins (gene mmcQ).
I tried removing the non-ASCII characters from the gff3 files and this time it worked with a test subset I was playing with.
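For anyone hitting the same error, here is a minimal sketch of one way to remove the non-ASCII characters; it is not necessarily how it was done above. The names strip_non_ascii and clean_file are hypothetical helpers, the sketch assumes UTF-8 input, and it writes a cleaned copy rather than modifying the original file.

```python
def strip_non_ascii(text, replacement=""):
    """Replace every non-ASCII character in text with an ASCII replacement string."""
    return "".join(char if ord(char) < 128 else replacement for char in text)


def clean_file(source, destination, replacement=""):
    """Write an ASCII-only copy of source to destination, line by line."""
    # The replacement must itself be ASCII, otherwise the write below fails.
    with open(source, encoding="utf-8", errors="replace") as src, \
            open(destination, "w", encoding="ascii") as dst:
        for line in src:
            dst.write(strip_non_ascii(line, replacement))
```

Passing a replacement such as "-" keeps words separated where a non-ASCII hyphen was used; the default simply drops the character. It is worth comparing the cleaned copy with the original before rerunning the workflow.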
Thanks a lot for your prompt help!