gtonkinhill/panaroo

List text file input method not working

lillycummins opened this issue · 6 comments

Hi,

I want to use output from Bakta (v1.9.1) as input for Panaroo (v1.3.4), so I am using the alternative input format, where I provide a text file listing either the Bakta .gff3 and .fna files (tab-separated) or the .gbff files. Both methods fail with the same error, which says the input cannot be read:

pre-processing gff3 files...
0%| | 0/1081 [00:06<?, ?it/s]
Error reading prokka input!
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/data/biol-micro-genomics/biol0144/envs/pangenome/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 463, in _process_worker
r = call_item()
File "/data/biol-micro-genomics/biol0144/envs/pangenome/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 291, in __call__
return self.fn(*self.args, **self.kwargs)
File "/data/biol-micro-genomics/biol0144/envs/pangenome/lib/python3.9/site-packages/joblib/parallel.py", line 589, in __call__
return [func(*args, **kwargs)
File "/data/biol-micro-genomics/biol0144/envs/pangenome/lib/python3.9/site-packages/joblib/parallel.py", line 589, in <listcomp>
return [func(*args, **kwargs)
File "/data/biol-micro-genomics/biol0144/envs/pangenome/lib/python3.9/site-packages/panaroo/prokka.py", line 198, in get_gene_sequences
else: raise ValueError("Invalid gene sequence!")
ValueError: Invalid gene sequence!
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/data/biol-micro-genomics/biol0144/envs/pangenome/lib/python3.9/site-packages/panaroo/prokka.py", line 294, in process_prokka_input
gene_sequence_list = Parallel(n_jobs=n_cpu)(
File "/data/biol-micro-genomics/biol0144/envs/pangenome/lib/python3.9/site-packages/joblib/parallel.py", line 1952, in __call__
return output if self.return_generator else list(output)
File "/data/biol-micro-genomics/biol0144/envs/pangenome/lib/python3.9/site-packages/joblib/parallel.py", line 1595, in _get_outputs
yield from self._retrieve()
File "/data/biol-micro-genomics/biol0144/envs/pangenome/lib/python3.9/site-packages/joblib/parallel.py", line 1699, in _retrieve
self._raise_error_fast()
File "/data/biol-micro-genomics/biol0144/envs/pangenome/lib/python3.9/site-packages/joblib/parallel.py", line 1734, in _raise_error_fast
error_job.get_result(self.timeout)
File "/data/biol-micro-genomics/biol0144/envs/pangenome/lib/python3.9/site-packages/joblib/parallel.py", line 736, in get_result
return self._return_or_raise()
File "/data/biol-micro-genomics/biol0144/envs/pangenome/lib/python3.9/site-packages/joblib/parallel.py", line 754, in _return_or_raise
raise self._result
ValueError: Invalid gene sequence!

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/data/biol-micro-genomics/biol0144/envs/pangenome/bin/panaroo", line 10, in <module>
sys.exit(main())
File "/data/biol-micro-genomics/biol0144/envs/pangenome/lib/python3.9/site-packages/panaroo/__main__.py", line 314, in main
process_prokka_input(args.input_files, args.output_dir,
File "/data/biol-micro-genomics/biol0144/envs/pangenome/lib/python3.9/site-packages/panaroo/prokka.py", line 306, in process_prokka_input
raise RuntimeError("Error reading prokka input!")
RuntimeError: Error reading prokka input!

Sample input data can be found here
Is there an obvious error in the way I am supplying my files, and am I using the correct output files from Bakta?

Thanks,
Lilly

Hi Lilly,

Bakta often outputs some annotations that Panaroo does not deal with by default. You can filter these out with the --remove-invalid-genes flag.
Let me know if this does not resolve the issue.

Hi Gerry,

Thanks, this flag resolves the issue when I supply a .gbff list, but the same error arises when I supply a .gff3 and .fna list with the --remove-invalid-genes flag.

Hi Lilly,

Sorry for the slow response. The split file input has not been tested as extensively.
Would it be possible to send a pair of files through that reproduces the issue?

Hi Gerry,

bakta_out.zip

I've also noticed that the Panaroo output when using Bakta .gbff files has no annotation information (e.g. all clusters are named group_XXX, with no non-unique gene name).

Hi Lilly,

It looks like this might be caused by unusual Unicode characters in the GFF3 file. Specifically, the following line includes curly quotation marks (‘ and ’) around "double-wing":

contig_13	Prodigal	CDS	84751	85107	.	-	0	ID=ONIOAA_14795;Name=putative DNA-binding protein with ‘double-wing’ structural motif%2C MmcQ/YjbR family;locus_tag=ONIOAA_14795;product=putative DNA-binding protein with ‘double-wing’ structural motif%2C MmcQ/YjbR family;Dbxref=COG:COG2315,COG:K,RefSeq:WP_000153726.1,SO:0001217,UniParc:UPI0002185663,UniRef:UniRef100_A0A066SY50,UniRef:UniRef50_P0AF51,UniRef:UniRef90_A0A234IDG1;gene=mmcQ
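
One way to locate lines like this across many files is to scan for any character outside the ASCII range. The following is a minimal Python sketch, not part of Panaroo or Bakta; the function name `find_non_ascii` is hypothetical and chosen for illustration only.

```python
import unicodedata


def find_non_ascii(lines):
    """Yield (line_number, column, character, unicode_name) for every
    non-ASCII character found in an iterable of text lines.

    Line and column numbers are 1-based.
    """
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                # Fall back to "UNKNOWN" for characters without a Unicode name.
                yield line_no, col, ch, unicodedata.name(ch, "UNKNOWN")


if __name__ == "__main__":
    import sys

    # Usage (assumed invocation): python find_non_ascii.py file1.gff3 file2.gff3 ...
    for path in sys.argv[1:]:
        # errors="replace" turns undecodable bytes into U+FFFD,
        # which is itself non-ASCII and is therefore reported too.
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line_no, col, ch, name in find_non_ascii(handle):
                print(f"{path}:{line_no}:{col}\t{ch!r}\t{name}")
```

Reporting the Unicode name alongside the position makes it easier to see whether the offending characters are typographic quotes, as here, or something else.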

To address this, you could use iconv:

iconv -f UTF-8 -t UTF-8 -c 103_ERR1981378.gff3 > converted_103_ERR1981378.gff3

These GFF3 files also include the fasta sequences so you should not need the .fna files.
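
To confirm that a given GFF3 file carries its own sequences, you can look for the `##FASTA` directive that the GFF3 format uses to introduce an embedded FASTA section. This is a minimal Python sketch under that assumption; the function name `count_embedded_fasta_records` is hypothetical and is not a Panaroo or Bakta function.

```python
def count_embedded_fasta_records(lines):
    """Return the number of FASTA records that follow a ##FASTA directive.

    Returns 0 if the directive is absent or no records follow it.
    """
    in_fasta = False
    records = 0
    for line in lines:
        if not in_fasta:
            # The directive marks the end of the annotation lines.
            if line.startswith("##FASTA"):
                in_fasta = True
        elif line.startswith(">"):
            # Each header line starts one sequence record.
            records += 1
    return records


if __name__ == "__main__":
    import sys

    # Usage (assumed invocation): python check_fasta.py file1.gff3 file2.gff3 ...
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            print(f"{path}\t{count_embedded_fasta_records(handle)}")
```

A count of zero suggests the file has no embedded sequences, in which case the separate .fna file would still be needed.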

If possible, it would be great to know how you ran Bakta. I'm not sure how often it outputs characters like these and whether many users are likely to encounter this problem.

Hi Gerry,

Thanks for this, panaroo now works with the converted gff3s! My issue has been solved.

This was my bakta command:

bakta --db path/to/db/ -t 4 --skip-crispr --output path/to/output --prefix ${GENOME} --force ${GENOME}