Error reading Bakta GFF3 files; converting them to Prokka-style GFF to make them work with Panaroo 1.3.0
FTouzain opened this issue · 3 comments
When I run panaroo 1.3.0 on Bakta GFF3 files with the command line below, I get the following error:
panaroo -t 6 --codon-table 11 -o panaroo_error --clean-mode moderate --remove-invalid-genes -c 0.95 -f 0.7 -c 0.95 -f 0.7 --len_dif_percent 0.98 --aligner mafft --alignment core -i bakta/*/*.gff3
pre-processing gff3 files...
0%| | 0/7 [00:00<?, ?it/s]Problem reading GFF3 file: bakta/ASM1293158/ASM1293158.gff3
Problem reading GFF3 file: bakta/ASM1487273/ASM1487273.gff3
Problem reading GFF3 file: bakta/ASM164327/ASM164327.gff3
Error reading prokka input!
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "PATH/.snakemake/conda/ece69840e0a63e0086cb1e591e3bb91a_/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 428, in _process_worker
r = call_item()
File "PATH/.snakemake/conda/ece69840e0a63e0086cb1e591e3bb91a_/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 275, in __call__
return self.fn(*self.args, **self.kwargs)
File "PATH/.snakemake/conda/ece69840e0a63e0086cb1e591e3bb91a_/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 620, in __call__
return self.func(*args, **kwargs)
File "PATH/.snakemake/conda/ece69840e0a63e0086cb1e591e3bb91a_/lib/python3.6/site-packages/joblib/parallel.py", line 289, in __call__
for func, args, kwargs in self.items]
File "PATH/.snakemake/conda/ece69840e0a63e0086cb1e591e3bb91a_/lib/python3.6/site-packages/joblib/parallel.py", line 289, in <listcomp>
for func, args, kwargs in self.items]
File "PATH/.snakemake/conda/ece69840e0a63e0086cb1e591e3bb91a_/lib/python3.6/site-packages/panaroo/prokka.py", line 127, in get_gene_sequences
raise RuntimeError("Error reading prokka input!")
RuntimeError: Error reading prokka input!
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "PATH/.snakemake/conda/ece69840e0a63e0086cb1e591e3bb91a_/lib/python3.6/site-packages/panaroo/prokka.py", line 272, in process_prokka_input
for gff_no, gff in job)
File "PATH/.snakemake/conda/ece69840e0a63e0086cb1e591e3bb91a_/lib/python3.6/site-packages/joblib/parallel.py", line 1098, in __call__
self.retrieve()
File "PATH/.snakemake/conda/ece69840e0a63e0086cb1e591e3bb91a_/lib/python3.6/site-packages/joblib/parallel.py", line 975, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "PATH/.snakemake/conda/ece69840e0a63e0086cb1e591e3bb91a_/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 567, in wrap_future_result
return future.result(timeout=timeout)
File "PATH/.snakemake/conda/ece69840e0a63e0086cb1e591e3bb91a_/lib/python3.6/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "PATH/.snakemake/conda/ece69840e0a63e0086cb1e591e3bb91a_/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
RuntimeError: Error reading prokka input!
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "PATH/.snakemake/conda/ece69840e0a63e0086cb1e591e3bb91a_/bin/panaroo", line 11, in <module>
sys.exit(main())
File "PATH/.snakemake/conda/ece69840e0a63e0086cb1e591e3bb91a_/lib/python3.6/site-packages/panaroo/__main__.py", line 300, in main
args.n_cpu, args.table)
File "PATH/.snakemake/conda/ece69840e0a63e0086cb1e591e3bb91a_/lib/python3.6/site-packages/panaroo/prokka.py", line 282, in process_prokka_input
raise RuntimeError("Error reading prokka input!")
RuntimeError: Error reading prokka input!
Note: I use panaroo 1.3.0 because I ran into errors when running panaroo after installing version 1.3.4 with mamba, and I know 1.3.0 works (from tests some months ago).
I found a way to make the Bakta GFF3 files work.
For each GFF3 file, do the following.
In the first command:
- the first grep removes lines starting with '# ' (not found in Prokka GFF)
- the second grep removes the '##feature-ontology' lines written by Bakta (not found in Prokka GFF)
- the third grep keeps the remaining '#' lines, so the sequence 'header' lines are written at the beginning of the new file (Bakta writes them throughout the GFF file, Prokka at the beginning)
grep -Ev '^\# ' {input.gff} | grep -Ev '^\#\#feature\-ontology ' | grep '^\#' > {output}
Then we append all other lines to the new GFF file:
grep -Ev '^\# ' {input.gff} | grep -Ev '^\#\#feature\-ontology ' | grep -v '^\#' >> {output}
Finally, we append the '##FASTA' tag, which is missing from the Bakta GFF, and then the FASTA sequences from the .fna file of the Bakta output (a Prokka GFF includes these sequences; the Bakta GFF does not):
echo '##FASTA' >> {output}
cut -d ' ' -f 1 {input.fna} >> {output}
where:
- {input.gff} is the original bakta gff3 file
- {input.fna} is the original bakta fna file
- {output} is the new gff file (converted to be 'prokka-gff-like')
- BAKTA_DIR is the original directory of bakta output
- BAKTA_CONVERTED_DIR is the directory for new bakta gff files
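To check whether a given GFF file shows the two differences described above (no '##FASTA' section, and '#' lines interleaved with feature lines), a small diagnostic along these lines may help. This is only a sketch: `inspect_gff` is a hypothetical helper name, not part of Panaroo or Bakta, and it looks only at the structure discussed in this issue.

```python
def inspect_gff(lines):
    """Report two structural properties of a GFF file given as an iterable of lines.

    Hypothetical helper for illustration; it is not part of Panaroo or Bakta.
    """
    body = []
    has_fasta = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            # Everything after this tag is sequence data, so stop here.
            has_fasta = True
            break
        body.append(line)

    seen_feature = False
    interleaved = False
    for line in body:
        if line.startswith("#"):
            # A '#' line after the first feature line means the headers
            # are spread through the file instead of grouped at the top.
            if seen_feature:
                interleaved = True
        elif line.strip():
            seen_feature = True

    return {"has_fasta": has_fasta, "interleaved_directives": interleaved}
```

A file that reports `has_fasta: False` or `interleaved_directives: True` is a candidate for the conversion described in this post.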
Code extracted from a Snakemake rule:
rule convert_baktagff_to_prokkagff:
    input:
        gff = BAKTA_DIR + "/{genome}/{genome}.gff3",
        fna = BAKTA_DIR + "/{genome}/{genome}.fna"
    output: BAKTA_CONVERTED_DIR + "/{genome}/{genome}.gff"
    wildcard_constraints: genome=r'[A-Za-z0-9]+'
    # Prokka writes the sequence headers at the beginning of the GFF file; Bakta writes them throughout the file.
    # 1: remove the useless comment lines and keep the sequence headers, written at the beginning of the new GFF file
    # 2: remove the useless comment lines and keep all other lines of the GFF file
    # 3: add the FASTA tag ##FASTA
    # 4: add the FASTA sequences by appending the .fna file to the GFF file (a Prokka GFF includes the FASTA, a Bakta GFF does not)
    shell: '''
        grep -Ev '^\# ' {input.gff} | grep -Ev '^\#\#feature\-ontology ' | grep '^\#' > {output}
        grep -Ev '^\# ' {input.gff} | grep -Ev '^\#\#feature\-ontology ' | grep -v '^\#' >> {output}
        echo '##FASTA' >> {output}
        cut -d ' ' -f 1 {input.fna} >> {output}
        '''
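For anyone not using Snakemake, the four steps of the rule can also be sketched in plain Python. This is a rough equivalent of the grep/cut pipeline, not something taken from Panaroo or Bakta; the function names `bakta_gff_to_prokka_like` and `convert_files` are made up for this example. Like the shell version, it assumes the Bakta GFF3 has no '##FASTA' section of its own.

```python
def bakta_gff_to_prokka_like(gff_lines, fna_lines):
    """Return the lines of a Prokka-style GFF built from Bakta GFF3 and FNA lines.

    Hypothetical helper mirroring the shell commands of the rule above.
    """
    lines = [line.rstrip("\n") for line in gff_lines]

    # Drop '# ' comment lines and '##feature-ontology ' lines (greps 1 and 2).
    kept = [
        line for line in lines
        if not line.startswith("# ")
        and not line.startswith("##feature-ontology ")
    ]

    # Step 1: remaining '#' lines go first. Step 2: all other lines follow.
    directives = [line for line in kept if line.startswith("#")]
    others = [line for line in kept if not line.startswith("#")]

    # Step 4: keep only the first space-separated field of each FNA line,
    # which shortens the '>' headers to the sequence ID (like `cut -d ' ' -f 1`).
    fasta = [line.rstrip("\n").split(" ")[0] for line in fna_lines]

    # Step 3: the '##FASTA' tag separates annotations from sequences.
    return directives + others + ["##FASTA"] + fasta


def convert_files(gff_path, fna_path, out_path):
    """Read a Bakta GFF3 and FNA from disk and write the converted GFF."""
    with open(gff_path) as gff, open(fna_path) as fna:
        converted = bakta_gff_to_prokka_like(gff, fna)
    with open(out_path, "w") as out:
        out.write("\n".join(converted) + "\n")
```

Calling `convert_files("bakta/G/G.gff3", "bakta/G/G.fna", "converted/G/G.gff")` (placeholder paths) should produce the same file as the rule does for one genome.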
I hope this helps.
Best regards
Hi,
Thanks very much for this. As an alternative, you could also provide Panaroo with a text file listing the locations of each Bakta GFF and fasta file as outlined here.
It will then use a different approach when loading the Bakta GFFs.
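As a rough illustration, such a list file could be generated with a few lines of Python. The layout assumed here, one line per genome with the GFF path and the FASTA path separated by a space, is an assumption based on the `create_temp_gff3(line[0], line[1], temp_dir)` call visible in a traceback further down this thread; the Panaroo documentation is the authority on the exact format. The directory layout (`<bakta_dir>/<genome>/<genome>.gff3` next to `<genome>.fna`) follows the first post, and `build_input_list` is a made-up name.

```python
from pathlib import Path


def build_input_list(bakta_dir, list_path):
    """Write one 'GFF FASTA' line per genome found under bakta_dir.

    Hypothetical helper; the two-column layout is an assumption about
    what Panaroo expects, not something confirmed in this thread.
    """
    rows = []
    for gff in sorted(Path(bakta_dir).glob("*/*.gff3")):
        fna = gff.with_suffix(".fna")
        # Skip genomes without a matching FASTA file next to the GFF3.
        if fna.exists():
            rows.append(f"{gff} {fna}")
    Path(list_path).write_text("".join(row + "\n" for row in rows))
    return rows
```

Paths containing spaces would break this simple space-separated layout, so it is safest to avoid them.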
I have a text file that lists each Bakta output gbff file, followed on the next line by the genome in FASTA format:
ED002_chromosome.fasta.gbff
ED002_chromosome.fasta
ED003_final_scaffolds.fasta.gbff
ED003_final_scaffolds.fasta
When I run the command
panaroo -i panaroo_input.tsv -o panaroo_output --clean-mode strict -a core --aligner mafft -t 64
I get the following error:
WARNING: The following feature was skipped:
type: ncRNA
location: 113111:113177
qualifiers:
Key: db_xref, Value: ['RFAM:RF00057', 'SO:0000655']
Key: gene, Value: ['ryhB']
Key: inference, Value: ['profile:Rfam:RF00057']
Key: locus_tag, Value: ['GCFIFN_00535']
Key: ncRNA_class, Value: ['other']
Key: product, Value: ['RyhB RNA']
Traceback (most recent call last):
File "/home/nala0006/miniconda3/envs/panaroo/bin/panaroo", line 10, in <module>
sys.exit(main())
File "/home/nala0006/miniconda3/envs/panaroo/lib/python3.10/site-packages/panaroo/__main__.py", line 302, in main
raise RuntimeError(f"Invalid file extension! ({ext})")
RuntimeError: Invalid file extension! (.fasta)
I also tried changing the input text file so that the gbff and the FASTA file are on one line, separated by a space:
ED002_chromosome.fasta.gbff ED002_chromosome.fasta
ED003_final_scaffolds.fasta.gbff ED003_final_scaffolds.fasta
With the input file formatted this way, running the same command gives another invalid file extension error:
Traceback (most recent call last):
File "/home/nala0006/miniconda3/envs/panaroo/bin/panaroo", line 10, in <module>
sys.exit(main())
File "/home/nala0006/miniconda3/envs/panaroo/lib/python3.10/site-packages/panaroo/__main__.py", line 304, in main
files.append(create_temp_gff3(line[0], line[1], temp_dir))
File "/home/nala0006/miniconda3/envs/panaroo/lib/python3.10/site-packages/panaroo/prokka.py", line 97, in create_temp_gff3
raise RuntimeError(f"Invalid file extension! ({ext})")
RuntimeError: Invalid file extension! (.gbff)
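Both errors come from a check on the file extension of the first path on a line. A quick pre-flight check of the list file, sketched below, can catch this before starting a long run. The extension sets are assumptions inferred only from what passed and failed in this thread, not Panaroo's actual lists, and `check_input_list` is a hypothetical name.

```python
import os

# Assumptions inferred from this thread only, not taken from Panaroo's code:
# a line with a single path was read when it ended in .gbff, .gff or .gff3,
# and a two-path line needs a GFF file (not a gbff) as its first path.
SINGLE_OK = {".gff", ".gff3", ".gbff"}
PAIR_FIRST_OK = {".gff", ".gff3"}


def check_input_list(lines):
    """Return (line_number, extension) for each line likely to be rejected."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        fields = raw.split()
        if not fields:
            continue  # ignore blank lines
        # splitext keeps only the last suffix: 'x.fasta.gbff' gives '.gbff'.
        ext = os.path.splitext(fields[0])[1]
        allowed = SINGLE_OK if len(fields) == 1 else PAIR_FIRST_OK
        if ext not in allowed:
            problems.append((number, ext))
    return problems
```

Run on the two list files shown above, this flags the `.fasta` lines of the first layout and the `.gbff` first column of the second, matching the two error messages.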
I figured this out: I used the GFF3 output from Bakta instead of the gbff and ran the command:
panaroo -i *.gff3 -o panaroo -a core --clean-mode strict --core_threshold 0.98 -t 64