PHAROKKA introducing dataframe and its dtype into NCBI Feature Table (.tbl) in each Loop.
deinotoxazumab opened this issue · 5 comments
- pharokka version: 1.5.1
- Python version: 3.10.8 (pharokkaENV) / 3.10.13 (base)
- Operating System: Pop!_OS 22.04 LTS
Description
We performed a by-the-book phage WGS annotation using Pharokka, following the instructions on its -h page and Readme.md page. The annotation ran without issues and produced a full set of useful outputs, notably the NCBI-compliant Feature Table (.tbl), which we needed for submission to NCBI GenBank via BankIt.
The Feature Table (.tbl) output by Pharokka contained additional data that caused an error during the BankIt submission process. The additional data were a dataframe of each feature number and its corresponding translation table number (transl_table), together with the dataframe's dtype description. Both were inserted for every feature entry in the table. The .tbl output snippet in question is as follows:
>Feature contig00001
405 1 CDS
product hypothetical protein
locus_tag PHGE_CDS_0001
transl_table 0 11
1 11
2 11
3 11
4 11
..
260 11
261 11
262 11
263 11
264 11
Name: transl_table, Length: 265, dtype: object
1283 465 CDS
product endolysin
locus_tag PHGE_CDS_0002
transl_table 0 11
1 11
2 11
3 11
4 11
..
260 11
261 11
262 11
263 11
264 11
Name: transl_table, Length: 265, dtype: object
Thankfully, these additional data could easily be deleted with a simple Find and Replace in Notepad++, and the BankIt submission then proceeded without any issues using the cleaned Feature Table file.
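For anyone who would rather script the cleanup than edit by hand, here is a minimal sketch in Python. It assumes the stray block always starts on the transl_table qualifier line (with the row index 0 in front of the value) and ends at the "Name: transl_table, ... dtype: ..." line, as in the snippet above, and that every feature uses the same translation table, so the value from row 0 is kept. The function names are illustrative and not part of Pharokka.

```python
import re

# Qualifier line that starts a dump, e.g. "\t\t\ttransl_table\t0    11".
DUMP_START = re.compile(r"^(\s*)transl_table\s+0\s+(\S+)\s*$")
# Summary line that ends a dump.
DUMP_END = re.compile(r"^\s*Name: transl_table, Length: \d+, dtype: \w+\s*$")


def clean_tbl_lines(lines):
    """Return feature-table lines with the stray dataframe output removed."""
    cleaned = []
    in_dump = False
    for line in lines:
        line = line.rstrip("\n")
        if in_dump:
            # Drop every line of the dump, including its closing summary line.
            if DUMP_END.match(line):
                in_dump = False
            continue
        start = DUMP_START.match(line)
        if start:
            # Keep the qualifier with the row-0 value (assumed correct for all features).
            indent, value = start.groups()
            cleaned.append(f"{indent}transl_table\t{value}")
            in_dump = True
        else:
            cleaned.append(line)
    return cleaned


def clean_tbl_file(src_path, dst_path):
    """Read a .tbl file, strip the dumps, and write the cleaned copy."""
    with open(src_path) as src:
        cleaned = clean_tbl_lines(src)
    with open(dst_path, "w") as dst:
        dst.write("\n".join(cleaned) + "\n")
```

It could be called as clean_tbl_file("PHGE.tbl", "PHGE.cleaned.tbl"), where both file names are hypothetical. If a dump has no closing "Name:" line, everything after it is dropped, so the cleaned file is worth checking before submission.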
What I Did
We first activated the pharokkaENV Mamba environment, installed all of Pharokka's required databases, and ran the annotation with the options suited to our needs. The commands used were as follows:
mamba activate pharokkaENV
install_databases.py --default
pharokka.py --infile /home/klaswell/Documents/phage/Shovill_PHGE.fasta --outdir /home/warzone2/Documents/phage/Shovill_PHGE_pharokka --threads 14 --prefix PHGE --locustag PHGE --force --database /home/klaswell/anaconda3/envs/pharokkaENV/databases/
Update: To protect privacy, the names in the paths were changed; this did not affect the workflow.
I also see this issue when generating GenBank submission files from pharokka output using table2asn:
table2asn -M n -i TIVP-H6_dnaapler_reoriented.fasta -f TIVP-H6.gff -o TIVP-H6.sqn -Z -locus-tag-prefix SFTJVHGW
Not sure if it's an issue with pharokka annotation formatting or table2asn itself.
Hi @acvill & @deinotoxazumab ,
The first is definitely a pharokka bug. I'm aiming to sort out the piled-up issues this week sometime :)
I'm not sure about the second, but I will do some investigation. I do know table2asn can be painful (from Torsten's experience with it in Prokka).
George
I've fixed your error (hopefully!) - it will be available in v1.6. Very simple fix.
@acvill ,
I tried running the latest table2asn (specifically mac.table2asn from the NCBI FTP site, 26 September 2023 version) with a similar command to yours.
For me it ran fine and seemed to produce a correct .sqn format file.
So maybe it's a table2asn issue on your end.
George
With version 1.6.1, I submitted the .tbl files generated by pharokka to BankIt without issue. Many thanks!