gbouras13/pharokka

PHAROKKA introducing dataframe and its dtype into NCBI Feature Table (.tbl) in each Loop.

deinotoxazumab opened this issue · 5 comments

  • pharokka version: 1.5.1
  • Python version: 3.10.8 (pharokkaENV) / 3.10.13 (base)
  • Operating System: Pop!_OS 22.04 LTS

Description

We performed a by-the-book phage WGS annotation using Pharokka, following the instructions on its -h page and Readme.md page. The annotation ran without issues and produced a full set of useful outputs, in particular the NCBI-compliant Feature Table (.tbl), which we needed for submission to NCBI GenBank via BankIt.

The Feature Table (.tbl) output by Pharokka contained additional data that caused an error during the BankIt submission process. The additional data were a dataframe listing each feature index alongside its transl_table value, followed by the dataframe's dtype description. Both were inserted again for every feature entry in the table. The .tbl output snippet in question is as follows:

>Feature contig00001
405 1 CDS
product hypothetical protein
locus_tag PHGE_CDS_0001
transl_table 0      11
1      11
2      11
3      11
4      11
       ..
260    11
261    11
262    11
263    11
264    11
Name: transl_table, Length: 265, dtype: object
1283 465 CDS
product endolysin
locus_tag PHGE_CDS_0002
transl_table 0      11
1      11
2      11
3      11
4      11
       ..
260    11
261    11
262    11
263    11
264    11
Name: transl_table, Length: 265, dtype: object

Thankfully, these additional data could easily be deleted with a simple Find and Replace in Notepad++, and the BankIt submission then proceeded without any issues using the cleaned Feature Table file.
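For anyone who would rather script this cleanup than do it by hand, here is a minimal Python sketch of the same idea. It rests on a few assumptions that are not confirmed anywhere in this thread: the stray rows are padded with spaces (as in a printed pandas Series) while genuine .tbl fields are tab-separated, and every feature uses the same transl_table value, so the value printed for row 0 is kept. The function name `clean_tbl_lines` is made up for illustration and is not part of Pharokka.

```python
import re

# Assumption: stray rows look like "260    11" (index, spaces, value),
# whereas real .tbl location lines separate their fields with tabs.
STRAY_ROW = re.compile(r"^\d+ +\S+$")
# The ".." line printed where the long listing is truncated.
STRAY_ELLIPSIS = re.compile(r"^\s*\.\.\s*$")
# The "Name: transl_table, Length: 265, dtype: object" trailer.
STRAY_TRAILER = re.compile(r"^Name: transl_table, Length: \d+, dtype: \w+$")
# A qualifier line that has the first stray row glued onto it,
# e.g. "transl_table<TAB>0      11".
BROKEN_QUALIFIER = re.compile(r"^(\s*transl_table\s+)\d+ +(\S+)$")


def clean_tbl_lines(lines):
    """Return the lines of a feature table with the stray listing removed."""
    cleaned = []
    for line in lines:
        text = line.rstrip("\n")
        if (STRAY_ROW.match(text)
                or STRAY_ELLIPSIS.match(text)
                or STRAY_TRAILER.match(text)):
            continue  # drop rows that belong to the printed listing
        match = BROKEN_QUALIFIER.match(text)
        if match:
            # Keep only the qualifier name and the value from row 0.
            text = match.group(1) + match.group(2)
        cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    sample = [
        ">Feature contig00001",
        "405\t1\tCDS",
        "\t\t\tproduct\thypothetical protein",
        "\t\t\tlocus_tag\tPHGE_CDS_0001",
        "\t\t\ttransl_table\t0      11",
        "1      11",
        "       ..",
        "264    11",
        "Name: transl_table, Length: 265, dtype: object",
    ]
    print("\n".join(clean_tbl_lines(sample)))
```

If the features in a file do not all share one translation table, this shortcut would write the wrong value for some of them, so it is worth checking the cleaned file before submitting.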

What I Did

We first activated a Mamba environment (pharokkaENV), installed all of Pharokka's necessary databases, and ran the annotation process suited to our needs. The commands used were as follows:

mamba activate pharokkaENV
install_databases.py --default
pharokka.py --infile /home/klaswell/Documents/phage/Shovill_PHGE.fasta --outdir /home/warzone2/Documents/phage/Shovill_PHGE_pharokka --threads 14 --prefix PHGE --locustag PHGE --force --database /home/klaswell/anaconda3/envs/pharokkaENV/databases/

Update: To protect privacy, the names in the paths above were altered; this did not affect the workflow.

acvill commented

I also see this issue when generating GenBank submission files from pharokka output using table2asn:

table2asn -M n -i TIVP-H6_dnaapler_reoriented.fasta -f TIVP-H6.gff -o TIVP-H6.sqn -Z -locus-tag-prefix SFTJVHGW

Not sure if it's an issue with pharokka annotation formatting or table2asn itself.

Hi @acvill & @deinotoxazumab ,

The first is definitely a pharokka bug. I'm aiming to sort out the piled-up issues this week sometime :)

I'm not sure about the second, but I will do some investigation. I do know table2asn can be painful (from Torsten's experience with it in Prokka).

George

@deinotoxazumab ,

I've fixed your error (hopefully!) - it will be available in v1.6. Very simple fix.
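The actual change is not shown in this thread. Purely as an illustration of how this kind of output can arise, the sketch below assumes the feature data live in a pandas DataFrame with a transl_table column, and that the qualifier line was built from the whole column instead of one row's value. The function names `qualifier_from_column` and `qualifier_from_row` are hypothetical and are not taken from Pharokka's source.

```python
import pandas as pd

# Hypothetical stand-in for the per-feature table.
features = pd.DataFrame(
    {
        "locus_tag": ["PHGE_CDS_0001", "PHGE_CDS_0002"],
        "transl_table": ["11", "11"],
    }
)


def qualifier_from_column(df):
    # Converting the whole column to a string yields its printed form:
    # one "index  value" row per feature plus a "Name: transl_table, ..." trailer.
    return "\t\t\ttransl_table\t" + str(df["transl_table"])


def qualifier_from_row(df, i):
    # Selecting a single scalar with .iloc[i] yields just the value.
    return "\t\t\ttransl_table\t" + str(df["transl_table"].iloc[i])


if __name__ == "__main__":
    print(qualifier_from_column(features))
    print(qualifier_from_row(features, 0))
```

Under these assumptions the first function emits a multi-line listing for every feature, which matches the shape of the stray text reported above, while the second emits a single qualifier line.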

@acvill ,

I tried running the latest table2asn (specifically the 26 September 2023 version of mac.table2asn from the NCBI FTP site) with a similar command to yours.

For me it ran fine and seemed to produce a correct .sqn format file.

So maybe it's a table2asn issue on your end.

George

acvill commented

With version 1.6.1, I submitted the .tbl files generated by pharokka to BankIt without issue. Many thanks!

Great to hear, @acvill. I have a bunch to do myself soon :)