gbouras13/pharokka

Error calling tRNAscan-SE

asierFernandezP opened this issue ยท 19 comments

phrokka version: v1.4.1
Python version: Python 3.10.12
Operating System: Computing cluster

Description

I have run prodigal-gv on my viral genomes to get the ORFs (it also outputs a genbank formatted CSV file). Using the predicted ORFs as input I want to run Pharroka to annotate them (without running prodigal or Phannotate again).

What I Did

I run the following command:

/home2/p304845/Conda_envs/Pharokka_conda/bin/pharokka.py \
	-i Genbank_formatted_CSV \
	 --genbank \
	-o $output_dir \
	-d /home2/p304845/Conda_envs/Pharokka_conda/databases \
	-t ${SLURM_CPUS_PER_TASK}

And I got the following error:

2023-09-04 17:06:08.757 | INFO     | __main__:main:84 - Starting Pharokka v1.4.1
2023-09-04 17:06:08.759 | INFO     | __main__:main:85 - Command executed: Namespace(infile='/scratch/p304845/Annotation/Pharokka_annotation/viruses_present_output.csv', outdir='/scratch/p304845/Annotation/Pharokka_annotation/Results', database='/home2/p304845/Conda_envs/Pharokka_conda/databases', threads='24', force=False, prefix='Default', locustag='Default', gene_predictor='phanotate', meta=False, split=False, coding_table='11', evalue='1E-05', fast=False, mmseqs2_only=False, meta_hmm=False, dnaapler=False, custom_hmm='', genbank=True, terminase=False, terminase_strand='nothing', terminase_start='nothing', citation=False)
2023-09-04 17:06:08.759 | INFO     | __main__:main:86 - Repository homepage is https://github.com/gbouras13/pharokka
2023-09-04 17:06:08.759 | INFO     | __main__:main:87 - Written by George Bouras: george.bouras@adelaide.edu.au
2023-09-04 17:06:08.759 | INFO     | __main__:main:89 - Checking database installation in /home2/p304845/Conda_envs/Pharokka_conda/databases.
2023-09-04 17:06:08.767 | INFO     | __main__:main:92 - All databases have been successfully checked.
2023-09-04 17:06:08.767 | INFO     | __main__:main:108 - Checking dependencies.
2023-09-04 17:06:09.540 | INFO     | input_commands:check_dependencies:322 - Phanotate version found is v1.5.1
2023-09-04 17:06:09.541 | INFO     | input_commands:check_dependencies:331 - Phanotate version is ok.
2023-09-04 17:06:09.695 | INFO     | input_commands:check_dependencies:353 - MMseqs2 version found is v13.45111
2023-09-04 17:06:09.695 | INFO     | input_commands:check_dependencies:362 - MMseqs2 version is ok.
2023-09-04 17:06:10.410 | INFO     | input_commands:check_dependencies:386 - tRNAscan-SE version found is v2.0.12
2023-09-04 17:06:10.411 | INFO     | input_commands:check_dependencies:397 - tRNAscan-SE version is ok.
2023-09-04 17:06:11.627 | INFO     | input_commands:check_dependencies:421 - MinCED version found is v0.4.2
2023-09-04 17:06:11.628 | INFO     | input_commands:check_dependencies:432 - MinCED version is ok.
2023-09-04 17:06:11.675 | INFO     | input_commands:check_dependencies:458 - ARAGORN version found is v1.2.41
2023-09-04 17:06:11.676 | INFO     | input_commands:check_dependencies:469 - ARAGORN version is ok.
2023-09-04 17:06:12.063 | INFO     | input_commands:check_dependencies:485 - mash version found is v2.3
2023-09-04 17:06:12.064 | INFO     | input_commands:check_dependencies:492 - mash version is ok.
2023-09-04 17:06:12.843 | INFO     | input_commands:check_dependencies:511 - Dnaapler version found is v0.3.1
2023-09-04 17:06:12.843 | INFO     | input_commands:check_dependencies:518 - Dnaapler version is ok.
2023-09-04 17:06:12.843 | INFO     | input_commands:check_dependencies:529 - Pyrodigal version is v2.3.0
2023-09-04 17:06:12.843 | INFO     | input_commands:check_dependencies:530 - Pyrodigal version is ok.
2023-09-04 17:06:12.844 | INFO     | __main__:main:113 - You have specified --genbank.
2023-09-04 17:06:12.844 | INFO     | __main__:main:114 - Therefore, /scratch/p304845/Annotation/Pharokka_annotation/viruses_present_output.csv is a genbank file instead of a FASTA file.
2023-09-04 17:06:12.844 | INFO     | __main__:main:117 - Your custom CDS calls in this genbank file will be preserved.
2023-09-04 17:06:14.652 | INFO     | __main__:main:280 - Extracting CDS information from your genbank file.
2023-09-04 17:06:17.836 | INFO     | __main__:main:291 - Starting tRNA-scanSE.
2023-09-04 17:06:17.840 | INFO     | external_tools:run:50 - Started running tRNAscan-SE --thread 24 -G -Q -j /scratch/p304845/Annotation/Pharokka_annotation/Results/trnascan_out.gff /scratch/p304845/Annotation/Pharokka_annotation/Results/genbank.fasta ...
2023-09-04 17:06:18.037 | ERROR    | external_tools:run_tool:94 - Error calling tRNAscan-SE --thread 24 -G -Q -j /scratch/p304845/Annotation/Pharokka_annotation/Results/trnascan_out.gff /scratch/p304845/Annotation/Pharokka_annotation/Results/genbank.fasta (return code 2)
2023-09-04 17:06:18.037 | ERROR    | processes:run_trna_scan:532 - Error: tRNAscan-SE not found

Thanks in advance!

Hi @asierFernandezP,

Thanks for trying out --genbank mate, glad someone found it useful. The intended function was for manually curated CDS calls, not prodigal-gv, but I should have guessed this would come up too.

First of all I clearly need to change the error messages from '__ not found' to '__ failed' I think so thanks for that.

I have had a crack at replicating this and I think the issue is that the GenBank output format of prodigal-gv isn't fully specified. In other words, it does not include the CDS translations or the FASTA file, so I can't actually get any sequence information from it (first step pharokka does with --genbank is to save all amino acid CDS under translation and to output the input nucleotide contigs reconstructed from the GenBank as a fasta).

Therefore I have 2 options to fix this:

  1. Work with @althonos to get prodigal-gv's models added to pyrodigal althonos/pyrodigal#24. But this seems like a lot of work to do when prodigal-gv already exists, plus it feels like more of a fork-related feature I think than a core pyrodigal feature (it's up to Martin anyway after all).
  2. Implement prodigal-gv support into pharokka as a gene prediction option (-g), with some possible help from @apcamargo.

I am aware that the Genbank format is a nightmare, so I don't wish to spend a lot of time down the GenBank rabbit hole, but I do think prodigal-gv would be a pretty common option for users so I am happy to spend some time to include it (indeed I have some genomad output I would like pharokka to smash through and annotate myself soon).

I will need to change the gff to gbk conversion function of pharokka regardless of what I do to deal with cases where different contigs have different translation codes, so I think 2. would be the most preferable option.

George

Hi George,

Thank you so much for your quick answer!! Indeed, I think the 2nd option would be ideal and more convenient. As you mentioned, the problem is that prodigal-gv outputs a Genbank formatted CSV file without the CDS sequences.

Let me know if it would be possible for you to implement it or if any further information/help from my side is needed! And thanks again for the great tool :))

No problems @asierFernandezP!

I will aim to implement this when I get some time hopefully in the next week or 2 - and this will make pharokka v1.5.0.

I'll let you know regarding help - I might get you to do some testing before I release it. But otherwise I think I know what I need to implement to make this work.

George

Work with @althonos to get prodigal-gv's models added to pyrodigal althonos/pyrodigal#24. But this seems like a lot of work to do when prodigal-gv already exists, plus it feels like more of a fork-related feature I think than a core pyrodigal feature (it's up to Martin anyway after all).

I don't think I will integrate it to "stock" Pyrodigal, but i can probably figure out a way to get that into a different dedicated package that has the prodigal-gv models, especially if you'd find it useful right away ๐Ÿ‘

I think it would be very useful quite soon for lots of users in the gut metagenome/giant virus etc space.

Also more generally speaking as sort of mentioned in althonos/pyrodigal#24, not sure what you have in mind but there's some scope to build upon your methods for alternative coding prediction and within-genome stop codon reassignment (see e.g. https://github.com/gatech-genemark/Mgcod ).

Happy to help out and contribute whatever I can to help make this happen (and I'm sure @asierFernandezP and @apcamargo would be keen too by the looks).

George

@gbouras13 : pip install -U --pre pyrodigal-gv if you wanna test ๐Ÿ˜‰

Also more generally speaking as sort of mentioned in althonos/pyrodigal#24, not sure what you have in mind but there's some scope to build upon your methods for alternative coding prediction and within-genome stop codon reassignment (see e.g. https://github.com/gatech-genemark/Mgcod ).

On that note, I think within-genome stop codon reassignment is probably out of scope for Prodigal, because the dynamic programming needs to have the codons from the entire genome to start finding the highest scoring gene path. However, alternative coding prediction could be feasible, similarly to the meta mode, trying to nodes with several different genetic codes before extracting genes for the highest scoring one.

@althonos I've given it a test run (with the pyrodigal v3 alpha installed from source), it looks phenomenal mate.

The only thing lacking I can think that would be useful (in terms of recapitulating the pyrodigal-gv output and generally) is 'genetic_code' in the description (it's between rbs_spacer and gc_cont).

prodigal-gv v2.11 output


MZ130495.1      Prodigal_v2.11.0-gv     CDS     98636   101128  474.7   +       0       ID=1_100;partial=01;start_type=ATG;rbs_motif=TAA;rbs_spacer=12bp;genetic_code=15;gc_cont=0.361;conf=99.99;score=474.69;cscore=467.02;sscore=7.67;rscore=5.99;uscore=1.68;tscore=0.00;

pyrodigal-gv output


MZ130495.1	pyrodigal_v3.0.0-alpha1	CDS	98636	101128	474.7	+	0	ID=MZ130495.1_100;partial=01;start_type=ATG;rbs_motif=TAA;rbs_spacer=12bp;gc_cont=0.361;conf=99.99;score=474.69;cscore=467.02;sscore=7.67;rscore=5.99;uscore=1.68;tscore=0.00;

the code:

def run_pyrodigal_gv(filepath_in, out_dir, coding_table):

    orf_finder = pyrodigal_gv.ViralGeneFinder(meta=True)

    with open(os.path.join(out_dir, "prodigal_out.gff"), "w") as dst:
        for i, record in enumerate(SeqIO.parse(filepath_in, "fasta")):
            genes = orf_finder.find_genes(str(record.seq))
            genes.write_gff(dst, sequence_id=record.id)

George

@gbouras13 : I have made a new Pyrodigal pre-release that supports that (pip install -U --pre pyrodigal to get 3.0.0-alpha2), this way you get a new include_translation_table keyword argument among other things that outputs the translation table. (I'm using translation_table to remain consistent with Prodigal).

MZ130495.1	pyrodigal_v3.0.0-alpha2	CDS	98636	101128	474.7	+	0	ID=MZ130495.1_100;partial=01;start_type=ATG;rbs_motif=TAA;rbs_spacer=12bp;gc_cont=0.361;transl_table=15;conf=99.99;score=474.69;cscore=467.02;sscore=7.67;rscore=5.99;uscore=1.68;tscore=0.00;

Beautiful, outputs look great to me.

More than happy to do any testing that you would find useful Martin.

George

You're most likely the main user of this with @apcamargo at the moment so I'll keep on alpha for a bit while you try it around ๐Ÿ‘

I'm still testing it out. But no problems so far!

I've tested it out today on a number of single phages + a 673 contig gut phage test set (from Yutin et al), around 50MB fasta file.

The results look great, I get exactly what is expected and the output is easily parsable for me.

With pyrodigal v3.0.0-alpha2 & pyrodigal v0.1.0a1 implemented in Pharokka v1.5.0 (dev branch):

Running on Intel Core i7-10700K CPU @ 3.80GHz on a machine running Ubuntu 20.04.6 LTS, it took 305 seconds to run the pyrodigal-gv step (I didn't do proper measurements with time, this is just coming from the Pharokka output log).

This compares to 184 seconds for regular pyrodigal on the same dataset.

For single phage contigs runtime is negligible for both (sub 1 second).

Overall, I am happy with the output for my purposes, when you release pyrodigal v3 and pyrodigal_gv as a stable release, I will soon after release Pharokka v1.5.0 with support for pyrodigal-gv.

Thanks @althonos and @apcamargo amazing stuff!

George

Nice to hear! ๐Ÿ™ I also don't know if you've seen but there's an option to run meta mode only using viral models, in case you're looking for speedup ๐Ÿ‘ Maybe something to expose as a flag?

Thanks @gbouras13! I don't remember having a slowdown this dramatic when comparing Prodigal to prodigal-gv. Can you check if execution time decreases when you use the viral_only argument? I wouldn't recommend it for production because it won't cover all phages.

Hey @apcamargo and @althonos ,

This indeed seems to explain the difference - it took 178 seconds with viral_only=True.

For production I will follow your advice so I will leave it as was.

Either way, it is more than quick enough for Pharokka's purposes, a few minutes to do CDS prediction on nearly a thousand contigs is pretty efficient.

George

Thank you guys @althonos @gbouras13 @apcamargo! That's called efficiency! Just tested it on a set of ~2000 metagenomic viral contigs and no problems observed!

@althonos @apcamargo I'm releasing v1.5.0 today. I've added a citation to the citations section for pyrodigal-gv as this:

Larradle M. and Camargo A., (2023) Pyrodigal-gv: A Pyrodigal extension to predict genes in giant viruses and viruses with alternative genetic code. https://github.com/althonos/pyrodigal-gv.

If you want to change it just let me know.

George

Just spell my name right and we're good ๐Ÿ˜„

Easiest commit I have made, sorry about that @althonos but fixed now :)