enormandeau/gawn

Producing proteome and transcriptome fasta files

Closed this issue · 4 comments

Hello,
First of all, I'd like to say thanks for developing GAWN - I find it very useful and easy to use.
I wanted to ask for advice and/or suggest a new feature.
It would be nice if, in addition to the GFF output, GAWN also produced fasta files containing the transcript and protein sequences of the predicted genes (similar to MAKER's output).
It seems like gffread can produce transcript sequences, which can then be used to produce protein sequences (maybe with TransDecoder). So maybe you can incorporate this logic into the pipeline or perhaps do something smarter.
Would you say that the strategy I suggested is valid for obtaining transcripts and proteins?
Thanks!

It would be easy to extract transcripts from the genome using the information found in the GFF file, but there would be no guarantee that each gene is represented only once, that the genes are complete, or that they are correct. For example, you could have 6 different copies of the same gene, possibly with different isoforms, because that gene had 6 copies in the transcript set used for the annotation. You could also have genes that are missing one or more exons or that contain bad exons. Lastly, you could have genes in the GFF that have a few exons in one region of a chromosome and then one or more exons really far away on the same scaffold. These could be caused by misassembly or by spurious alignments from gmap.

Knowing all these potential pitfalls, and knowing that I am not going to implement something fancy, do you think it would be useful to have transcript sequences extracted from the genome based on the information found in the GFF file?

If yes, I could add a Python script to GAWN to produce these sequences.
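
One of the pitfalls above, exons of the same gene lying very far apart on one scaffold, can be checked directly from the GFF. The following is only a minimal sketch, not part of GAWN: it assumes tab-separated GFF3 lines in which each `exon` feature carries a `Parent=` attribute naming its transcript, and the function names and the `max_gap` threshold are hypothetical choices for illustration.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn a GFF3 attribute string such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def exons_by_transcript(gff_lines):
    """Group exon coordinates (seqid, start, end) by their Parent attribute."""
    exons = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        # Assumption: exons are linked to their transcript through 'Parent'.
        parent = parse_attributes(cols[8]).get("Parent")
        if parent is None:
            continue
        exons[parent].append((cols[0], int(cols[3]), int(cols[4])))
    return exons


def flag_large_gaps(gff_lines, max_gap=100_000):
    """Return {transcript: largest gap} for transcripts whose consecutive
    exons on the same sequence are separated by more than max_gap bases."""
    flagged = {}
    for transcript, exons in exons_by_transcript(gff_lines).items():
        exons.sort(key=lambda e: (e[0], e[1]))
        largest = 0
        for (seq_a, _, end_a), (seq_b, start_b, _) in zip(exons, exons[1:]):
            if seq_a == seq_b:
                largest = max(largest, start_b - end_a - 1)
        if largest > max_gap:
            flagged[transcript] = largest
    return flagged
```

A suitable threshold depends on the organism's typical intron lengths, so the default here is arbitrary. The check says nothing about duplicated or incomplete genes, which would need separate filtering.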

Thanks for the reply. I ended up writing my own scripts to "clean up" GAWN results in order to avoid issues such as the ones you mentioned. I also wrote a simple script that extracts transcript ("cDNA" - based on exon features) and protein (based on CDS features) sequences. As you said - nothing fancy. I can share them if you think this is useful.
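
For readers following along, the extraction described above (cDNA from `exon` features, protein from `CDS` features) might look roughly like the sketch below. This is not the script mentioned in this comment, only an illustration under several assumptions: features are grouped by a `Parent=` attribute, all segments of a transcript lie on one sequence and one strand, the CDS starts in phase 0 (the phase column is ignored), the standard genetic code applies, and the genome is already loaded as a dict of sequence name to string. All function names are hypothetical.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(cds):
    """Translate a coding sequence; unknown codons become 'X' and any
    trailing partial codon is dropped."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def spliced_sequences(gff_lines, genome, feature_type):
    """Concatenate all features of one type ('exon' or 'CDS') per Parent,
    in coordinate order, and reverse-complement minus-strand transcripts."""
    parts = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
        parent = attrs.get("Parent")
        if parent is None or cols[0] not in genome:
            continue
        parts[parent].append((int(cols[3]), int(cols[4]), cols[6], cols[0]))

    sequences = {}
    for parent, segments in parts.items():
        segments.sort()
        # Assumption: every segment shares the first segment's strand and sequence.
        strand, seqid = segments[0][2], segments[0][3]
        # GFF coordinates are 1-based and inclusive.
        seq = "".join(genome[seqid][start - 1:end] for start, end, _, _ in segments)
        sequences[parent] = reverse_complement(seq) if strand == "-" else seq
    return sequences
```

Calling `spliced_sequences(lines, genome, "exon")` gives the cDNA sequences, and passing each result of `spliced_sequences(lines, genome, "CDS")` through `translate` gives the proteins. Because the GFF is not cleaned first, the output inherits every problem discussed earlier in this thread.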

I would love to include your scripts, potentially in modified form, in GAWN, at least in 01_scripts/util. You can fork GAWN, add the scripts, commit and push your changes, and then open a pull request to have them added. Another option is to send me the files directly.

I'm marking this as closed but please do not hesitate to share your solution so it can be added as a utility script to GAWN.