enormandeau/gawn

Update annotation gff3 file with gene names

Silvia-lme opened this issue · 9 comments

Hi Eric,
I am wondering whether there is a script to update an annotated .gff3 file with gene names. In my gff3 file, all genes are named ORF, for example "ID=Contig10.path1;Name=ORF". Is it possible to use the transcriptome annotation table or the genome annotation table to change ORF to the gene name from one of these tables? Thanks. Silvia

Hi Silvia. You would like column 3 of the transcriptome annotation table instead of the transcript IDs?

Ideally, I would like to keep the transcript name as the ID, but change the Name from the transcript name to the "Fullname" (yes, the 3rd column of the transcriptome table).

For example, from:
ID=Contig19.path1;Name=Contig19;Dir=indeterminate
to:
ID=Contig19.path1;Name=1-phosphatidylinositol 3-phosphate 5-kinase

Thank you.

GAWN now has a new version, v0.3.6, that features a script to do as you asked.

Use it like this (potentially modifying the names of the files):

./01_scripts/util/rename_genes_in_gff.py 05_results/genome.gff3 05_results/transcriptome_annotation_table.tsv 05_results/genome.gff3.renamed
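
For readers who want to see the idea behind such a renaming, here is a minimal sketch. It is not the code of `rename_genes_in_gff.py`; it assumes that the annotation table is tab-separated with the transcript name in column 1 and "Fullname" in column 3, and that gene IDs in the GFF3 file are the transcript name followed by `.pathN`. The helper names (`load_fullnames`, `escape`, `rename_line`) are hypothetical.

```python
import re


def load_fullnames(table_lines):
    """Map transcript name (column 1) to Fullname (column 3).

    Assumes tab-separated rows; rows without a Fullname are skipped.
    """
    names = {}
    for line in table_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2].strip():
            names[fields[0]] = fields[2].strip()
    return names


def escape(value):
    """Percent-encode characters that are reserved in GFF3 column 9."""
    for char, code in (("%", "%25"), (";", "%3B"), ("=", "%3D"),
                       ("&", "%26"), (",", "%2C")):
        value = value.replace(char, code)
    return value


def rename_line(gff_line, names):
    """Replace the Name= value of one GFF3 feature line; ID= is kept."""
    columns = gff_line.split("\t")
    if gff_line.startswith("#") or len(columns) != 9:
        return gff_line  # comment, directive, or malformed line
    found = re.search(r"(?:^|;)ID=([^;]+)", columns[8])
    if found is None:
        return gff_line
    # Assumption: IDs look like "Contig19.path1"; drop the ".pathN" suffix.
    transcript = re.sub(r"\.path\d+$", "", found.group(1))
    fullname = names.get(transcript)
    if fullname is None:
        return gff_line  # no annotation for this transcript
    columns[8] = re.sub(
        r"(^|;)Name=[^;]*",
        lambda m: m.group(1) + "Name=" + escape(fullname),
        columns[8],
        count=1,
    )
    return "\t".join(columns)
```

In this sketch, lines without a matching table entry, and features whose ID does not end in `.pathN`, are returned unchanged, and any attributes other than `Name` are left as they are.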

Thank you so much, Eric. I appreciate your help. I am going to run this and I will let you know. Thank you. Silvia

Hi Eric,
This script doesn't work for my data. Could you please modify it for me? Thanks.

My "transcriptome_annotation_table.tsv" looks like:
Name Accession Fullname Altnames Pfam GO CellularComponent Molecular Function Biological Process
Contig1
Contig2
Contig3
Contig4 Q63406 Guanine nucleotide exchange factor DBS DBL's big sister;MCF2-transforming sequence-like protein;OST oncogene {ECO:0000303|PubMed:7957046}; PF13716;PF00169;PF00621;PF00435; GO:0005737;GO:0019898;GO:0005886;GO:0008289;GO:0005089;GO:0035556;GO:0035025; C:cytoplasm; C:extrinsic component of membrane; C:plasma membrane; F:lipid binding; F:Rho guanyl-nucleotide exchange factor activity; P:intracellular signal transduction; P:positive regulation of Rho protein signal transduction;
Contig5 Q9Z1T6 E9QL40 Q3TNE4 Q3UTT6 Q69ZU1 Q9CU94 1-phosphatidylinositol 3-phosphate 5-kinase FYVE finger-containing phosphoinositide kinase;PIKfyve;Phosphatidylinositol 3-phosphate 5-kinase type III;p235; PF00118;PF00610;PF01363;PF01504; GO:0005911;GO:0031410;GO:0030659;GO:0005829;GO:0031901;GO:0010008;GO:0000139;GO:0031902;GO:0045121;GO:0048471;GO:0012506;GO:0000285;GO:0016308;GO:0052810;GO:0005524;GO:0008270;GO:0035556;GO:0032288;GO:0034504;GO:2000785;GO:0042147; C:cell-cell junction; C:cytoplasmic vesicle; C:cytoplasmic vesicle membrane; C:cytosol; C:early endosome membrane; C:endosome membrane; C:Golgi membrane; C:late endosome membrane; C:membrane raft; C:perinuclear region of cytoplasm; C:vesicle membrane; F:1-phosphatidylinositol-3-phosphate 5-kinase activity; F:1-phosphatidylinositol-4-phosphate 5-kinase activity; F:1-phosphatidylinositol-5-kinase activity; F:ATP binding; F:zinc ion binding; P:intracellular signal transduction; P:myelin assembly; P:protein localization to nucleus; P:regulation of autophagosome assembly; P:retrograde transport, endosome to Golgi;
Contig6
Contig7
Contig8 P97443 P97442 P97444 Q6DFW7 Histone-lysine N-methyltransferase Smyd1 CD8b-opposite;SET and MYND domain-containing protein 1;Zinc finger protein BOP; PF00856;PF01753; GO:0005737;GO:0005634;GO:0003677;GO:0018024;GO:0046872;GO:0003714;GO:0006338;GO:0007507;GO:0045892;GO:0045663;GO:0010831;GO:0035914;GO:0006351; C:cytoplasm; C:nucleus; F:DNA binding; F:histone-lysine N-methyltransferase activity; F:metal ion binding; F:transcription corepressor activity; P:chromatin remodeling; P:heart development; P:negative regulation of transcription, DNA-templated; P:positive regulation of myoblast differentiation; P:positive regulation of myotube differentiation; P:skeletal muscle cell differentiation; P:transcription, DNA-templated;
Contig9 Q8BW74 Q6PF83 Hepatic leukemia factor PF07716; GO:0005634;GO:0000977;GO:0043565;GO:0001077;GO:0001228;GO:0045944;GO:0048511;GO:0035914; C:nucleus; F:RNA polymerase II regulatory region sequence-specific DNA binding; F:sequence-specific DNA binding; F:transcriptional activator activity, RNA polymerase II proximal promoter sequence-specific DNA binding; F:transcriptional activator activity, RNA polymerase II transcription regulatory region sequence-specific DNA binding; P:positive regulation of transcription by RNA polymerase II; P:rhythmic process; P:skeletal muscle cell differentiation;

My genome.gff3 file
##gff-version 3

# Generated by GMAP version 2023-02-17 using call: gmap.avx2 -t 19 --dir 03_data -d indexed_genome -f gff3_gene --gff3-add-separators=0

Scaffold_3734;HRSCAF=3925 transdecoder gene 91566599 91592260 . - . ID=Contig10.path1;Name=ORF
Scaffold_4304;HRSCAF=4627 transdecoder gene 30083122 30085636 . - . ID=Contig100.path1;Name=ORF
Scaffold_2408;HRSCAF=2535 transdecoder gene 29787930 29792208 . - . ID=Contig10000.path1;Name=ORF
Scaffold_176;HRSCAF=195 transdecoder gene 46776923 46777652 . + . ID=Contig10001.path1;Name=ORF
Scaffold_4305;HRSCAF=4628 transdecoder gene 2505741 2508969 . + . ID=Contig10004.path1;Name=ORF
Scaffold_3619;HRSCAF=3803 transdecoder gene 5664453 5664928 . + . ID=Contig10005.path1;Name=ORF
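
One way to narrow down a mismatch like this is to count how many gene IDs in the GFF3 file have a counterpart in the first column of the annotation table. The following is a minimal sketch, assuming both files are tab-separated and that gene IDs carry a `.pathN` suffix as in the excerpt above; `match_report` is a hypothetical helper, not part of GAWN.

```python
import re


def match_report(gff_lines, table_names):
    """Count gene features whose ID maps to a name in the table.

    Returns (number_matched, list_of_unmatched_IDs).
    """
    matched = 0
    unmatched = []
    for line in gff_lines:
        columns = line.rstrip("\n").split("\t")
        # Skip comments, directives, malformed lines, and non-gene features.
        if line.startswith("#") or len(columns) != 9 or columns[2] != "gene":
            continue
        found = re.search(r"(?:^|;)ID=([^;]+)", columns[8])
        if found is None:
            continue
        # Assumption: "Contig10.path1" refers to transcript "Contig10".
        transcript = re.sub(r"\.path\d+$", "", found.group(1))
        if transcript in table_names:
            matched += 1
        else:
            unmatched.append(found.group(1))
    return matched, unmatched
```

If nearly every ID ends up in the unmatched list, the naming scheme of the two files differs; if most IDs match, the problem more likely lies in how the `Name=` attribute is written.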

If you could share the actual files with me at eric.normandeau.qc@gmail.com, I could look at them.

I don't think the gff3 file you gave me was created with GAWN. While I am happy to help support interesting uses of the output files of GAWN, I do not have enough time to support scripts to work on different file formats coming from other programs.

I hope this is understandable.

Take care

Hi Eric, yes it was. Both of those files are direct output from your GAWN pipeline. The only thing I changed in the latest GAWN version, 0.3.6, was to activate (by removing the "#") “Add UTR-3 and UTR-5 regions -> GFF3 result file”, so your script “03_add_utrs.sh” was used.