how should we generate the gene to transcript map?
ctb opened this issue · 6 comments
@ljcohen has some code, maybe... :)
import pandas as pd
from dammit.fileio.gff3 import GFF3Parser
gff_file = "trinity.nema.fasta.dammit.gff3"
annotations = GFF3Parser(filename=gff_file).read()
names = annotations.sort_values(by=['seqid', 'score'], ascending=True).query('score < 1e-05').drop_duplicates(subset='seqid')[['seqid', 'Name']]
new_file = names.dropna(axis=0,how='all')
new_file.head()
new_file.to_csv("nema_gene_name_id.csv")
Now that dammit has our --no-rename
, flag implemented, might be worth using that to stop dammit from renaming trinity transcripts, so that we maintain trinity "gene" and "transcript" designations.
Code above still required if we want to instead build gene-transcript relationships based on annotations.
Not sure - I'm actually installing dammit from master
https://github.com/dib-lab/eelpond/blob/master/rules/dammit/dammit-env.yaml
(so the feature is implemented). Camille fixed a downstream issue with my implementation in ~ dec, so I think it should be fully working now.
talking with @mandyfrazier wanting to revive discussion about incorporating a script to make the tx2gene file from dammit gff3 output. This requires making decisions about which annotation to pick per contig, e.g. sorting by e-value, length of annotation, choosing custom database annotations. Word cloud? (as suggested by @ctb)
Example below, one transcript which should be annotated COX5B. How to programmatically choose that?
Transcript_100014 HMMER protein_hmm_match 76 273 3.300000e-17 . . ID=homology:234985;Name=COX5B;Target=COX5B 19 86 +;Note=Cytochrome c oxidase subunit Vb;accuracy=0.86;env_coords=28 285;Dbxref="Pfam:PF01215.15"
Transcript_100014 LAST translated_nucleotide_match 92 367 3.200000e-60 - . ID=homology:390878;Name=ENSXMAP00000000371;Target=ENSXMAP00000000371 0 90 +;database=OrthoDB
Transcript_100014 LAST translated_nucleotide_match 92 367 5.700000e-36 - . ID=homology:511790;Name=sp|P10606|COX5B_HUMAN;Target=sp|P10606|COX5B_HUMAN 0 93 +;database=sprot
Transcript_100014 shmlast.LAST conditional_reciprocal_best_LAST 183 224 2.000000e-19 + . ID=homology:748371;Name=Funhe2EKm029180t1 aalen=128,75%,complete; clen=513; offs=34-420; oid=Funhe2Exx11m009875t2,Funhe2E6bm045748t1; organism=Fundulus_heteroclitus; type=protein; isoform=1; Name=Cytochrome c oxidase subunit 5B, mitochondrial (100%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,;;Target=Funhe2EKm029180t1 aalen=128,75%,complete; clen=513; offs=34-420; oid=Funhe2Exx11m009875t2,Funhe2E6bm045748t1; organism=Fundulus_heteroclitus; type=protein; isoform=1; Name=Cytochrome c oxidase subunit 5B, mitochondrial (100%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,; 86 128 +;database=kfish2rae5g.pub.aa
Transcript_100014 shmlast.LAST conditional_reciprocal_best_LAST 232 297 2.100000e-36 + . ID=homology:748366;Name=Funhe2EKm029180t6 aalen=116,66%,complete; clen=530; offs=138-488; oid=Funhe2Exx11m009875t6,Funhe2E6bm045748t2; organism=Fundulus_heteroclitus; type=protein; isoform=6; Name=Uncharacterized protein;;Target=Funhe2EKm029180t6 aalen=116,66%,complete; clen=530; offs=138-488; oid=Funhe2Exx11m009875t6,Funhe2E6bm045748t2; organism=Fundulus_heteroclitus; type=protein; isoform=6; Name=Uncharacterized protein; 45 111 +;database=kfish2rae5g.pub.aa
Transcript_100014 shmlast.LAST conditional_reciprocal_best_LAST 24 118 7.500000e-56 + . ID=homology:748378;Name=Funhe2EKm029180t4 aalen=102,99%,partial; clen=309; offs=2-307; oid=Funhe2Exx11m009875t4,Fungr1EG3m027840t2; organism=Fundulus_heteroclitus; type=protein; isoform=4; Name=Cytochrome c oxidase subunit 5B, mitochondrial (73%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,;;Target=Funhe2EKm029180t4 aalen=102,99%,partial; clen=309; offs=2-307; oid=Funhe2Exx11m009875t4,Fungr1EG3m027840t2; organism=Fundulus_heteroclitus; type=protein; isoform=4; Name=Cytochrome c oxidase subunit 5B, mitochondrial (73%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,; 3 98 +;database=kfish2rae5g.pub.aa
Transcript_100014 shmlast.LAST conditional_reciprocal_best_LAST 24 122 1.000000e-60 + . ID=homology:748362;Name=Funhe2EKm029180t3 aalen=131,99%,partial; clen=395; offs=3-395; oid=Funhe2Exx11m009875t3,Fungr1EG3m027840t1; organism=Fundulus_heteroclitus; type=protein; isoform=3; Name=Cytochrome c oxidase subunit 5B, mitochondrial (98%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,;;Target=Funhe2EKm029180t3 aalen=131,99%,partial; clen=395; offs=3-395; oid=Funhe2Exx11m009875t3,Fungr1EG3m027840t1; organism=Fundulus_heteroclitus; type=protein; isoform=3; Name=Cytochrome c oxidase subunit 5B, mitochondrial (98%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,; 0 99 +;database=kfish2rae5g.pub.aa
Transcript_100014 shmlast.LAST conditional_reciprocal_best_LAST 31 122 2.300000e-56 + . ID=homology:1228514;Name=ENSFHEP00000007063.1 pep primary_assembly:Fundulus_heteroclitus-3.0.2:KN811366.1:97972:103002:-1 gene:ENSFHEG00000000204.1 transcript:ENSFHET00000004439.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:cytochrome c oxidase subunit 5B, mitochondrial [Source:NCBI gene;Acc:105931138];Target=ENSFHEP00000007063.1 pep primary_assembly:Fundulus_heteroclitus-3.0.2:KN811366.1:97972:103002:-1 gene:ENSFHEG00000000204.1 transcript:ENSFHET00000004439.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:cytochrome c oxidase subunit 5B, mitochondrial [Source:NCBI gene;Acc:105931138] 0 92 +;database=Fundulus_heteroclitus.Fundulus_heteroclitus-3.0.2.pep.all.fa
Transcript_100014 shmlast.LAST conditional_reciprocal_best_LAST 31 122 7.500000e-56 + . ID=homology:748370;Name=Funhe2EKm029180t1 aalen=128,75%,complete; clen=513; offs=34-420; oid=Funhe2Exx11m009875t2,Funhe2E6bm045748t1; organism=Fundulus_heteroclitus; type=protein; isoform=1; Name=Cytochrome c oxidase subunit 5B, mitochondrial (100%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,;;Target=Funhe2EKm029180t1 aalen=128,75%,complete; clen=513; offs=34-420; oid=Funhe2Exx11m009875t2,Funhe2E6bm045748t1; organism=Fundulus_heteroclitus; type=protein; isoform=1; Name=Cytochrome c oxidase subunit 5B, mitochondrial (100%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,; 0 92 +;database=kfish2rae5g.pub.aa
Transcript_100014 transdecoder CDS 632 928 . - 0 ID=cds.Gene.220552::Transcript_100014::g.220552::m.220552;Parent=Gene.220552::Transcript_100014::g.220552::m.220552
Transcript_100014 transdecoder exon 1 1019 . - . ID=Gene.220552::Transcript_100014::g.220552::m.220552.exon1;Parent=Gene.220552::Transcript_100014::g.220552::m.220552
Transcript_100014 transdecoder five_prime_UTR 929 1019 . - . ID=Gene.220552::Transcript_100014::g.220552::m.220552.utr5p1;Parent=Gene.220552::Transcript_100014::g.220552::m.220552
Transcript_100014 transdecoder gene 1 1019 . - . ID=Gene.220552::Transcript_100014::g.220552;Name=ORF%20type%3Acomplete%20len%3A99%20%28-%29
Transcript_100014 transdecoder mRNA 1 1019 . - . ID=Gene.220552::Transcript_100014::g.220552::m.220552;Parent=Gene.220552::Transcript_100014::g.220552;Name=ORF%20type%3Acomplete%20len%3A99%20%28-%29
Transcript_100014 transdecoder three_prime_UTR 1 631 . - . ID=Gene.220552::Transcript_100014::g.220552::m.220552.utr3p1;Parent=Gene.220552::Transcript_100014::g.220552::m.220552
Note that some custom amino acid fasta files might have semicolons. These need to be fixed before using in dammit because of this issue with dammit.