dib-lab/elvers

how should we generate the gene to transcript map?

ctb opened this issue · 6 comments

ctb commented

@ljcohen has some code, maybe... :)

import pandas as pd
from dammit.fileio.gff3 import GFF3Parser
gff_file = "trinity.nema.fasta.dammit.gff3"
annotations = GFF3Parser(filename=gff_file).read()
names = annotations.sort_values(by=['seqid', 'score'], ascending=True).query('score < 1e-05').drop_duplicates(subset='seqid')[['seqid', 'Name']]
new_file = names.dropna(axis=0,how='all')
new_file.head()
new_file.to_csv("nema_gene_name_id.csv")

dib-lab/dammit#131

Now that dammit has our --no-rename, flag implemented, might be worth using that to stop dammit from renaming trinity transcripts, so that we maintain trinity "gene" and "transcript" designations.

Code above still required if we want to instead build gene-transcript relationships based on annotations.

ctb commented

Not sure - I'm actually installing dammit from master
https://github.com/dib-lab/eelpond/blob/master/rules/dammit/dammit-env.yaml

(so the feature is implemented). Camille fixed a downstream issue with my implementation in ~ dec, so I think it should be fully working now.

talking with @mandyfrazier wanting to revive discussion about incorporating a script to make the tx2gene file from dammit gff3 output. This requires making decisions about which annotation to pick per contig, e.g. sorting by e-value, length of annotation, choosing custom database annotations. Word cloud? (as suggested by @ctb)

Example below, one transcript which should be annotated COX5B. How to programmatically choose that?

Transcript_100014	HMMER	protein_hmm_match	76	273	3.300000e-17	.	.	ID=homology:234985;Name=COX5B;Target=COX5B 19 86 +;Note=Cytochrome c oxidase subunit Vb;accuracy=0.86;env_coords=28 285;Dbxref="Pfam:PF01215.15"
Transcript_100014	LAST	translated_nucleotide_match	92	367	3.200000e-60	-	.	ID=homology:390878;Name=ENSXMAP00000000371;Target=ENSXMAP00000000371 0 90 +;database=OrthoDB
Transcript_100014	LAST	translated_nucleotide_match	92	367	5.700000e-36	-	.	ID=homology:511790;Name=sp|P10606|COX5B_HUMAN;Target=sp|P10606|COX5B_HUMAN 0 93 +;database=sprot
Transcript_100014	shmlast.LAST	conditional_reciprocal_best_LAST	183	224	2.000000e-19	+	.	ID=homology:748371;Name=Funhe2EKm029180t1 aalen=128,75%,complete; clen=513; offs=34-420; oid=Funhe2Exx11m009875t2,Funhe2E6bm045748t1; organism=Fundulus_heteroclitus; type=protein; isoform=1; Name=Cytochrome c oxidase subunit 5B, mitochondrial (100%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,;;Target=Funhe2EKm029180t1 aalen=128,75%,complete; clen=513; offs=34-420; oid=Funhe2Exx11m009875t2,Funhe2E6bm045748t1; organism=Fundulus_heteroclitus; type=protein; isoform=1; Name=Cytochrome c oxidase subunit 5B, mitochondrial (100%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,; 86 128 +;database=kfish2rae5g.pub.aa
Transcript_100014	shmlast.LAST	conditional_reciprocal_best_LAST	232	297	2.100000e-36	+	.	ID=homology:748366;Name=Funhe2EKm029180t6 aalen=116,66%,complete; clen=530; offs=138-488; oid=Funhe2Exx11m009875t6,Funhe2E6bm045748t2; organism=Fundulus_heteroclitus; type=protein; isoform=6; Name=Uncharacterized protein;;Target=Funhe2EKm029180t6 aalen=116,66%,complete; clen=530; offs=138-488; oid=Funhe2Exx11m009875t6,Funhe2E6bm045748t2; organism=Fundulus_heteroclitus; type=protein; isoform=6; Name=Uncharacterized protein; 45 111 +;database=kfish2rae5g.pub.aa
Transcript_100014	shmlast.LAST	conditional_reciprocal_best_LAST	24	118	7.500000e-56	+	.	ID=homology:748378;Name=Funhe2EKm029180t4 aalen=102,99%,partial; clen=309; offs=2-307; oid=Funhe2Exx11m009875t4,Fungr1EG3m027840t2; organism=Fundulus_heteroclitus; type=protein; isoform=4; Name=Cytochrome c oxidase subunit 5B, mitochondrial (73%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,;;Target=Funhe2EKm029180t4 aalen=102,99%,partial; clen=309; offs=2-307; oid=Funhe2Exx11m009875t4,Fungr1EG3m027840t2; organism=Fundulus_heteroclitus; type=protein; isoform=4; Name=Cytochrome c oxidase subunit 5B, mitochondrial (73%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,; 3 98 +;database=kfish2rae5g.pub.aa
Transcript_100014	shmlast.LAST	conditional_reciprocal_best_LAST	24	122	1.000000e-60	+	.	ID=homology:748362;Name=Funhe2EKm029180t3 aalen=131,99%,partial; clen=395; offs=3-395; oid=Funhe2Exx11m009875t3,Fungr1EG3m027840t1; organism=Fundulus_heteroclitus; type=protein; isoform=3; Name=Cytochrome c oxidase subunit 5B, mitochondrial (98%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,;;Target=Funhe2EKm029180t3 aalen=131,99%,partial; clen=395; offs=3-395; oid=Funhe2Exx11m009875t3,Fungr1EG3m027840t1; organism=Fundulus_heteroclitus; type=protein; isoform=3; Name=Cytochrome c oxidase subunit 5B, mitochondrial (98%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,; 0 99 +;database=kfish2rae5g.pub.aa
Transcript_100014	shmlast.LAST	conditional_reciprocal_best_LAST	31	122	2.300000e-56	+	.	ID=homology:1228514;Name=ENSFHEP00000007063.1 pep primary_assembly:Fundulus_heteroclitus-3.0.2:KN811366.1:97972:103002:-1 gene:ENSFHEG00000000204.1 transcript:ENSFHET00000004439.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:cytochrome c oxidase subunit 5B, mitochondrial [Source:NCBI gene;Acc:105931138];Target=ENSFHEP00000007063.1 pep primary_assembly:Fundulus_heteroclitus-3.0.2:KN811366.1:97972:103002:-1 gene:ENSFHEG00000000204.1 transcript:ENSFHET00000004439.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:cytochrome c oxidase subunit 5B, mitochondrial [Source:NCBI gene;Acc:105931138] 0 92 +;database=Fundulus_heteroclitus.Fundulus_heteroclitus-3.0.2.pep.all.fa
Transcript_100014	shmlast.LAST	conditional_reciprocal_best_LAST	31	122	7.500000e-56	+	.	ID=homology:748370;Name=Funhe2EKm029180t1 aalen=128,75%,complete; clen=513; offs=34-420; oid=Funhe2Exx11m009875t2,Funhe2E6bm045748t1; organism=Fundulus_heteroclitus; type=protein; isoform=1; Name=Cytochrome c oxidase subunit 5B, mitochondrial (100%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,;;Target=Funhe2EKm029180t1 aalen=128,75%,complete; clen=513; offs=34-420; oid=Funhe2Exx11m009875t2,Funhe2E6bm045748t1; organism=Fundulus_heteroclitus; type=protein; isoform=1; Name=Cytochrome c oxidase subunit 5B, mitochondrial (100%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,; 0 92 +;database=kfish2rae5g.pub.aa
Transcript_100014	transdecoder	CDS	632	928	.	-	0	ID=cds.Gene.220552::Transcript_100014::g.220552::m.220552;Parent=Gene.220552::Transcript_100014::g.220552::m.220552
Transcript_100014	transdecoder	exon	1	1019	.	-	.	ID=Gene.220552::Transcript_100014::g.220552::m.220552.exon1;Parent=Gene.220552::Transcript_100014::g.220552::m.220552
Transcript_100014	transdecoder	five_prime_UTR	929	1019	.	-	.	ID=Gene.220552::Transcript_100014::g.220552::m.220552.utr5p1;Parent=Gene.220552::Transcript_100014::g.220552::m.220552
Transcript_100014	transdecoder	gene	1	1019	.	-	.	ID=Gene.220552::Transcript_100014::g.220552;Name=ORF%20type%3Acomplete%20len%3A99%20%28-%29
Transcript_100014	transdecoder	mRNA	1	1019	.	-	.	ID=Gene.220552::Transcript_100014::g.220552::m.220552;Parent=Gene.220552::Transcript_100014::g.220552;Name=ORF%20type%3Acomplete%20len%3A99%20%28-%29
Transcript_100014	transdecoder	three_prime_UTR	1	631	.	-	.	ID=Gene.220552::Transcript_100014::g.220552::m.220552.utr3p1;Parent=Gene.220552::Transcript_100014::g.220552::m.220552

Note that some custom amino acid fasta files might have semicolons. These need to be fixed before using in dammit because of this issue with dammit.