tariqdaouda/pyGeno

Out of frame protein sequences


Problem: proteins whose translation start sites are uncertain yield out-of-frame sequences.
Solution: the frame of the first exon should be taken into account when generating the CDS.

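# imports follow pyGeno's usual layout
from pyGeno.Genome import *
from pyGeno.tools.UsefulFunctions import translateDNA, showDifferences

refGenome = Genome(name="GRCh38.80")
refProt = refGenome.get(Protein, id="ENSP00000349216")[0]
print("pyGeno")
print(refProt.sequence)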
gencode_seq = (
    "XHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQS"
    "RCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLV"
    "SALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGL"
    "AQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGF"
    "LPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQ"
    "RRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTG"
    "ARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFP"
    "YAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDG"
    "ETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPEREL"
    "GTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC"
)
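print("GENCODE")
print(gencode_seq)
# frame of the first exon, as annotated by Ensembl
first_exon_frame = refProt.transcript.exons[0].frame
print(first_exon_frame)
# drop the last three bases (presumably the stop codon), translate in the frame
# given by the first exon, and prefix "X" for the incomplete first codon
new_seq = "X" + translateDNA(refProt.transcript.cDNA[0:-3], frame="f" + str(1 + first_exon_frame))
print("Corrected sequence")
print(new_seq)
print(showDifferences(gencode_seq, new_seq))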
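
To illustrate why the first exon's frame matters, here is a minimal standalone sketch that does not use pyGeno. The `translate` helper and the tiny codon table are hypothetical and cover only the codons in this toy example. It assumes the convention where the frame value (0, 1 or 2) is the number of leading bases to skip before the first complete codon.

```python
# Toy codon table covering only the codons used below.
# Hypothetical helper for illustration, not pyGeno's API.
CODONS = {
    "AAA": "K", "GGC": "G", "TGA": "*",
    "CAA": "Q", "AGG": "R", "CTG": "L",
}

def translate(dna, skip=0):
    """Translate dna after skipping `skip` leading bases; unknown codons become 'X'."""
    protein = []
    for i in range(skip, len(dna) - 2, 3):
        protein.append(CODONS.get(dna[i:i + 3], "X"))
    return "".join(protein)

# An incomplete 5' end: one stray base precedes the first full codon.
cdna = "C" + "AAAGGCTGA"

in_frame = translate(cdna, skip=1)      # honours the first exon's frame
out_of_frame = translate(cdna, skip=0)  # ignores it

print(in_frame)
print(out_of_frame)
```

Skipping the stray base recovers the intended codons, while translating from position 0 reads every codon one base too early, which is the kind of mismatch reported above.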

Hi,

Thanks for the issue. By default, pyGeno abides by the information provided by Ensembl. But if you know which proteins you are interested in, pyGeno provides tools for translating them in the reading frame of your choosing.

If you don't want to apply an offset, simply do:

import pyGeno.tools.UsefulFunctions as uf

uf.translateDNA(refProt.transcript.cDNA, frame = "f2")

If you want to apply an offset:

import pyGeno.tools.UsefulFunctions as uf

#get the exons
exons = refProt.transcript.exons

#apply the offset to the first exon
e = exons[0]
CDS1 = refProt.chromosome.getSequence(e.CDS_start - 1, e.CDS_end)

#this will contain the shifted sequence
protSeq = [CDS1]
#loop through the other exons
for e in exons[1:]:
  protSeq.append(e.CDS)

#concatenate the sequences
protSeq = ''.join(protSeq)

uf.translateDNA(protSeq, frame = "f2")
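
The splicing step above can also be sketched without pyGeno, which may help when checking coordinates by hand. In this rough illustration, `spliced_cds`, the toy chromosome string and the ranges are all hypothetical. It assumes 0-based, half-open coordinates on the forward strand; a reverse-strand transcript would need reverse-complementing and an offset on the other end.

```python
def spliced_cds(chrom_seq, cds_ranges, first_offset=0):
    """Join CDS segments, widening the first one by `first_offset` bases on its 5' side.

    cds_ranges: list of (start, end) tuples, 0-based, half-open, forward strand
    (an assumption of this sketch, not a statement about pyGeno's coordinates).
    """
    parts = []
    for idx, (start, end) in enumerate(cds_ranges):
        if idx == 0:
            # shift only the first exon's start, never below position 0
            start = max(0, start - first_offset)
        parts.append(chrom_seq[start:end])
    return "".join(parts)

# Toy chromosome with two coding segments separated by an intron.
chrom = "TTCAAAGGTTTTGCTGATT"
ranges = [(3, 8), (12, 17)]

unshifted = spliced_cds(chrom, ranges)
shifted = spliced_cds(chrom, ranges, first_offset=1)

print(unshifted)
print(shifted)
```

The shifted sequence is one base longer at its 5' end, so it has to be translated in a different frame than the unshifted one to yield the same codons.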