dputhier/libgtftk

get_sequence

Closed this issue · 9 comments

ENST00000624697 sequence has an additional T compare to

http://www.ensembl.org/Homo_sapiens/Transcript/Sequence_cDNA?db=core;g=ENSG00000187608;r=1:1001145-1014435;t=ENST00000624697

CGGCCTCACGCGTCCGTGCAGCGGAGGCTTCCTGAGCCCCCTGGAGAGCCTGGCCTGGGCCCGGGTGTGGAGACCCTCCCGGGCTTTCAATCCGGGCAGGAGGCAGATGGCAGACTCAGCAGTCACGTAAGAGAACCGAATTAGGAATCGCTTAATTCTGAAAGTCTGGTGAGAAGACACGGCGAGAATCGGGGTCCAGCACAGATGATGGCGACAGCGGAGAAGGGAAGGGCTGGGACCTGACGGTGAAGATGCTGGCGGGCAACGAATTCCAGGTGTCCCTGAGCAGCTCCATGTCGGTGTCAGAGCTGAAGGCGCAGATCACCCAGAAGATCGGCGTGCACGCCTTCCAGCAGCGTCTGGCTGTCCACCCGAGCGGTGTGGCGCTGCAGGACAGGGTCCCCCTTGCCAGCCAGGGCCTGGGCCCCGGCAGCACGGTCCTGCTGGTGGTGGACAAATGCGACGAACCTCTGAGCATCCTGGTGAGGAATAACAAGGGCCGCAGCAGCACCTACGAGGTACGGCTGACGCAGACCGTGGCCCACCTGAAGCAGCAAGTGAGCGGGCTGGAGGGTGTGCAGGACGACCTGTTCTGGCTGACCTTCGAGGGGAAGCCCCTGGAGGACCAGCTCCCGCTGGGGGAGTACGGCCTCAAGCCCCTGAGCACCGTGTTCATGAATCTGCGCCTGCGGGGAGGCGGCACAGAGCCTGGCGGGCGGAGCTAAGGGCCTCCACCAGCATCCGAGCAGGATCAAGGGCCGGAAATAAAGGCTGTTGTAAAGAGAAAT

When I use the get_sequence function of the C library on ENST00000624697 transcript, I get the right sequence, without the final T

I guess this is a release issue. To be checked.

I just did a push in the develop branch. I still need to reorganize this git repository ...

Didn't see this push in develop branch. It seems I'm already up to date.

I mean an ensembl release issue.

I use this version of the genome : Homo_sapiens.GRCh38.dna.toplevel.fa (size = 48669025517)
and this version of the GTF : Homo_sapiens.GRCh38.94.gtf (size = 1163315582)

First, I get all that is related with the transcript in the gtf :
gtftk --select-by-key transcript_id:ENST00000624697 Homo_sapiens.GRCh38.94.gtf > ENST00000624697.gtf
And I get the sequence :
gtftk --get-fasta Homo_sapiens.GRCh38.dna.toplevel.fa ENST00000624697.gtf

>ENSG00000187608_ISG15_ENST00000624697_1:1001138-1014540_+_mRNA
CGGCCTCACGCGTCCGTGCAGCGGAGGCTTCCTGAGCCCCCTGGAGAGCCTGGCCTGGGC
CCGGGTGTGGAGACCCTCCCGGGCTTTCAATCCGGGCAGGAGGCAGATGGCAGACTCAGC
AGTCACGTAAGAGAACCGAATTAGGAATCGCTTAATTCTGAAAGTCTGGTGAGAAGACAC
GGCGAGAATCGGGGTCCAGCACAGATGATGGCGACAGCGGAGAAGGGAAGGGCTGGGACC
TGACGGTGAAGATGCTGGCGGGCAACGAATTCCAGGTGTCCCTGAGCAGCTCCATGTCGG
TGTCAGAGCTGAAGGCGCAGATCACCCAGAAGATCGGCGTGCACGCCTTCCAGCAGCGTC
TGGCTGTCCACCCGAGCGGTGTGGCGCTGCAGGACAGGGTCCCCCTTGCCAGCCAGGGCC
TGGGCCCCGGCAGCACGGTCCTGCTGGTGGTGGACAAATGCGACGAACCTCTGAGCATCC
TGGTGAGGAATAACAAGGGCCGCAGCAGCACCTACGAGGTACGGCTGACGCAGACCGTGG
CCCACCTGAAGCAGCAAGTGAGCGGGCTGGAGGGTGTGCAGGACGACCTGTTCTGGCTGA
CCTTCGAGGGGAAGCCCCTGGAGGACCAGCTCCCGCTGGGGGAGTACGGCCTCAAGCCCC
TGAGCACCGTGTTCATGAATCTGCGCCTGCGGGGAGGCGGCACAGAGCCTGGCGGGCGGA
GCTAAGGGCCTCCACCAGCATCCGAGCAGGATCAAGGGCCGGAAATAAAGGCTGTTGTAA
AGAGAAA

I will close this one because I got the same results as you and I'm not able, anymore (???), to find the 'T' at the end of the sequence obtained from ensembl. Mysterious...

http://www.ensembl.org/Homo_sapiens/Transcript/Sequence_cDNA?db=core;g=ENSG00000187608;r=1:1001145-1014435;t=ENST00000624697

OK. This difference of 1 nucleotide is really due to a difference in ensembl version. The mini_real dataset (and its derivatives) and current version of ensembl contain coordinates for ENST00000624697 that differ from 1 nucleotide:

minireal:

chr1 1001137 1001281 ENSG00000187608|ENST00000624697 0 +
chr1 1008193 1008279 ENSG00000187608|ENST00000624697 0 +
chr1 1013983 1014541 ENSG00000187608|ENST00000624697 0 +

Current ensembl version

chr1 1001137 1001281 ENSG00000187608|ENST00000624697 . +
chr1 1008193 1008279 ENSG00000187608|ENST00000624697 . +
chr1 1013983 1014540 ENSG00000187608|ENST00000624697 . +

So the expected size using mini_real dataset is 788 not 787...

The test was fixed (for this guy) in b8c8f4630f960875313b90926202975e9b2bab6f

The same problem was observed with ENST00000635405 (observed size 529, expected 4457). Here again, the problem is related to the fact that sequences were obtained from a more recent release.

fix in bc4452d34e412cb5f5d77b135ab2757e04fb0322.