broadinstitute/SpliceAI-lookup

Wrong file on annotations download site

Closed · 2 comments

jzluo commented

At https://spliceailookup-api.broadinstitute.org/annotations, the file
gencode.v44.basic.annotation.txt.gz appears to be just a gzip-compressed copy of gencode.v44.basic.annotation.transcript_annotations.json.

bw2 commented

I don't understand the issue.

The file contents of gencode.v44.basic.annotation.txt.gz and gencode.v44.basic.annotation.transcript_annotations.json are quite different:

$ curl -s https://spliceailookup-api.broadinstitute.org/annotations/gencode.v44.basic.annotation.transcript_annotations.json | head -n 30
{
    "ENST00000000233": {
        "g_id": "ENSG00000004059.11",
        "g_name": "ARF5",
        "t_id": "ENST00000000233.10",
        "t_priority": "MS",
        "t_refseq_ids": [
            "NM_001662.4"
        ],
        "t_strand": "+",
        "t_type": "protein_coding"
    },
    "ENST00000000412": {
        "g_id": "ENSG00000003056.8",
        "g_name": "M6PR",
        "t_id": "ENST00000000412.8",
        "t_priority": "MS",
        "t_refseq_ids": [
            "NM_002355.4"
        ],
        "t_strand": "-",
...

and

$ curl -s https://spliceailookup-api.broadinstitute.org/annotations/gencode.v44.basic.annotation.txt.gz | gunzip -c - 
#NAME   CHROM   STRAND  TX_START        TX_END  EXON_START      EXON_END
ENST00000456328.2       chr1    +       11868   14409   11868,12612,13220,      12227,12721,14409,
ENST00000450305.2       chr1    +       12009   13670   12009,12178,12612,12974,13220,13452,    12057,12227,12697,13052,13374,13670,
ENST00000488147.1       chr1    -       14403   29570   14403,15004,15795,16606,16857,17232,17605,17914,18267,24737,29533,      14501,15038,15947,16765,17055,17368,17742,18061,18366,24891,29570,
ENST00000619216.1       chr1    -       17368   17436   17368,  17436,
ENST00000473358.1       chr1    +       29553   31097   29553,30563,30975,      30039,30667,31097,
ENST00000607096.1       chr1    +       30365   30503   30365,  30503,
ENST00000417324.1       chr1    -       34553   36081   34553,35276,35720,      35174,35481,36081,
ENST00000606857.1       chr1    +       52472   53312   52472,  53312,
ENST00000642116.1       chr1    +       57597   64116   57597,58699,62915,      57653,58856,64116,
...

jzluo commented

Sorry, I'm not sure what happened. The file I downloaded twice earlier today had different contents, but redownloading just now gave the expected file.
old file:

$ zcat gencode.v44.basic.annotation.txt.gz | head
{
    "ENST00000000233": {
        "g_id": "ENSG00000004059.11",
        "g_name": "ARF5",
        "t_id": "ENST00000000233.10",
        "t_priority": "MS",
        "t_refseq_ids": [
            "NM_001662.4"
        ],
        "t_strand": "+",

redownload:

$ zcat gencode.v44.basic.annotation.txt.gz | head
#NAME	CHROM	STRAND	TX_START	TX_END	EXON_START	EXON_END
ENST00000456328.2	chr1	+	11868	14409	11868,12612,13220,	12227,12721,14409,
ENST00000450305.2	chr1	+	12009	13670	12009,12178,12612,12974,13220,13452,	12057,12227,12697,13052,13374,13670,
ENST00000488147.1	chr1	-	14403	29570	14403,15004,15795,16606,16857,17232,17605,17914,18267,24737,29533,	14501,15038,15947,16765,17055,17368,17742,18061,18366,24891,29570,
ENST00000619216.1	chr1	-	17368	17436	17368,	17436,
ENST00000473358.1	chr1	+	29553	31097	29553,30563,30975,	30039,30667,31097,
ENST00000607096.1	chr1	+	30365	30503	30365,	30503,
ENST00000417324.1	chr1	-	34553	36081	34553,35276,35720,	35174,35481,36081,
ENST00000606857.1	chr1	+	52472	53312	52472,	53312,
ENST00000642116.1	chr1	+	57597	64116	57597,58699,62915,	57653,58856,64116,
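
For anyone who runs into the same mismatch, a quick check of the first line of the downloaded file is enough to tell the two contents apart. Below is a minimal Python sketch, not part of SpliceAI-lookup: the function name `sniff_annotation_file` and the labels it returns are made up for illustration, and it assumes the exon table is tab-separated with exactly the header line shown in the redownloaded output above.

```python
import gzip

# Header of the exon table as it appears in the redownloaded file above.
# Assumption: the columns are separated by single tab characters.
EXPECTED_HEADER = "#NAME\tCHROM\tSTRAND\tTX_START\tTX_END\tEXON_START\tEXON_END"


def sniff_annotation_file(path):
    """Classify a gzip-compressed annotation file by its first line.

    Returns "exon_table" if the first line is the expected header,
    "json" if the content starts with an opening brace, else "unknown".
    (Hypothetical helper; the labels are illustrative only.)
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        first_line = handle.readline().rstrip("\r\n")
    if first_line == EXPECTED_HEADER:
        return "exon_table"
    if first_line.lstrip().startswith("{"):
        return "json"
    return "unknown"
```

Calling `sniff_annotation_file("gencode.v44.basic.annotation.txt.gz")` on the local download should return `"exon_table"` for the file as it is served now, and `"json"` for a file like the one from the earlier downloads.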