vanheeringen-lab/genomepy

How to get Dbxref=GeneID:100287102

schmucr1 opened this issue · 4 comments

Hello !

Would it be possible to get the GeneID (Entrezgene) for the annotations of NCBI genomes ?
I noticed that in the downloaded temporary gff file the GeneID information is available by the annotation field "Dbxref=GeneID:100287102".
However, in the final gtf file there are gene symbols in the "gene_id" fields. This is often redundant with the "gene_name" fields.

Thanks,
Roland

Hey again Roland,

unfortunately we can't control this part of the annotation conversion tools (they are made by UCSC), so we can't keep the original entrezgene IDs.

We can however get the entrezgene IDs back! Have a look at the following python code:

import genomepy

a = genomepy.Annotation("GRCh38.p12", genomes_dir="/projects/site/pred/beda/genomepy_genomes/NCBI/")

# filter the annotation.bed file down to the main chromosomes
bed = a.filter_regex("bed", "^\d|^MT")

# check
set(bed.chrom)

# I noticed the folks at NCBI liked to add "rna-" and "gene-" in front of names. that won't do.
names = bed.name.str.split("-", n=1, expand=True)[1]
bed.name = names

# check
set(bed.name)

# map the gene names to entrezgene IDs
entr = a.map_genes("entrezgene", annot=bed)

# check
bed.shape
entr.shape

# save bed file
genomepy.annotation.utils.write_annot(entr, "GRCh38.p12.annotation.bed")

# generate a gtf file from the bed file
genomepy.annotation.utils.generate_annot("GRCh38.p12.annotation.bed", "GRCh38.p12.annotation.gtf")

I did lose a number of genes during this conversion, which apparently weren't found in the (MyGene) conversion database. Probably this was because of the str.split, which I updated here to only split once.

Dear Siebren

Thank you for the python code to generate a file with entrezgene annotations.
I did exactly as you suggested.

This gives me entrezgene ids for certain genes/transcripts but for others it concatenates additional integer numbers:
For example,

1       genomepy        transcript      89006   174659  .       -       .       gene_id "100996442_13"; transcript_id "100996442_13"; 
1       genomepy        exon    89006   91629   .       -       .       gene_id "100996442_13"; transcript_id "100996442_13"; exon_number "1"; exon_id "100996442_13.1";
1       genomepy        exon    92091   92240   .       -       .       gene_id "100996442_13"; transcript_id "100996442_13"; exon_number "2"; exon_id "100996442_13.2";

Moreover, the transcript_id are now also composed of the entrezgene number.
What I am thinking of is a gtf file with this structure

chr19   BestRefSeq      exon    58347353        58347640        .       -       .       gene_id "1"; gene_name "A1BG"; transcript_id "NM_130786.4"; description "alpha-1-B glycoprotein";
chr19   BestRefSeq      exon    58350370        58350651        .       -       .       gene_id "1"; gene_name "A1BG"; transcript_id "NM_130786.4"; description "alpha-1-B glycoprotein";
chr19   BestRefSeq      exon    58351391        58351687        .       -       .       gene_id "1"; gene_name "A1BG"; transcript_id "NM_130786.4"; description "alpha-1-B glycoprotein";

Entrezgene in "gene_id", gene symbol in "gene_name" and mRNA id in "transcript_id".
Is that possible with genomepy, or do you suggest to write my own code to get there?

Many thanks in advance,
R.

You might be able to parse this together from the intermediate files (they can be found here). I tried this, but the UCSC conversion tools that genomepy uses cannot handle the gtf file there.

If you want a richer gene annotation file, you might be better off going with GRCh38.p13 from Ensembl, or one of the hg38 annotations from UCSC:

$ genomepy annotation GRCh38.p13
16:08:48 | INFO | Ensembl
1       ensembl_havana  gene    685679  686673  .       -       .       gene_id "ENSG00000284662"; gene_version "1"; gene_name "OR4F16"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
1       ensembl_havana  transcript      685679  686673  .       -       .       gene_id "ENSG00000284662"; gene_version "1"; transcript_id "ENST00000332831"; transcript_version "4"; gene_name "OR4F16"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR4F16-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS41221"; tag "basic"; transcript_support_level "NA (assigned to previous version 3)";
16:09:51 | INFO | NCBI
NC_000001.11    genomepy        transcript      11874   12227   .       +       .       gene_id "DDX11L1"; transcript_id "rna-NR_046018.2";  gene_name "DDX11L1";
NC_000001.11    genomepy        exon    11874   12227   .       +       .       gene_id "DDX11L1"; transcript_id "rna-NR_046018.2"; exon_number "1"; exon_id "rna-NR_046018.2.1"; gene_name "DDX11L1";

$ genomepy annotation hg38 -p ucsc
16:11:12 | INFO | UCSC ncbiRefSeq
chr1    genomepy        transcript      11874   14409   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2";  gene_name "DDX11L1";
chr1    genomepy        exon    11874   12227   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2"; exon_number "1"; exon_id "NR_046018.2.1"; gene_name "DDX11L1";
16:11:15 | INFO | UCSC refGene
chr1    genomepy        transcript      14362   29370   .       -       .       gene_id "WASH7P"; transcript_id "NR_024540";  gene_name "WASH7P";
chr1    genomepy        exon    14362   14829   .       -       .       gene_id "WASH7P"; transcript_id "NR_024540"; exon_number "1"; exon_id "NR_024540.1"; gene_name "WASH7P";
16:11:18 | INFO | UCSC knownGene
chr1    genomepy        transcript      11869   14409   .       +       .       gene_id "ENST00000456328.2"; transcript_id "ENST00000456328.2";
chr1    genomepy        exon    11869   12227   .       +       .       gene_id "ENST00000456328.2"; transcript_id "ENST00000456328.2"; exon_number "1"; exon_id "ENST00000456328.2.1";

Nice!
Thanks also for the workaround solution.