How to get Dbxref=GeneID:100287102
schmucr1 opened this issue · 4 comments
Hello !
Would it be possible to get the GeneID (Entrezgene) for the annotations of NCBI genomes ?
I noticed that in the downloaded temporary gff file the GeneID information is available by the annotation field "Dbxref=GeneID:100287102".
However, in the final gtf file there are gene symbols in the "gene_id" fields. This is often redundant with the "gene_name" fields.
Thanks,
Roland
Hey again Roland,
unfortunately we can't control this part of the annotation conversion tools (they are made by UCSC), so we can't keep the original entrezgene IDs.
We can however get the entrezgene IDs back! Have a look at the following python code:
import genomepy
a = genomepy.Annotation("GRCh38.p12", genomes_dir="/projects/site/pred/beda/genomepy_genomes/NCBI/")
# filter the annotation.bed file down to the main chromosomes
bed = a.filter_regex("bed", "^\d|^MT")
# check
set(bed.chrom)
# I noticed the folks at NCBI liked to add "rna-" and "gene-" in front of names. that won't do.
names = bed.name.str.split("-", n=1, expand=True)[1]
bed.name = names
# check
set(bed.name)
# map the gene names to entrezgene IDs
entr = a.map_genes("entrezgene", annot=bed)
# check
bed.shape
entr.shape
# save bed file
genomepy.annotation.utils.write_annot(entr, "GRCh38.p12.annotation.bed")
# generate a gtf file from the bed file
genomepy.annotation.utils.generate_annot("GRCh38.p12.annotation.bed", "GRCh38.p12.annotation.gtf")
I did lose a number of genes during this conversion, which apparently weren't found in the (MyGene) conversion database. Probably this was because of the str.split
, which I updated here to only split once.
Dear Siebren
Thank you for the python code to generate a file with entrezgene annotations.
I did exactly as you suggested.
This gives me entrezgene ids for certain genes/transcripts but for others it concatenates additional integer numbers:
For example,
1 genomepy transcript 89006 174659 . - . gene_id "100996442_13"; transcript_id "100996442_13";
1 genomepy exon 89006 91629 . - . gene_id "100996442_13"; transcript_id "100996442_13"; exon_number "1"; exon_id "100996442_13.1";
1 genomepy exon 92091 92240 . - . gene_id "100996442_13"; transcript_id "100996442_13"; exon_number "2"; exon_id "100996442_13.2";
Moreover, the transcript_id are now also composed of the entrezgene number.
What I am thinking of is a gtf file with this structure
chr19 BestRefSeq exon 58347353 58347640 . - . gene_id "1"; gene_name "A1BG"; transcript_id "NM_130786.4"; description "alpha-1-B glycoprotein";
chr19 BestRefSeq exon 58350370 58350651 . - . gene_id "1"; gene_name "A1BG"; transcript_id "NM_130786.4"; description "alpha-1-B glycoprotein";
chr19 BestRefSeq exon 58351391 58351687 . - . gene_id "1"; gene_name "A1BG"; transcript_id "NM_130786.4"; description "alpha-1-B glycoprotein";
Entrezgene in "gene_id", gene symbol in "gene_name" and mRNA id in "transcript_id".
Is that possible with genomepy, or do you suggest to write my own code to get there?
Many thanks in advance,
R.
You might be able to parse this together from the intermediate files (they can be found here). I tried this, but the UCSC conversion tools that genomepy uses cannot handle the gtf file there.
If you want a richer gene annotation file, you might be better off going with GRCh38.p13 from Ensembl, or one of the hg38 annotations from UCSC:
$ genomepy annotation GRCh38.p13
16:08:48 | INFO | Ensembl
1 ensembl_havana gene 685679 686673 . - . gene_id "ENSG00000284662"; gene_version "1"; gene_name "OR4F16"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
1 ensembl_havana transcript 685679 686673 . - . gene_id "ENSG00000284662"; gene_version "1"; transcript_id "ENST00000332831"; transcript_version "4"; gene_name "OR4F16"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR4F16-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS41221"; tag "basic"; transcript_support_level "NA (assigned to previous version 3)";
16:09:51 | INFO | NCBI
NC_000001.11 genomepy transcript 11874 12227 . + . gene_id "DDX11L1"; transcript_id "rna-NR_046018.2"; gene_name "DDX11L1";
NC_000001.11 genomepy exon 11874 12227 . + . gene_id "DDX11L1"; transcript_id "rna-NR_046018.2"; exon_number "1"; exon_id "rna-NR_046018.2.1"; gene_name "DDX11L1";
$ genomepy annotation hg38 -p ucsc
16:11:12 | INFO | UCSC ncbiRefSeq
chr1 genomepy transcript 11874 14409 . + . gene_id "DDX11L1"; transcript_id "NR_046018.2"; gene_name "DDX11L1";
chr1 genomepy exon 11874 12227 . + . gene_id "DDX11L1"; transcript_id "NR_046018.2"; exon_number "1"; exon_id "NR_046018.2.1"; gene_name "DDX11L1";
16:11:15 | INFO | UCSC refGene
chr1 genomepy transcript 14362 29370 . - . gene_id "WASH7P"; transcript_id "NR_024540"; gene_name "WASH7P";
chr1 genomepy exon 14362 14829 . - . gene_id "WASH7P"; transcript_id "NR_024540"; exon_number "1"; exon_id "NR_024540.1"; gene_name "WASH7P";
16:11:18 | INFO | UCSC knownGene
chr1 genomepy transcript 11869 14409 . + . gene_id "ENST00000456328.2"; transcript_id "ENST00000456328.2";
chr1 genomepy exon 11869 12227 . + . gene_id "ENST00000456328.2"; transcript_id "ENST00000456328.2"; exon_number "1"; exon_id "ENST00000456328.2.1";
Nice!
Thanks also for the workaround solution.