Why are some gene names present in `genomic_contexts` but not in `ensembl_ids`?
Closed this issue · 7 comments
Hi,
First, thank you for such a great package.
I have been retrieving gene data for certain variants with the gwasrapidd package, and I noticed that a variant can have inconsistent gene data in its ensembl_ids and genomic_contexts tables.
For example, let's assume I retrieve the data for a variant using the get_variants function. Some of the variant's gene names in the ensembl_ids table (or segment) might differ from those in the genomic_contexts table (or segment).
What could be the reason for this difference?
What is the difference between the genomic_contexts and the ensembl_ids of a variant in terms of genes?
Unfortunately, today I cannot reach the GWAS Catalog through the gwasrapidd package: when I run the functions, they return no data. Thus, I cannot add any example files.
Hi @mzzclb
Thank you for reaching out.
Because I am also having trouble retrieving data from the GWAS Catalog I can't check the issue you are reporting.
For the moment, check whether your problem might be related to this question: https://rmagno.eu/gwasrapidd/articles/faq.html#genomic-coordinates-of-genomic-contexts-seem-to-be-wrong.
Meanwhile I will check with the GWAS Catalog team why the server is not responding.
Thank you for replying.
What I mentioned is not really related to the topic at the link above.
I mean that a variant can have different sets of genes in its genomic_contexts and ensembl_ids segments.
Could you examine the pdf file I added as an example? I created it from rmarkdown.
ensembl_ids-and-genomic_context-of-a-variant.pdf
Hi @mzzclb
The GWAS Catalog is running well again, so perhaps you could provide a specific example illustrating your question. I will try to answer nevertheless based on what you wrote.
The `genomic_contexts` table provides all Ensembl and RefSeq genes mapping within 50kb upstream and downstream of each GWAS Catalog variant.
A specific gene is typically associated with only one Ensembl identifier, but there are cases where it is associated with more than one, e.g. when the gene is located in the haplotypic MHC region, see discussion here. The table `ensembl_ids` provides that info.
Here is an example:
library(gwasrapidd)
my_variants <- get_variants(variant_id = "rs2269423")
print(my_variants@genomic_contexts, n = 20)
#> # A tibble: 200 × 12
#> variant_id gene_name chromosome_name chromosome_position distance
#> <chr> <chr> <chr> <int> <int>
#> 1 rs2269423 FKBPL 6 32177930 47642
#> 2 rs2269423 PPT2 6 32177930 14252
#> 3 rs2269423 TNXB 6 32177930 68592
#> 4 rs2269423 NOTCH4 6 32177930 16913
#> 5 rs2269423 RNA5SP206 6 32177930 99302
#> 6 rs2269423 RNA5SP206 6 32177930 99302
#> 7 rs2269423 TSBP1-AS1 6 32177930 76710
#> 8 rs2269423 PPT2-EGFL8 6 32177930 5952
#> 9 rs2269423 FKBPL 6 32177930 47642
#> 10 rs2269423 GPSM3 6 32177930 12836
#> 11 rs2269423 PBX2 6 32177930 6803
#> 12 rs2269423 MIR6721 6 32177930 7814
#> 13 rs2269423 ATF6B 6 32177930 49677
#> 14 rs2269423 EGFL8 6 32177930 9649
#> 15 rs2269423 NOTCH4 6 32177930 16913
#> 16 rs2269423 LOC100507547 6 32177930 23565
#> 17 rs2269423 TNXB 6 32177930 62596
#> 18 rs2269423 AGPAT1 6 32177930 0
#> 19 rs2269423 MIR6833 6 32177930 1886
#> 20 rs2269423 PPT2 6 32177930 14255
#> # ℹ 180 more rows
#> # ℹ 7 more variables: is_mapped_gene <lgl>, is_closest_gene <lgl>,
#> # is_intergenic <lgl>, is_upstream <lgl>, is_downstream <lgl>, source <chr>,
#> # mapping_method <chr>
print(my_variants@ensembl_ids, n = 20)
#> # A tibble: 77 × 3
#> variant_id gene_name ensembl_id
#> <chr> <chr> <chr>
#> 1 rs2269423 FKBPL ENSG00000224200
#> 2 rs2269423 FKBPL ENSG00000204315
#> 3 rs2269423 FKBPL ENSG00000223666
#> 4 rs2269423 FKBPL ENSG00000230907
#> 5 rs2269423 PPT2 ENSG00000228116
#> 6 rs2269423 PPT2 ENSG00000206329
#> 7 rs2269423 PPT2 ENSG00000168452
#> 8 rs2269423 PPT2 ENSG00000206256
#> 9 rs2269423 PPT2 ENSG00000236649
#> 10 rs2269423 PPT2 ENSG00000221988
#> 11 rs2269423 PPT2 ENSG00000231618
#> 12 rs2269423 TNXB ENSG00000168477
#> 13 rs2269423 TNXB ENSG00000236236
#> 14 rs2269423 TNXB ENSG00000206258
#> 15 rs2269423 TNXB ENSG00000229353
#> 16 rs2269423 TNXB ENSG00000233323
#> 17 rs2269423 TNXB ENSG00000231608
#> 18 rs2269423 NOTCH4 ENSG00000235396
#> 19 rs2269423 NOTCH4 ENSG00000223355
#> 20 rs2269423 NOTCH4 ENSG00000204301
#> # ℹ 57 more rows
Created on 2023-07-04 with reprex v2.0.2
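To see how many Ensembl identifiers each gene name carries, the rows can be tallied per gene. The sketch below is only illustrative: it hard-codes the first 20 rows of the `ensembl_ids` output above into a data frame (in a live session one would use `my_variants@ensembl_ids` instead), and `ids_per_gene` is just a name chosen here. The counts cover those 20 rows only, not all 77.

```r
# First 20 rows of the ensembl_ids output shown above, typed in by hand
ensembl_ids <- data.frame(
  gene_name = rep(c("FKBPL", "PPT2", "TNXB", "NOTCH4"), times = c(4, 7, 6, 3)),
  ensembl_id = c(
    "ENSG00000224200", "ENSG00000204315", "ENSG00000223666", "ENSG00000230907",
    "ENSG00000228116", "ENSG00000206329", "ENSG00000168452", "ENSG00000206256",
    "ENSG00000236649", "ENSG00000221988", "ENSG00000231618",
    "ENSG00000168477", "ENSG00000236236", "ENSG00000206258", "ENSG00000229353",
    "ENSG00000233323", "ENSG00000231608",
    "ENSG00000235396", "ENSG00000223355", "ENSG00000204301"
  ),
  stringsAsFactors = FALSE
)

# Number of distinct Ensembl identifiers per gene name (within these rows)
ids_per_gene <- tapply(
  ensembl_ids$ensembl_id,
  ensembl_ids$gene_name,
  function(x) length(unique(x))
)
print(ids_per_gene)
```

Any gene name with a count above one is a gene mapped to several Ensembl identifiers, as happens for these MHC-region genes.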
Hi @ramiromagno,
Thank you for your time.
What I mentioned is not related to different Ensembl IDs being assigned to the same gene.
A variant can have different sets of genes in its genomic_contexts and ensembl_ids segments.
Could you examine the code pasted below?
The genes of HCG23 and LOC105379657 are available in the ensembl_ids segment of the given variant although none of them is in the genomic_context segment.
library(gwasrapidd)
rs137931178 <- gwasrapidd::get_variants(variant_id = "rs137931178")
# I have checked rs137931178 as an example
unique_genes_of_rs137931178_in_genomic_context <- unique(rs137931178@genomic_contexts$gene_name)
unique_genes_of_rs137931178_in_ensembl_ids <- unique(rs137931178@ensembl_ids$gene_name)
genes_of_genomic_context_of_rs137931178_not_in_ensembl_ids_rs137931178 <- setdiff(unique_genes_of_rs137931178_in_genomic_context, unique_genes_of_rs137931178_in_ensembl_ids)
print(genes_of_genomic_context_of_rs137931178_not_in_ensembl_ids_rs137931178)
# HCG23 and LOC105379657 are available in the ensembl_ids segment although none of them is in the genomic_context segment.
Why are some genes not included in the gene group in ensembl_ids segment of the variant?
Hi @mzzclb,
I think I understand your question now, although I also think you've written the opposite of what you meant at a certain point. Please correct me if I'm wrong.
So, in principle, you can have more gene names included in the `genomic_contexts` table than in the `ensembl_ids` table, but not the other way around. In your example that is the case. You have HCG23 and LOC105379657 in `genomic_contexts` but not in `ensembl_ids`. The reverse does not happen, i.e. you don't have a gene name showing up in `ensembl_ids` that would be missing from `genomic_contexts`.
When you wrote:
> The genes of HCG23 and LOC105379657 are available in the ensembl_ids segment of the given variant although none of them is in the genomic_context segment.

I think you meant the other way around, because HCG23 and LOC105379657 are available in the `genomic_contexts` table but not in `ensembl_ids`.
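A minimal sketch of checking both directions with `setdiff()`. The two vectors are a hand-written subset of the gene names discussed in this thread, not the full result of a live query, and the variable names are chosen here for illustration.

```r
# Hand-written subset of gene names, NOT the full result of a live query.
# With gwasrapidd these would come from, e.g.:
#   v <- get_variants(variant_id = "rs137931178")
#   unique(v@genomic_contexts$gene_name)
#   unique(v@ensembl_ids$gene_name)
genes_in_genomic_contexts <- c("HCG23", "LOC105379657", "TSBP1-AS1")
genes_in_ensembl_ids <- c("TSBP1-AS1")

# Names present in genomic_contexts but lacking an Ensembl identifier
only_in_genomic_contexts <- setdiff(genes_in_genomic_contexts, genes_in_ensembl_ids)

# Names present in ensembl_ids but missing from genomic_contexts
# (expected to be empty)
only_in_ensembl_ids <- setdiff(genes_in_ensembl_ids, genes_in_genomic_contexts)

print(only_in_genomic_contexts)
print(only_in_ensembl_ids)
```

Note that `setdiff(x, y)` is not symmetric: it returns the elements of `x` that are absent from `y`, so the argument order determines which direction is being tested.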
So why is it normal to have some gene names in the `genomic_contexts` table but not in the `ensembl_ids` table? Well, like I said earlier, the `genomic_contexts` table provides all Ensembl and RefSeq genes mapping within 50kb upstream and downstream of each GWAS Catalog variant. However, only Ensembl genes have associated Ensembl identifiers. So there are RefSeq genes that either go by another name in Ensembl or do not exist there at all, and therefore do not have an associated Ensembl identifier. The two genes you report are examples of each of these cases:
- The RefSeq gene HCG23 is known as TSBP1-AS1 in Ensembl. Note that TSBP1-AS1 is present both in `genomic_contexts` and in `ensembl_ids`.
- The RefSeq gene LOC105379657 carries the kind of name the NCBI uses when a published symbol is not available and orthologs have not yet been determined; in that case the gene is given a symbol constructed as 'LOC' + the GeneID. Again, this gene name only makes sense in the context of the NCBI system, not Ensembl's, so it has no associated Ensembl identifier.
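As a rough illustration of the second case, NCBI placeholder symbols can be spotted by their 'LOC' + GeneID shape. This is only a sketch: the regular expression is an assumption about that naming pattern, and the vector holds just the three gene names mentioned above.

```r
# Gene names discussed above
gene_names <- c("HCG23", "LOC105379657", "TSBP1-AS1")

# TRUE for symbols of the form "LOC" followed only by digits (assumed pattern)
is_loc_symbol <- grepl("^LOC[0-9]+$", gene_names)

# Recover the numeric GeneID from the placeholder symbols
gene_ids <- sub("^LOC", "", gene_names[is_loc_symbol])

print(data.frame(gene_name = gene_names, is_loc_symbol = is_loc_symbol))
print(gene_ids)
```

A name flagged this way is a hint that the gene comes from the NCBI side only. It does not catch RefSeq genes such as HCG23 that simply go by another name in Ensembl.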
I hope this helps.
Thank you very much @ramiromagno
You're welcome!