discrepancies between get_variant and get_association
Closed this issue · 10 comments
I was expecting these two different search strategies to identify the same variant IDs but they don't. Do you know why they perform differently?
trait="Plasma omega-6 polyunsaturated fatty acid levels (arachidonic acid)"
efo="arachidonic acid measurement"
efo_id=NULL
Strategy 1. identify variant IDs using combination of get_studies and get_associations
gwas_studies<-gwasrapidd::get_studies(efo_trait = efo,efo_id=efo_id,reported_trait=trait)
gwas_associations<-gwasrapidd::get_associations(study_id = gwas_studies@studies$study_id)
> gwas_associations@risk_alleles$variant_id
[1] "rs2581624" "rs174545" "rs4246215" "rs174601" "rs8523"
[6] "rs174549" "rs174541" "rs174549" "rs174550" "rs12580543"
[11] "rs3811444" "rs174535" "rs2110073" "kgp12662626" "rs7258177"
[16] "rs174528" "rs174577" "kgp12035941" "rs1795851" "rs1404384"
[21] "rs16979306" "rs6133127" "rs1688589" "rs1539053" "kgp9433132"
[26] "rs1882496" "kgp6577813" "kgp2178364" "rs12714668" "rs1080261"
[31] "rs9942436" "rs2637523" "rs10839732" "rs12471016" "rs16829840"
[36] "rs274557" "rs9394931" "rs12209128" "rs174547" "rs1741"
[41] "rs102275"
Strategy 2. identify variant IDs using get_variants
gwas_variants<-gwasrapidd::get_variants(efo_trait = efo,efo_id=efo_id,reported_trait=trait)
> gwas_variants@variants$variant_id
[1] "rs1080261" "rs7258177" "rs12714668" "rs16979306" "rs174528"
[6] "rs9942436" "kgp2178364" "rs174577" "kgp12035941" "kgp6577813"
[11] "rs1795851" "rs2637523" "rs1882496" "rs1539053" "rs174545"
[16] "rs1404384" "rs6133127" "kgp12662626" "kgp9433132" "rs1688589"
[21] "rs10839732" "rs2581624" "rs274557" "rs1741" "rs12471016"
[26] "rs174547" "rs12209128" "rs9394931" "rs102275" "rs16829840"
Why are the two sets of variants different?
Hi @mightyphil2000:
The example is not complete because the definition of the variable efo_id
is missing. Could you provide that? May I suggest you use the reprex function to generate the minimal working example?
Furthermore, by searching by all those criteria simultaneously, do you intend to retrieve matches that fulfill all criteria, or that fulfill (at least) one of the criteria?
Sorry about. I've updated the original post with efo_id set to NULL (which is how I've been setting it). I'm looking for matches that fulfill at least one of the criteria.
Okay, thank you. I will take a look at it tomorrow.
Hi @mightyphil2000:
I just looked into your example, and I do not have a good explanation for why we are seeing these differences; we should be getting the same variants either way. As far as I could inspect into gwasrapidd code, there seems to be no problem with gwasrapidd itself. So I forwarded your example to the GWAS Catalog team hoping to get some clarification on this issue. I will let you know as soon as I have a reply (likely this Monday).
Hi @mightyphil2000,
I got a reply from Elliot Sollis (with the GWAS Catalog team).
Short answer
Use gwasrapidd::get_variants()
function.
Explanation
Like you, I too had the expectation that both approaches would retrieve the same results. But apparently there are historical reasons that explain these differences for a subset of the older data in the Catalog.
Here is a post of the emails we exchanged (with permission from Elliot):
My question
Hi GWAS Catalog team,
I am getting a different number of variants depending on the query route I make using the REST API.
My query is based on the reported trait, i.e., "arachidonic acid measurement". I tried two approaches, here dubbed approach 1 and 2.
Approach 1: Search for studies by "arachidonic acid measurement" (which results in GCST002712 and GCST003236), then search for the respective associations, and from those associations retrieve the variants.
Approach 2: Search for variants directly by "arachidonic acid measurement".
I find that all variants obtained with approach 2 are included in those obtained with approach 1. But there is a set of variants that are exclusive to approach 1. Why? Is this a bug?
Here are the variants that are exclusive to approach 1:
# A tibble: 11 x 7
association_id locus_id variant_id risk_allele risk_frequency genome_wide limited_list
<chr> <int> <chr> <chr> <dbl> <lgl> <lgl>
1 10079997 1 rs4246215 NA 0.36 FALSE FALSE
2 10079998 1 rs174601 NA 0.35 FALSE FALSE
3 10079999 1 rs8523 A 0.44 FALSE FALSE
4 10080000 1 rs174549 NA 0.29 FALSE FALSE
5 10080001 1 rs174541 NA 0.36 FALSE FALSE
6 10080002 1 rs174549 NA 0.29 FALSE FALSE
7 10080003 1 rs174550 NA 0.33 FALSE FALSE
8 10080004 1 rs12580543 C 0.06 FALSE FALSE
9 10080005 1 rs3811444 T 0.32 FALSE FALSE
10 10080006 1 rs174535 NA 0.33 FALSE FALSE
11 10080007 1 rs2110073 T 0.09 FALSE FALSE
Elliot's reply
Hi Ramiro,
I think I’ve figured this one out.
To explain it, I think I just need to confirm some of the terminology to make sure we’re on the same page: sorry if it seems pedantic! The reported trait is a free-text trait description that we annotate independently for each study. The term you used in your query, “arachidonic acid measurement” is what we just call the trait, which is the more standardised ontology term that is more comparable across different studies.
What’s relevant here is that for some older publications that included multiple GWAS analyses (e.g. 25500335 from 2014), technical limitations meant that we used to keep all of the analyses together in a single study entry (GCST002712) with a broad reported trait (in this case “Red blood cell fatty acid levels”), and put multiple ontology terms for each of the traits analysed in the trait field (in this case arachidonic acid measurement, fatty acid measurement, linolenic acid measurement, oleic acid measurement, docosapentaenoic acid measurement, linoleic acid measurement). So if you search for any of those trait terms, this study will come up. But be aware that not all variants in the study are associated with every one of those terms.
That’s at the level of the study, but at the level of the association we were able to be more specific, so each association is only annotated with ontology terms for the specific analysis that association came from, e.g. in that same study rs2581624-C has [arachidonic acid measurement, fatty acid measurement] but rs4246215-? has [fatty acid measurement, linoleic acid measurement]. That means that if you search by “arachidonic acid measurement” you will get rs2581624 but not rs4246215.
We now try to split every publication into separate studies for every GWAS analysis, but I’m not sure if we’ll necessarily be able to adjust all of the older studies. So I would suggest that Approach 2 is a better option if you want to make sure you only capture direct associations with a particular trait.
Hope that helps!
Best wishes,
Elliot Sollis, PhD
Scientific Curator, GWAS Catalog
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus
Hinxton
Cambridge, CB10 1SD
Thanks @ramiromagno. I see now. A problem with the get_variants function is that you can't search on reported trait (I guess because the older studies are not sufficiently annotated at the association level?). I am concerned that by only searching on EFO I might miss some associations. This could happen when the EFO is less specific than the reported trait. For example, rs4770891-? is associated with the reported trait "Circulating odd-numbered chain saturated fatty acid levels (C19:0)" but with EFO "fatty acid measurement" in study GCST005862. If I was specifically interested in the saturated fatty acid C19:0, then searching on EFO would not identify this association, whereas searching for studies on reported trait would. In other words, you could argue that search strategy 1 (search for studies by reported trait then retrieve associations) is more sensitive (but suffers more false positives), while search strategy 2 (search for variant-EFO associations) is more specific (but suffers more false negatives). What do you think?
Hi @mightyphil2000:
I have divided this answer into three points.
Point 1
A problem with the get_variants function is that you can’t search on
reported trait (I guess because the older studies are not sufficiently
annotated at the association level?). I am concerned that by only
searching on EFO I might miss some associations.
I am a bit confused now. You say you cannot search on report_trait
with the get_variants()
function, but you can, right?
my_variants_by_rep_trait <- get_variants(reported_trait = "Circulating odd-numbered chain saturated fatty acid levels (C19:0)")
my_variants_by_efo_trait <- get_variants(efo_trait = "fatty acid measurement")
my_variants_by_rep_trait@variants
## # A tibble: 15 x 7
## variant_id merged functional_class chromosome_name chromosome_position
## <chr> <int> <chr> <chr> <int>
## 1 rs1468510 0 3_prime_UTR_variant 7 151360599
## 2 rs11842790 0 intron_variant 13 50350312
## 3 rs13226693 0 intron_variant 7 151353528
## 4 rs17074145 0 intron_variant 13 50355253
## 5 rs17363566 0 intron_variant 13 50360756
## 6 rs706603 0 intron_variant 13 50280887
## 7 rs12871645 0 intron_variant 13 50357429
## 8 rs12703118 0 intron_variant 7 151345197
## 9 rs12853498 0 intron_variant 13 50339738
## 10 rs17074143 0 intron_variant 13 50348637
## 11 rs13221923 0 intron_variant 7 151357494
## 12 rs17074093 0 intron_variant 13 50314567
## 13 rs12874827 0 intron_variant 13 50326600
## 14 rs12874278 0 intron_variant 13 50361786
## 15 rs4770891 0 intron_variant 13 25983588
## # … with 2 more variables: chromosome_region <chr>, last_update_date <dttm>
my_variants_by_efo_trait@variants
## # A tibble: 156 x 7
## variant_id merged functional_class chromosome_name chromosome_posit…
## <chr> <int> <chr> <chr> <int>
## 1 rs1604805 0 intron_variant 4 21569464
## 2 rs7198595 0 intron_variant 16 12471050
## 3 rs12098564 0 non_coding_transcript_ex… 10 85193571
## 4 rs11842790 0 intron_variant 13 50350312
## 5 rs788076 0 intergenic_variant 10 29047920
## 6 rs742614 0 intergenic_variant 20 33894826
## 7 rs10518201 0 regulatory_region_variant 4 78730706
## 8 rs12871645 0 intron_variant 13 50357429
## 9 rs7534537 0 intron_variant 1 204305391
## 10 rs2118674 0 intron_variant 2 170462384
## # … with 146 more rows, and 2 more variables: chromosome_region <chr>,
## # last_update_date <dttm>
Point 2
This could happen when the EFO is less specific than the reported
trait. For example, rs4770891-? is associated with the reported trait
“Circulating odd-numbered chain saturated fatty acid levels (C19:0)”
but with EFO “fatty acid measurement” in study GCST005862.
I did not quite understand your “but” clause. Regardless, note that the
variant you mentioned, i.e., rs4770891
is present in both sets of
variants above:
'rs4770891' %in% my_variants_by_rep_trait@variants$variant_id
## [1] TRUE
'rs4770891' %in% my_variants_by_efo_trait@variants$variant_id
## [1] TRUE
Point 3
If I was specifically interested in the saturated fatty acid C19:0,
then searching on EFO would not identify this association, whereas
searching for studies on reported trait would. In other words, you
could argue that search strategy 1 (search for studies by reported
trait then retrieve associations) is more sensitive (but suffers more
false positives), while search strategy 2 (search for variant-EFO
associations) is more specific (but suffers more false negatives).
What do you think?
I could not follow your reasoning because I did not observe the
differences (as shown above) you are referring to for this specific
example… So perhaps you could illustrate your concerns with some code?
Sorry I confused the get_variant with get_association function. The get_association function does not allow one to search on reported trait. I should have said "and" instead of "but" in point 2. I'll try to put some code together to illustrate what I mean in point 3. Thanks!
Actually I think this specific issue is now resolved. I can see that in this particular example get_variants is the best option. The new problem I was trying to explain is actually to do with the get_associations function. I'm therefore going to post this issue on the get associations issue page I opened the other day.
Okay, I will close this issue then.