release prep
brentp opened this issue ยท 46 comments
It's time to make a new release this issue will track what needs to be done. Please add any essential (and only essential) updates as comments to this issue:
- update clinvar
- update CADD to v1.4
- update dbsnp
- update gnomad and controls and non-neuro control AFs (gnomad 2.1)
- update dgidb url
- for x-linked recessive exclude sites where the male is het and parents are hom ref. (see #903)
- #843 (don't report x-linked DN's when one parent is missing).
- #902 update gene_summary and gene_detailed
When you update gnomad, will you include the subcohorts (i.e. frequencies from the controls-only and non-neuro cohorts)?
I updated the list to include those items. thanks for the reminders
Taking de novos into account in the comp_het inheritance model would be nice!
I think it would be great to add support for de novos when reporting candidate CHs. The challenge is that one cannot be confident that it causes a true CH without read-backed phasing in the common case of trios. As such, this would have to either be part of the relaxed priority CH searches or a specific mode.
Yeah, the variant could be on the same allele as the inherited one and thus not be a real comp_het, but perhaps a new lenient mode, or a flag for these specific cases would help. Also, this applies for diploid organisms (human specifically, in my case), for other ploidies, things could get ugly fast.
Agree it'd be nice. I've used this hack to get at the results:
Run comp_hets --max-priority 2
which will give you all potential CH pairs including those that are de_novo + 2nd allele. This yields a lot of false positives, but you can winnow down the results by cross-referencing the gene list with the results of de_novo
and looking for overlapping genes.
I'll see what I can do about comp-het + DN. Though it can't be phased, it could be a special category.
Looks like dgidb-related functionality only requires a URL update to get it working again.
Sure, just need to get my laptop to look it up.
Ok, so you'd use:
dgidb_url = 'http://www.dgidb.org/api/v2/interactions.json?genes='
to re-enable things, at least, that's the minimal change.
What I'd prefer though would be if URLs (the dgidb, the install-data url, others?) became part of the config file. This would give, e.g., Galaxy Admins a single place to locate these URLs and update them if they break at some point. I realize that is a slightly bigger change, but if you have a little bit of time before the release, I think it would help a lot with long-term support of any gemini install.
@bgruening we've been talking about this before. The above line is the patch to
Line 28 in 8a3e757
that you'd have to apply to gemini on usegalaxy.eu to make the actionable mutations and the --dgidb option of the query tool work again.
great. thank you!
so the dgidb stuff is getting used? If so, I can keep it as long as it's this simple URL change to fix.
@bgruening it's working ๐
@brentp yeah, we have an ongoing project with clinicians, for whom being able to get at gene-drug interactions is a major selling point. So I'd rather like to see more such functionality than less.
When you update gnomad, will you include the subcohorts (i.e. frequencies from the controls-only and non-neuro cohorts)?
@jxchong are these the only additions from gnomad? I'll put a list of what I plan to add for final verification, but an initial set would be helpful.
Any possibility of updating CADD to the new v1.4?
I have added the CADD update to the list.
All, I have implemented the comp-het + denovo (CHDN) in a specific way and I'm looking for problems I might not have foreseen. Currently, the highest priority is 1. I have made a "good" CHDN candidate have a priority of 1.5 so one can filter with that (I will change the gemini parameter to accept floats).
A good candidate for kid, mom, dad would be:
# het inherited from mom:
A/T, A/T, A/A
# DN
C/T, C/C, C/C
A somewhat arbitrary requirement that I have added is that the DN HET can not occur in any unaffected sample in the family. So, if we add an unaffected sib to the above trio so it's
kid, mom, dad, sib and the sib also has the de novo:
# het inherited from mom only in proband:
A/T, A/T, A/A, A/A
# DN in both kids
C/T, C/C, C/C, C/T
this is not reported as a candidate since an unaffected (sib) shares the DN. This could miss cases where it was a germ-line mosaic, but I think it will also remove a lot of false-positives.
Any thoughts on this approach? I want to avoid extra flags and arguments and just provide a 95% solution here.
Another possible addition is gnomad exomes update to 2.1:
http://gnomad.broadinstitute.org/downloads
Update -- never mind, just saw you already mentioned gnomad exomes in your first list.
@oleraj that's part of the gnomad item above. I'm testing that now. The file is much larger so I'm trying to cull it a bit.
Will this be a separate genetic model, or will it be part of the existing comp_hets? The approach seems fine to me for now (others should weigh in too).
it will be part of comp-het and not require any change (other than bumping the max-priority to > 1.5
@brentp Approach seems fine to me as well. If you are adding CHDN, it'd be nice to add simple multi-gen support to the de_novo model as requested in #885 (i.e. de novo in generation 1, and passed down as autosomal dominant in child in generation 2) as de novo->AD is more common than CHDN. I think it'd be relatively simple to add without requiring a ton of flags because I think you could simply change the requirements such that in a given family, the variant must be de novo at least once (two unaffected parents who are not carriers and their carrier child) while allowing for affected offspring of that child to be het.
Is anyone against removing ESP? Given it's size relative to newer resources, it would drop a few columns...
Hm, we'd like to see ESP still be included in max_aaf_all (not all ESP samples can be incorporated into gnomAD and if something is relatively common in ESP because of those samples, that's useful info) but don't need to actually output the database-specific frequency.
I wasn't aware of the cancer samples in gnomad. Just read about this in a previous faq here. Yes, probably would be good to include the non-cancer control set.
ok. I have added that. I am building (and re-building) dbsnp and CADD.
Once those are done, my planned development is done. I will not update the gene_detailed and gene_summary stuff.
If there are any other high-priority issues, please let me know in the next couple of days.
I may be able to tackle the gene_summary and gene_detailed table updates, but I will only be able to do it during next weekend.
@pfpjs that would be great! I will see how things look and may make a release before then and make a point release to incorporate your updates and any changes needed after everyone gets to try the new version.
I have updated the inheritance module so anyone interested can test the comp-het + de novo and the transmitted de novo stuff by doing: pip install -U inheritance
to get v0.1.5 before we make the gemini release.
Some issues from my wish list for potential inclusion:
As you're re-building CADD, is there any way you could include the CADD indels too, preferably merged with the SNPs? #833
Related to dbNSFP, would you be able to address #886 ? Basically, I think we can either modify entries in dbNSFP to convert =
to something else while re-building the database for gemini, or just remove this line from geneimpacts/effect.py
assert not "=" in effect_string
We ran into a few errors in running tests last time we tried to install from master, documented in #901, in case there are any quick fixes you might notice. Some of them seem pretty trivial, e.g., gemini/test/unicode.vcf.gz was missing -- probably inadvertently deleted from the repository at some point.
We also had a discussion in #663 around prioritizing canonical transcripts, which seems to be implemented in geneimpacts>=0.3.4. If there is a simple way to incorporate the latest version of geneimpacts and maybe add a --prioritize_canonical flag to gemini load, that would be helpful.
Oh! What about including CCRs as an annotation? :)
Agree; CCRs are in the larger plan, but will leave it to Brent if it will make it into this release.
I've managed to fix and update the generation of detailed_gene_table
and summary_gene_table
, can anyone else take a second look? Thanks!
Awesome! Paulo, thanks so much. I will have a look on Tuesday, then I think we are ready for the next release.
I think we'll let the dust settle from this release and then add CCRs if there is interest. It's simple to add CCRs via gemini annotate (or using the new loader) but I get that it would be convenient to have it as one of the install datasets.
Late to the party but I look forward to the new release!
Will some aspects of the update (non database-related) also remain functional with the VCFAnno + VCF2DB pipeline? Specifically thinking about gene_summary and gene_detailed, and if there's a way to leverage these improvements within my current BCFTools (soft-filter), SNPeff, VCFAnno, VCF2DB pipeline.
Or do you now suggest switching back to the gemini load + gemini annotate to get extra features (e.g. in-house database).
Thanks!
Phil
@Phillip-a-richmond you should continue using the new loader. this will be the last major release using the old loader. If there are features missing when using the new loader, then PR's would be great.
dgidb seems to be broken again. perhaps it is an intermittent server error but I don't plan to fix it. I will rely on those that use dgidb to keep that functioning.