arq5x/gemini

release prep

brentp opened this issue ยท 46 comments

It's time to make a new release this issue will track what needs to be done. Please add any essential (and only essential) updates as comments to this issue:

  • update clinvar
  • update CADD to v1.4
  • update dbsnp
  • update gnomad and controls and non-neuro control AFs (gnomad 2.1)
  • update dgidb url
  • for x-linked recessive exclude sites where the male is het and parents are hom ref. (see #903)
  • #843 (don't report x-linked DN's when one parent is missing).
  • #902 update gene_summary and gene_detailed

presumably issue #903 is included as well as issue #843

When you update gnomad, will you include the subcohorts (i.e. frequencies from the controls-only and non-neuro cohorts)?

I updated the list to include those items. thanks for the reminders

pfpjs commented

Taking de novos into account in the comp_het inheritance model would be nice!

arq5x commented

I think it would be great to add support for de novos when reporting candidate CHs. The challenge is that one cannot be confident that it causes a true CH without read-backed phasing in the common case of trios. As such, this would have to either be part of the relaxed priority CH searches or a specific mode.

pfpjs commented

Yeah, the variant could be on the same allele as the inherited one and thus not be a real comp_het, but perhaps a new lenient mode, or a flag for these specific cases would help. Also, this applies for diploid organisms (human specifically, in my case), for other ploidies, things could get ugly fast.

Agree it'd be nice. I've used this hack to get at the results:

Run comp_hets --max-priority 2 which will give you all potential CH pairs including those that are de_novo + 2nd allele. This yields a lot of false positives, but you can winnow down the results by cross-referencing the gene list with the results of de_novo and looking for overlapping genes.

pfpjs commented

@jxchong, thank you for that hint!

I'll see what I can do about comp-het + DN. Though it can't be phased, it could be a special category.

@brentp #911 and #910 are two regressions from the 0.20 version.

wm75 commented

Looks like dgidb-related functionality only requires a URL update to get it working again.

@wm75 can you make a PR or let me know the URL? I'm not familiar with dgidb.

wm75 commented

Sure, just need to get my laptop to look it up.

wm75 commented

Ok, so you'd use:

dgidb_url = 'http://www.dgidb.org/api/v2/interactions.json?genes='

to re-enable things, at least, that's the minimal change.

What I'd prefer though would be if URLs (the dgidb, the install-data url, others?) became part of the config file. This would give, e.g., Galaxy Admins a single place to locate these URLs and update them if they break at some point. I realize that is a slightly bigger change, but if you have a little bit of time before the release, I think it would help a lot with long-term support of any gemini install.

wm75 commented

@bgruening we've been talking about this before. The above line is the patch to

dgidb_url = 'http://dgidb.genome.wustl.edu/api/v1/interactions.json?genes='

that you'd have to apply to gemini on usegalaxy.eu to make the actionable mutations and the --dgidb option of the query tool work again.

great. thank you!
so the dgidb stuff is getting used? If so, I can keep it as long as it's this simple URL change to fix.

@wm75 fixed on our server.

wm75 commented

@bgruening it's working ๐Ÿ˜„

@brentp yeah, we have an ongoing project with clinicians, for whom being able to get at gene-drug interactions is a major selling point. So I'd rather like to see more such functionality than less.

When you update gnomad, will you include the subcohorts (i.e. frequencies from the controls-only and non-neuro cohorts)?

@jxchong are these the only additions from gnomad? I'll put a list of what I plan to add for final verification, but an initial set would be helpful.

pfpjs commented

Any possibility of updating CADD to the new v1.4?

I have added the CADD update to the list.

All, I have implemented the comp-het + denovo (CHDN) in a specific way and I'm looking for problems I might not have foreseen. Currently, the highest priority is 1. I have made a "good" CHDN candidate have a priority of 1.5 so one can filter with that (I will change the gemini parameter to accept floats).

A good candidate for kid, mom, dad would be:

# het inherited from mom:
A/T, A/T, A/A
# DN
C/T, C/C, C/C

A somewhat arbitrary requirement that I have added is that the DN HET can not occur in any unaffected sample in the family. So, if we add an unaffected sib to the above trio so it's
kid, mom, dad, sib and the sib also has the de novo:

# het inherited from mom only in proband:
A/T, A/T, A/A, A/A
# DN in both kids
C/T, C/C, C/C, C/T 

this is not reported as a candidate since an unaffected (sib) shares the DN. This could miss cases where it was a germ-line mosaic, but I think it will also remove a lot of false-positives.

Any thoughts on this approach? I want to avoid extra flags and arguments and just provide a 95% solution here.

Another possible addition is gnomad exomes update to 2.1:
http://gnomad.broadinstitute.org/downloads

Update -- never mind, just saw you already mentioned gnomad exomes in your first list.

@oleraj that's part of the gnomad item above. I'm testing that now. The file is much larger so I'm trying to cull it a bit.

Will this be a separate genetic model, or will it be part of the existing comp_hets? The approach seems fine to me for now (others should weigh in too).

it will be part of comp-het and not require any change (other than bumping the max-priority to > 1.5

@brentp Approach seems fine to me as well. If you are adding CHDN, it'd be nice to add simple multi-gen support to the de_novo model as requested in #885 (i.e. de novo in generation 1, and passed down as autosomal dominant in child in generation 2) as de novo->AD is more common than CHDN. I think it'd be relatively simple to add without requiring a ton of flags because I think you could simply change the requirements such that in a given family, the variant must be de novo at least once (two unaffected parents who are not carriers and their carrier child) while allowing for affected offspring of that child to be het.

@jxchong would you have a look at: 513f53f ? and verify it meets your needs for the non-neuro and controls AFs? I just took those 2 fields from the VCF and add them to the database.

@brentp commit 513f53f looks good. Do you think others might have a use for the non-cancer control set too? Perhaps yes, especially when analyzing mosaic variants? @oleraj

Is anyone against removing ESP? Given it's size relative to newer resources, it would drop a few columns...

Hm, we'd like to see ESP still be included in max_aaf_all (not all ESP samples can be incorporated into gnomAD and if something is relatively common in ESP because of those samples, that's useful info) but don't need to actually output the database-specific frequency.

I wasn't aware of the cancer samples in gnomad. Just read about this in a previous faq here. Yes, probably would be good to include the non-cancer control set.

ok. I have added that. I am building (and re-building) dbsnp and CADD.
Once those are done, my planned development is done. I will not update the gene_detailed and gene_summary stuff.
If there are any other high-priority issues, please let me know in the next couple of days.

pfpjs commented

I may be able to tackle the gene_summary and gene_detailed table updates, but I will only be able to do it during next weekend.

@pfpjs that would be great! I will see how things look and may make a release before then and make a point release to incorporate your updates and any changes needed after everyone gets to try the new version.

I have updated the inheritance module so anyone interested can test the comp-het + de novo and the transmitted de novo stuff by doing: pip install -U inheritance to get v0.1.5 before we make the gemini release.

Some issues from my wish list for potential inclusion:

As you're re-building CADD, is there any way you could include the CADD indels too, preferably merged with the SNPs? #833

Related to dbNSFP, would you be able to address #886 ? Basically, I think we can either modify entries in dbNSFP to convert = to something else while re-building the database for gemini, or just remove this line from geneimpacts/effect.py

assert not "=" in effect_string

We ran into a few errors in running tests last time we tried to install from master, documented in #901, in case there are any quick fixes you might notice. Some of them seem pretty trivial, e.g., gemini/test/unicode.vcf.gz was missing -- probably inadvertently deleted from the repository at some point.

We also had a discussion in #663 around prioritizing canonical transcripts, which seems to be implemented in geneimpacts>=0.3.4. If there is a simple way to incorporate the latest version of geneimpacts and maybe add a --prioritize_canonical flag to gemini load, that would be helpful.

@oleraj I have responded on a few of those issues and #901 is fixed.
I will incorporate the latest version of geneimpacts when making the release. For others, I'd prefer those that need that functionality to contribute as there are relatively simple work-arounds for those.

Great, thanks @brentp.

Oh! What about including CCRs as an annotation? :)

arq5x commented

Agree; CCRs are in the larger plan, but will leave it to Brent if it will make it into this release.

pfpjs commented

I've managed to fix and update the generation of detailed_gene_table and summary_gene_table, can anyone else take a second look? Thanks!

Awesome! Paulo, thanks so much. I will have a look on Tuesday, then I think we are ready for the next release.

I think we'll let the dust settle from this release and then add CCRs if there is interest. It's simple to add CCRs via gemini annotate (or using the new loader) but I get that it would be convenient to have it as one of the install datasets.

Late to the party but I look forward to the new release!

Will some aspects of the update (non database-related) also remain functional with the VCFAnno + VCF2DB pipeline? Specifically thinking about gene_summary and gene_detailed, and if there's a way to leverage these improvements within my current BCFTools (soft-filter), SNPeff, VCFAnno, VCF2DB pipeline.

Or do you now suggest switching back to the gemini load + gemini annotate to get extra features (e.g. in-house database).

Thanks!
Phil

@Phillip-a-richmond you should continue using the new loader. this will be the last major release using the old loader. If there are features missing when using the new loader, then PR's would be great.

dgidb seems to be broken again. perhaps it is an intermittent server error but I don't plan to fix it. I will rely on those that use dgidb to keep that functioning.