bystrogenomics/bystro

Sprint 13 Task List

cristinaetrv opened this issue · 1 comments

Proteomics:
Goal: Wrap up proteomics methods

  • Run data cleaning analysis with a preprocessing step that separates SomaScan and TMT; compare the effect of domain adaptation before combining vs. on the combined data @akotlar 6/11/2024
  • Filtering needs to be generalized to SomaScan @akotlar 6/14/2024
  • Harmonize SomaScan/TMT datasets - latent variable model with two sets of covariates, do imputation on each, harmonization minimizing the discrepancy @austinTalbot7241993 6/14/2024
  • Demonstrate network analysis on ~300 sample dataset @akotlar 6/14/2024

GIN 6/17 work:

PRS

  • PR Dave's version of PRS @akotlar 6/11/2024
  • Ask Thomas about the genotyping that Emory is ingesting on Emory samples: Illumina 650K Array (most recent, cheapest array), which will require imputation (TOPMed, etc.) @cristinaetrv 6/12/2024
  • (backlog) Complete CITI training (Alex & Cristina) and email Paula/Petek (Cristina) for access to the CHOP data - @cristinaetrv 6/21/2024
  • Test imputation method for PRS @austinTalbot7241993 - 6/27/2024
  • Compare our imputation to Minimac4 @austinTalbot7241993 - 6/21/2024
  • (stretch) Write imputation method in C @austinTalbot7241993 - 6/27/2024
  • Take in ancestry PCs as PRS-CS covariates - @akotlar - 6/27/2024
  • Take in GWAS summary statistics as PRS-CS covariates - @austinTalbot7241993 - 6/27/2024
  • Finish v1 PRS integration - 2024-06-13 - @akotlar
  • Display basic PRS results in webapp (table with individuals and their score) - @akotlar 6/17/2024
  • Document design choices for PRS allele frequency weighting - @cristinaetrv - 6/14/2024
  • Weight PRS scores by gnomAD allele frequencies for specific ancestries and the corresponding ancestry probability - @cristinaetrv 6/12/2024
  • Take in top hit from ancestry, convert to superpop (for allele freq only), connect to LD map for corresponding pop for LD clump - @cristinaetrv 6/14/2024
  • Take in 5 gnomAD superpop AFs in chunks (100k or less) of thresholded score loci, converted to query format using the query library on the annotation for the target dataset - @cristinaetrv 6/21/2024
  • Research remaining LD maps - @cristinaetrv 6/26/2024
  • Add remaining LD maps if they're easy to find - @cristinaetrv 6/26/2024
  • Liftover LD maps if they're easy to find - @cristinaetrv 6/28/2024
  • Get harmonized AD summary stats sanitized - @cristinaetrv 6/28/2024
  • (stretch) Liftover harmonized AD summary stats @cristinaetrv
  • Fix clump by pval - @cristinaetrv 6/28/2024
  • v2 PRS integration - 2024-06-28 - @akotlar - scope needs to be defined, but minimally need to allow uploading covariates, and similar perf to Dave's work at least
  • (sprint 14) Follow up with gates about GWAS summary statistics and what we can include with our platform
  • (sprint 14) Experiment management is back in and integrated with search / APIs so that we can pull covariates/traits
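
A minimal sketch of the allele-frequency items above, assuming the weighting means a per-sample mixture of superpopulation AFs, with each superpopulation's AF weighted by that sample's ancestry probability. The function names, array layouts, and the `chunked` helper are hypothetical and are not taken from the Bystro codebase; how the mixed AF then enters the PRS is left to the design document.

```python
import numpy as np


def ancestry_weighted_af(superpop_af, ancestry_prob):
    """Mix superpopulation allele frequencies by ancestry probability.

    superpop_af:   (n_loci, n_pops) allele frequency per locus and superpop.
    ancestry_prob: (n_samples, n_pops) ancestry probabilities, rows sum to 1.
    Returns an (n_samples, n_loci) array of expected allele frequencies.
    """
    superpop_af = np.asarray(superpop_af, dtype=float)
    ancestry_prob = np.asarray(ancestry_prob, dtype=float)
    if superpop_af.shape[1] != ancestry_prob.shape[1]:
        raise ValueError("superpop_af and ancestry_prob disagree on n_pops")
    if not np.allclose(ancestry_prob.sum(axis=1), 1.0):
        raise ValueError("each row of ancestry_prob must sum to 1")
    # Expected AF for sample i at locus j: sum_k P(pop k | i) * AF[j, k]
    return ancestry_prob @ superpop_af.T


def chunked(items, max_size=100_000):
    """Yield consecutive slices of `items` with at most `max_size` elements."""
    for start in range(0, len(items), max_size):
        yield items[start:start + max_size]
```

Processing loci through `chunked` keeps each AF lookup at or below the 100k limit noted above; the per-chunk results can be concatenated along the locus axis.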

Covariance Matrix Estimation/ML library
Goal: Hand off POE method to Mike by end of sprint

  • Make more computational and alternative hypothesis tests for Ilha to benchmark @austinTalbot7241993 6/27/2024
  • Updates to loss functions - @IlhaH 6/27/2024
  • Computational benchmarking (compared to POIROT) - @IlhaH 6/27/2024

Platform

  • Per-sample data management v1 - 2024-06-28 - @akotlar
  • Basic LLM demo - 2024-06-28 - @akotlar
  • (stretch) Bystro Annotator AMI is fully restartable

Documentation

  • Separate out annotator description/Perl side, including performance figures; describe every piece the repo has, including a Machine Learning subsection and a Bioinformatics tools subsection (installation first) - 6/27/2024
  • GIF of how you would use general purpose ML library - 6/27/2024

6/12/2024

Proteomics:

  • Alex met with Erik Dammer, and Erik will send more information about which files we should be analyzing
  • Erik hadn't normalized within batch in the dataset Alex had been using, because they were comparing tissue types and looking at total abundance numbers; Erik will provide the name of the dataset that was used for network analysis. Instead, the two data types (SomaScan and TMT) were treated as 'batches', so the data are normalized by platform.
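
A rough sketch of what normalizing by platform could look like, assuming a samples-by-proteins abundance matrix and a per-sample platform label. Z-scoring is used here only as a stand-in, since the notes do not say which normalization Erik applied, and `normalize_by_platform` is a hypothetical name.

```python
import numpy as np


def normalize_by_platform(abundance, platform):
    """Z-score each protein (column) separately within each platform.

    abundance: (n_samples, n_proteins) array; NaN marks missing values.
    platform:  (n_samples,) array of labels, e.g. "somascan" or "tmt".
    """
    abundance = np.asarray(abundance, dtype=float)
    platform = np.asarray(platform)
    normalized = np.empty_like(abundance)
    for label in np.unique(platform):
        rows = platform == label
        block = abundance[rows]
        mean = np.nanmean(block, axis=0)
        std = np.nanstd(block, axis=0)
        std[std == 0] = 1.0  # constant proteins: avoid division by zero
        normalized[rows] = (block - mean) / std
    return normalized
```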

POE:

  • The test is anti-conservative, but we can use a bootstrap approach to get the coefficient estimates and see which ones have a POE
  • Getting benchmarks on speed
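
To make the bootstrap idea concrete, here is a minimal sketch that resamples individuals with replacement, refits a model, and reports percentile intervals for each coefficient. Ordinary least squares is used purely as a stand-in for the POE model, which is not reproduced here; `bootstrap_coefficients` and its parameters are hypothetical.

```python
import numpy as np


def bootstrap_coefficients(X, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap intervals for least-squares coefficients.

    Returns (mean_estimate, lower, upper, excludes_zero); index 0 of each
    array is the intercept, followed by one entry per column of X.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    estimates = np.empty((n_boot, design.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        estimates[b] = np.linalg.lstsq(design[idx], y[idx], rcond=None)[0]
    lower = np.quantile(estimates, alpha / 2, axis=0)
    upper = np.quantile(estimates, 1 - alpha / 2, axis=0)
    excludes_zero = (lower > 0) | (upper < 0)
    return estimates.mean(axis=0), lower, upper, excludes_zero
```

Coefficients whose interval excludes zero would be the candidates to inspect for a POE; because the intervals come from resampling rather than the anti-conservative test, they should be read as a cross-check, not a replacement.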