In general, the Troyanskaya lab works on functional genomics: the process of analyzing and integrating large-scale experimental datasets in order to understand gene function, the interactions between genes and proteins as they carry out cellular tasks, and the systems-level extensions of these relationships. A basic example might be function assignment, the task of assigning a biological role to an uncharacterized gene/protein (e.g. "We predict that the gene XYZ1 participates in the process of amino acid synthesis.")
More sophisticated functional analyses often involve networks of functional relationships, graphs in which nodes represent genes and edges represent some type of interactions between these genes. These interactions might be physical (direct binding, complexing, protein modifications), genetic (synthetic lethality), regulatory (transcription factors and their targets), or even just indicate participation in the same general pathways (e.g. two enzymes in different parts of the glycolysis pathway).
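As a rough illustration of the graph representation described above, the following Python sketch stores a functional network as an adjacency map. The class name `FunctionalNetwork`, the gene names, and the confidence values are all hypothetical; this is not the data structure used by any lab tool.

```python
from collections import defaultdict


class FunctionalNetwork:
    """Undirected gene network; each edge has an interaction type and a confidence."""

    def __init__(self):
        # gene -> {neighbor gene -> (interaction_type, confidence)}
        self._adj = defaultdict(dict)

    def add_edge(self, gene_a, gene_b, interaction_type, confidence):
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        # Store the edge in both directions because the network is undirected.
        self._adj[gene_a][gene_b] = (interaction_type, confidence)
        self._adj[gene_b][gene_a] = (interaction_type, confidence)

    def neighbors(self, gene, min_confidence=0.0):
        """Return (neighbor, interaction_type, confidence), strongest first."""
        hits = [
            (other, kind, conf)
            for other, (kind, conf) in self._adj.get(gene, {}).items()
            if conf >= min_confidence
        ]
        return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Hypothetical genes and made-up confidences, for illustration only.
network = FunctionalNetwork()
network.add_edge("XYZ1", "XYZ2", "physical", 0.9)
network.add_edge("XYZ1", "XYZ3", "same pathway", 0.4)
print(network.neighbors("XYZ1", min_confidence=0.5))
```

Keeping the interaction type on the edge lets one structure hold physical, genetic, regulatory, and pathway-level relationships side by side.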
Part of what allows us to make these fairly high-level predictions is integrating very large amounts of data. Even a single genome-scale dataset such as a microarray can provide a surprising amount of functional information, but high-throughput data tends to be noisy and difficult to interpret in isolation. By mining dozens or hundreds of datasets simultaneously, though, we're able to more easily determine which signals represent real biology and which are just noise. This integration requires both the ability to collect and normalize very large amounts of data and the ability to generate "gold standards" of known functionally related genes. Both aspects of this process - data analysis and gold standard generation - have their own collections of subtleties and pitfalls.
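To make the role of the gold standard concrete, here is a minimal sketch of one common way to combine datasets: a naive Bayes classifier over binary evidence, which assumes the datasets are conditionally independent. The function names, gene names, and toy data are assumptions made for illustration; this is not the lab's pipeline or the Sleipnir API.

```python
import math


def pair(gene_a, gene_b):
    """Order-independent key for a gene pair."""
    return frozenset((gene_a, gene_b))


def learn_log_ratios(dataset, positives, negatives, pseudocount=1.0):
    """Log-likelihood ratios for a pair being present / absent in one dataset.

    dataset: set of gene pairs the dataset reports as linked
    positives / negatives: gold-standard related / unrelated pairs
    """
    pos_hits = sum(1 for p in positives if p in dataset)
    neg_hits = sum(1 for p in negatives if p in dataset)
    # Smoothing keeps the ratios finite when the gold standard is small.
    p_hit_pos = (pos_hits + pseudocount) / (len(positives) + 2 * pseudocount)
    p_hit_neg = (neg_hits + pseudocount) / (len(negatives) + 2 * pseudocount)
    return {
        True: math.log(p_hit_pos / p_hit_neg),
        False: math.log((1 - p_hit_pos) / (1 - p_hit_neg)),
    }


def posterior(query, datasets, ratios, prior=0.5):
    """Naive Bayes posterior probability that `query` is functionally related."""
    log_odds = math.log(prior / (1 - prior))
    for dataset, ratio in zip(datasets, ratios):
        log_odds += ratio[query in dataset]
    return 1 / (1 + math.exp(-log_odds))


# Toy gold standard and datasets (hypothetical gene names).
positives = [pair("A", "B"), pair("C", "D")]
negatives = [pair("A", "C"), pair("B", "D")]
informative = {pair("A", "B"), pair("C", "D"), pair("E", "F")}
noisy = {pair("A", "B"), pair("A", "C"), pair("E", "F")}

datasets = [informative, noisy]
ratios = [learn_log_ratios(d, positives, negatives) for d in datasets]
print(round(posterior(pair("E", "F"), datasets, ratios), 2))  # 0.75
```

In this toy case the noisy dataset links gold-standard positives and negatives equally often, so its log ratios are zero and it contributes nothing to the prediction. Only the informative dataset moves the posterior away from the prior.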
To get a flavor for these tasks, some good papers to start with include:
- Understanding multicellular function and disease with human tissue-specific networks, Nature Genetics
- Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder, Nature Neuroscience
- Predicting effects of noncoding variants with deep learning–based sequence model, Nature Methods
- Targeted exploration and analysis of large cross-platform human transcriptomic compendia, Nature Methods
Additionally, the introduction section of Curtis Huttenhower's Ph.D. thesis is an excellent (and thorough) primer on functional genomics and related algorithms.
Our weekly lab meeting consists of:
- A slightly more formal one-hour presentation (typically with slides, a general project intro, etc.) with group discussion. The rotation schedule is pinned to #functional on Slack.
- A 60-minute mini-group meeting where lab members can get feedback on projects (typically quick updates on project progress, challenges, etc.). See below for guidelines.
The lab meeting location alternates between Princeton and Flatiron (NYC). Meeting times and locations will be updated on the FunctionLab Google Calendar.
- Before group meeting, all trainees (grad students & post-docs) will post no more than 4 bullet points on Slack describing what they’ve worked on since the last meeting.
  - No paragraphs, no more than 4 bullets!
  - Feedback is not expected on these posts, but if you have something to mention, feel free to reply in a thread directly to the post.
- Mini-group time is only for items you need feedback on from the group. Please avoid lengthy lists of things you did that week.
  - A typical update should only take a few minutes.
  - If you need more group feedback, time will be capped at 10 minutes per person.
  - If you only need Olga’s feedback, make sure you have a 1-on-1 meeting scheduled.
- Group meeting will be capped at 2 hours. Lab members who didn’t get a chance to speak will be able to do so the following week.
Journal clubs / primers are scheduled separately. Please check the calendar to see when the next one is scheduled.
These meetings are a chance for people to present on either a recent paper or a topic of interest, either biological or technical (e.g., single-cell sequencing, kidney disease, deep learning, coding standards, new packages that may be of use to others).
Everyone is expected to lead discussion at some point, so please check with one of the contacts about when you can present.
Contacts: Aviya, Ksenia, Tess and Yun
These 15-minute individual meetings are typically on the same day as lab meeting. Usually, Princeton folks have their meetings when lab meeting is at Flatiron and vice versa for Simons folks. Meeting details will be updated on the FunctionLab Google Calendar.
Princeton genomics machine information can be found on the FunctionLab GitHub wiki.
Flatiron/Simons server information can be found on the internal wiki.
New members should create accounts and be added to the FunctionLab group for the following services:
Contacts: Alicja, Rachel, Natalie
If you don't have a Genomics (LSI) account (typically if you are from CS), you will need to get one if you plan to use genomics resources. Fill out this form:
https://frevvo-prod.princeton.edu/frevvo/web/tn/pu.nplc/u/fa933f95-a5f1-4acc-988a-389be9c87b5d/app/_MS6fUDoEEemXbYTKx6ZLAw/formtype/_EyExYK-zEemF2ZSr_5kKrQ/popupform
You don't need to do it again if you already have an account from a past rotation. This account will give you access to the LSI genomics cluster (argo).
If you need access to functionlab servers, then you need to be added separately.
If you have questions regarding the form or if you need access to any functionlab servers, please contact Alicja (Slack: atadych).
We have a mailing list for lab communication. To subscribe, visit this site and fill in your name and email address with a password (note that the password is sometimes sent out as a plaintext email reminder, so don't reuse a password that protects anything important).
The Troyanskaya lab makes extensive use of the following technologies. New lab members should familiarize themselves with the tools that are most relevant to their projects.
- Programming languages
  - R
  - Python
  - C++
  - JavaScript
- Source control
- Bayesian Data Integration
  - Our C++ library: Sleipnir
- Data exploration and visualization
  - IDE for R: RStudio
  - R web applications: Shiny
  - Interactive Python: IPython/Jupyter
- Web development
- Backend
- Framework: Django
- REST API: Django REST Framework
- Database: MySQL
- Search
- Frontend
- Backend
- Deployment
- MAGICAL: A hierarchical Bayesian approach that leverages paired scRNA-seq and scATAC-seq data from different conditions to map disease-associated transcription factors, chromatin sites, and genes as regulatory circuits. Resolving chromatin remodeling-linked gene expression changes at cell type resolution is important for understanding disease states. By simultaneously modeling signal variation across cells and conditions in both omics data types, MAGICAL achieves high accuracy on circuit inference.
- Point of Contact in Lab: Xi Chen (developer)
- Seqweaver: A deep learning-based algorithmic framework for predicting the RNA-binding protein dysregulation effects of sequence alterations with single nucleotide sensitivity.
- Point of Contact in Lab: Chris Park (developer), Tess Marvin (frequent user), Aviya Litman (frequent user)
- ExPecto: A framework for ab initio sequence-based prediction of the gene expression effects of mutations, and of disease risks, at a tissue-specific level.
- Point of Contact in Lab: Ksenia Sokolova (frequent user)
- CLEVER: A modular deep learning-based framework for predicting cell-type-specific gene expression directly from DNA sequence.
- Point of Contact in Lab: Ksenia Sokolova (developer)
- Sei: A framework for integrating human genetics data with sequence information to discover the regulatory basis of traits and diseases. Sei learns a vocabulary of regulatory activities, called sequence classes, using a deep learning model that predicts 21,907 chromatin profiles across >1,300 cell lines and tissues. Sequence classes provide a global classification and quantification of sequence and variant effects based on diverse regulatory activities, such as cell type-specific enhancer functions.
- Point of Contact in Lab: Kathy Chen (developer)
- DeepSEA: A deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factor binding, DNase I sensitivity, and histone marks in multiple cell types, and it uses this capability to predict the chromatin effects of sequence variants and to prioritize regulatory variants.
- Beluga: The 2019 version of DeepSEA, which can predict 2,002 chromatin features.
- DeepSEA - original: The original 2015 version of DeepSEA, which can predict 919 chromatin features.
- Point of Contact in Lab: Kathy Chen (developer of the related Sei), Ksenia Sokolova (uses Beluga in CLEVER), Chandra Theesfeld (frequent user)
- Variant Effect Prediction (VEP): The difference between the predicted probabilities of the alternative allele and the reference allele for a regulatory feature (P_alt − P_ref).
- Tissue-Specific Functional Networks: These tissue networks are constructed using a Bayesian probabilistic framework based on the prior biological information contained within a massive compendium of omics datasets (e.g., gene co-expression, transcription factor binding, and protein-protein interactions).
- GIANT Version 1
- GIANT Version 2
- Point of Contact in Lab: Zhicheng Pan, Aaron Wong (developer)
- Module detection: Community detection to find cohesive gene clusters from a provided gene list and a selected relevant tissue. Genes within a cluster share local network neighborhoods and together form a cohesive, specific functional module. Module detection enables systematic association of genes - even functionally uncharacterized genes - to specific processes and phenotypes represented in the detected modules. Functional modules are identified with tissue-specific functional networks, which predict gene interactions from massive data collections. Thus the discovered modules potentially capture higher-order tissue-specific function.
- Tapioca: A novel framework for predicting protein-protein interactions from thermal proximity coaggregation data. Created in collaboration with the Cristea Lab at Princeton.
- Point of Contact in Lab: Tavis Reed (developer)
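
The variant effect score defined in the list above (P_alt − P_ref) is simple enough to sketch directly. The snippet below assumes you already have per-feature probabilities for the reference and alternative alleles from some sequence model. The function name, feature names, and probability values are hypothetical and do not come from DeepSEA or any other lab tool.

```python
def variant_effect_scores(p_ref, p_alt):
    """Return P_alt - P_ref for every regulatory feature.

    p_ref, p_alt: dicts mapping feature name -> predicted probability
    for the reference and alternative allele, respectively.
    """
    if p_ref.keys() != p_alt.keys():
        raise ValueError("both alleles must be scored on the same features")
    return {feature: p_alt[feature] - p_ref[feature] for feature in p_ref}


# Made-up probabilities for two hypothetical chromatin features.
p_ref = {"feature_A": 0.25, "feature_B": 0.5}
p_alt = {"feature_A": 0.75, "feature_B": 0.375}

scores = variant_effect_scores(p_ref, p_alt)
# The feature with the largest absolute change is the one most affected.
strongest = max(scores, key=lambda feature: abs(scores[feature]))
print(strongest, scores[strongest])
```

A positive score means the alternative allele increases the predicted probability of the feature, and a negative score means it decreases it.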