BDS Hackathon

Schedule

Date/Time: Nov 7th 9am - Nov 8th (midnight)

Room: MR5 3005

Datapalooza: Nov 9th-10th (Th-Fr)

Overview

Single-cell RNA-seq (scRNA-seq) has now become routine, and there are hundreds of published scRNA-seq datasets from various biological systems. One of the key applications of scRNA-seq is to separate cell types. For example, some studies identify novel cell types that were previously hidden by population averaging; others use the data to classify individual cells as cancer cells or normal cells, leading to cleaner expression profiles.

A typical analysis uses an unsupervised dimensionality reduction method (like t-SNE, PCA, or MDS) to visualize the high-dimensional expression data and identify clusters of individual cells. These clusters are then annotated post hoc based on their gene expression patterns. Most effort so far has gone into unsupervised analysis of scRNA-seq data, where the features that distinguish the clusters are derived from the factors used to build them (for example, principal component loadings). A supervised approach could provide a better way to identify salient features.
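As a rough illustration of the unsupervised workflow described above, here is a minimal sketch using PCA computed with NumPy. The expression matrix is synthetic, and the group sizes, gene counts, and effect size are assumptions chosen only to make the example self-contained; real data would need normalization and a proper clustering step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a log-normalized expression matrix (cells x genes).
# Assumption: two hypothetical cell groups that differ only in genes 0-4.
n_cells_per_group, n_genes = 50, 200
group_a = rng.normal(0.0, 1.0, size=(n_cells_per_group, n_genes))
group_b = rng.normal(0.0, 1.0, size=(n_cells_per_group, n_genes))
group_b[:, :5] += 6.0
expr = np.vstack([group_a, group_b])

# PCA via SVD of the gene-centered matrix.
centered = expr - expr.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pcs = u[:, :2] * s[:2]   # cell coordinates on PC1 and PC2
loadings = vt[:2]        # gene loadings for PC1 and PC2

# Crude unsupervised split: sign of the PC1 score (scores are mean-centered).
cluster = (pcs[:, 0] > 0).astype(int)

# Post hoc annotation: genes with the largest absolute PC1 loadings.
top_genes = np.argsort(-np.abs(loadings[0]))[:5]
print("cluster sizes:", np.bincount(cluster))
print("top PC1 genes:", sorted(top_genes.tolist()))
```

Note that the "defining" genes here are read off the PC1 loadings after the fact, which is the step a supervised model would replace.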

Therefore, we are now interested in applying supervised machine learning methods to build models that can classify individual cells into cell types. These models will be useful for (at least) two potential downstream applications:

  1. They could span multiple data sets and thereby build a pan-cell-type predictor that could be fed new scRNA-seq data from a new experiment and be used to classify known cell types.

  2. They will provide a novel look at the feature set that defines a cell type, which is not revealed by the unsupervised methods.
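To make the two applications above concrete, here is a minimal sketch of a supervised cell-type classifier. It uses a nearest-centroid rule written in plain NumPy, which is just one simple choice and not the method the hackathon settled on. The cell-type labels ("glia", "neuron"), the marker gene positions, and all function names are hypothetical.

```python
import numpy as np

def fit_centroids(expr, labels):
    """Return the sorted class names and one mean expression profile per class."""
    classes = sorted(set(labels))
    labels = np.asarray(labels)
    centroids = np.vstack([expr[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(expr, classes, centroids):
    """Assign each cell to the class with the closest centroid (Euclidean)."""
    dists = np.linalg.norm(expr[:, None, :] - centroids[None, :, :], axis=2)
    return [classes[i] for i in dists.argmin(axis=1)]

def top_features(centroids, k):
    """Rank genes by how far apart their centroid values are across classes."""
    spread = centroids.max(axis=0) - centroids.min(axis=0)
    return np.argsort(-spread)[:k]

rng = np.random.default_rng(1)
n_genes = 100

def simulate(n_cells, label):
    """Synthetic cells; assumption: genes 10-12 are elevated in 'neuron'."""
    x = rng.normal(0.0, 1.0, size=(n_cells, n_genes))
    if label == "neuron":
        x[:, 10:13] += 5.0
    return x

train = np.vstack([simulate(40, "glia"), simulate(40, "neuron")])
train_labels = ["glia"] * 40 + ["neuron"] * 40
new_cells = np.vstack([simulate(10, "glia"), simulate(10, "neuron")])
true_labels = ["glia"] * 10 + ["neuron"] * 10

classes, centroids = fit_centroids(train, train_labels)
predicted = predict(new_cells, classes, centroids)
accuracy = np.mean([p == t for p, t in zip(predicted, true_labels)])
markers = top_features(centroids, 3)
print("accuracy on new cells:", accuracy)
print("top discriminating genes:", sorted(markers.tolist()))
```

The `predict` call corresponds to application 1 (classifying cells from a new experiment), and `top_features` corresponds to application 2 (inspecting the features that define a cell type).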

We should seek to build a reproducible piece of software that will enable others with scRNA-seq data to either re-run our analysis to build a predictor for the new dataset, or to use the predictor we have built to classify newly sequenced single cells.

Compute organization

Each of you should be a member of the bds_tg group on Rivanna. We have an allocation of disk space on Rivanna at /sfs/lustre/allocations/bds_tg. I suggest everyone set an environment variable to point to this for easy communication:

export BDSDATA="/sfs/lustre/allocations/bds_tg"
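From Python or a notebook, the same location can be picked up from the environment. This is a small sketch: the helper name bds_data_dir and the file name example_counts.csv are hypothetical and only show how paths could be built.

```python
import os
from pathlib import Path

def bds_data_dir(default="/sfs/lustre/allocations/bds_tg"):
    """Return the shared data directory, preferring $BDSDATA when it is set."""
    return Path(os.environ.get("BDSDATA", default))

# Hypothetical file name, used only to illustrate path construction.
counts_path = bds_data_dir() / "example_counts.csv"
print(counts_path)
```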

You should also have access to a compute credit allocation, also called bds_tg (use allocations to see yours).

Please commit any code into this repository.

Data

Resources

Some planning ideas (from aakrosh):

Machine learning links:

  • http://h2o.ai - machine learning library (with R and python bindings)
  • caret R package - unified R interface to many machine learning tools.
  • mlr R package - unified R interface to many machine learning tools.
  • R tensorflow
  • TensorFlow - Google's deep learning library
  • Hu et al. 2016 - BMC Genomics paper on supervised classification of single-cell RNA: "A machine learning approach for the identification of key markers involved in brain development from single-cell transcriptomic data."

scRNA analysis:

scRNA data:

Notes on approaches taken:

  • In two scRNA-seq cancer studies, the authors inferred tumor malignancy/state from a CNV analysis of the scRNA-seq data (averaging normalized expression levels over relatively large stretches of the genome). This was typically followed by unsupervised clustering and inspection of key marker gene expression to identify cell types. One group identified critical, disease-distinguishing genes by their principal component loadings or their correlation with PC1, PC2, or PC3. This last step is where supervised machine learning approaches could do a better job.
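The CNV step in the note above (averaging normalized expression over large stretches of the genome) can be sketched as a sliding-window mean over genes ordered by genomic position. This is only an illustration on synthetic data, not the procedure used in those studies; the window size, the simulated gain region, and the function name are assumptions.

```python
import numpy as np

def windowed_expression(expr, window):
    """Average each cell's expression over sliding windows of adjacent genes.

    expr: cells x genes matrix, with genes already sorted by genomic position.
    Returns a cells x (genes - window + 1) matrix.
    """
    kernel = np.ones(window) / window
    return np.vstack([np.convolve(cell, kernel, mode="valid") for cell in expr])

rng = np.random.default_rng(2)
n_genes = 300

# Synthetic data: normal reference cells, and "tumor" cells with a
# simulated gain (higher expression) across genes 100-199.
normal = rng.normal(0.0, 1.0, size=(20, n_genes))
tumor = rng.normal(0.0, 1.0, size=(20, n_genes))
tumor[:, 100:200] += 1.0

# Center on the normal reference so the signal is relative to normal cells.
reference = normal.mean(axis=0)
smoothed = windowed_expression(tumor - reference, window=50)

# Average profile across tumor cells; window i covers genes i..i+49.
profile = smoothed.mean(axis=0)
gain_score = profile[100:151].mean()      # windows fully inside the gain
background_score = profile[:51].mean()    # windows fully before the gain
print("gain region:", round(float(gain_score), 2))
print("background:", round(float(background_score), 2))
```

Smoothing over many adjacent genes suppresses gene-level noise, so a region-wide shift stands out even though individual genes are noisy.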