StemID and RaceID2 algorithms

RaceID2 is an advanced version of RaceID, an algorithm for the identification of rare and abundant cell types from single-cell transcriptome data. The method is based on transcript counts obtained with unique molecular identifiers.

StemID is an algorithm that derives cell lineage trees from RaceID2 results and predicts multipotent cell identities.

RaceID2 and StemID are written in the R computing language.

Methods

  • initialize. Creates a SCseq object.
    The input is a data frame of transcript counts in which columns are cells and rows are genes. Run as:

    • sc <- SCseq(inputdata)
  • filterdata. Filters data.
    Input parameters and default values are:

    1. mintotal=1000 (discards cells with fewer than mintotal transcripts)
    2. minexpr=5, minnumber=1 (discards genes that do not have at least minexpr transcripts in at least minnumber cells)
    3. maxexpr=Inf (discards genes with more than maxexpr transcripts in at least one cell)
    4. downsample=FALSE (logical; when TRUE data is downsampled to mintotal transcripts per cell, otherwise it is median normalized)
    5. dsn=1 (number of downsamplings; output is an average over dsn downsamplings)
    6. rseed=17000 (seed used for downsampling)
    7. dsversion="JCB" (downsampling function version)

    Input parameters are stored in slot sc@filterparameters. The method first median-normalizes or downsamples (depending on downsample) the transcript counts of all cells with at least mintotal transcripts and stores the result in slot sc@ndata. It then removes genes according to minexpr, minnumber and maxexpr and stores the resulting data.frame in sc@fdata.

    • sc <- filterdata(sc, mintotal=1000, minexpr=5, minnumber=1, maxexpr=Inf, downsample=FALSE, dsn=1, rseed=17000, dsversion = 'JCB')
    • sc <- filterdata(sc) -- runs function with default values.
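The filtering steps described above can be sketched in base R on a toy count table. This is an illustration only, not the package's implementation: the toy values and variable names are hypothetical, and the sketch assumes that median normalization rescales every retained cell to the median total transcript count of the retained cells.

```r
# Toy transcript counts: rows are genes, columns are cells (hypothetical values).
counts <- data.frame(
  cellA = c(600, 300, 200, 0),
  cellB = c(1200, 500, 300, 2),
  cellC = c(10, 5, 3, 0),
  row.names = c("gene1", "gene2", "gene3", "gene4")
)

mintotal  <- 1000
minexpr   <- 5
minnumber <- 1
maxexpr   <- Inf

# Step 1: discard cells with fewer than mintotal transcripts.
totals <- colSums(counts)
kept   <- counts[, totals >= mintotal, drop = FALSE]

# Step 2 (assumed form of median normalization): rescale each cell so that
# its total equals the median total across the retained cells.
kept_totals <- colSums(kept)
ndata <- as.data.frame(t(t(kept) / kept_totals) * median(kept_totals))

# Step 3: keep genes with at least minexpr transcripts in at least minnumber
# cells, and without more than maxexpr transcripts in any cell.
keep_gene <- rowSums(ndata >= minexpr) >= minnumber &
  apply(ndata, 1, max) <= maxexpr
fdata <- ndata[keep_gene, , drop = FALSE]

print(dim(ndata))
print(rownames(fdata))
```

In this sketch ndata and fdata play the roles of sc@ndata and sc@fdata.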
  • clustexp. Clusters data using kmeans, kmedoids or hclust.
    Input parameters and default values are:

    1. clustnr=20 (Maximum number of clusters for the computation of the gap statistics or the derivation of the cluster number by the saturation criterion. Must be greater than 1.)
    2. bootnr=50 (Number of bootstrap runs for clusterboot.)
    3. metric="pearson" (Metric to compute distance between cells. Options are: "spearman","pearson","kendall","euclidean","maximum","manhattan","canberra","binary","minkowski". Check function dist.gen for more information. Distances are stored in sc@distances.)
    4. do.gap=TRUE (If set to TRUE, the number of clusters is determined using gap statistics. Default is TRUE.)
    5. sat=FALSE (incorporated in RaceID2, computes the number of clusters using saturation criterion.)
    6. SE.method="Tibs2001SEmax" ()
    7. SE.factor=.25 ()
    8. B.gap=50 (Number of bootstrap runs for the gap statistics.)
    9. cln=0 (Number of clusters for clustering. If it is 0, the number is determined by either gap statistics or the saturation criterion.)
    10. rseed=17000 (Seed for random number generator used in case of gap statistics and for posterior clustering.)
    11. FUNcluster="kmeans" (incorporated in RaceID2, this can be kmeans, hclust or kmedoids.)

Input parameters are stored in slot sc@clusterpar. Defaults are used for parameters that are not specified.
Data in sc@fdata is clustered using the clustfun function. First, the distance between cells is computed according to metric with the function dist.gen and stored in sc@distances as a matrix. Next, if required, the number of clusters is determined by either gap statistics or the saturation criterion, using the function clusGapExt. Finally, clustering is performed using the function clusterboot from the fpc R package. Output is stored in sc@cluster and sc@fcol:

  • object@cluster$kpart: contains the cluster assignment of each cell before outlier detection (the next step in the analysis).
  • object@cluster$jaccard
  • object@cluster$gap
  • object@cluster$clb
  • object@fcol

Run as:

  • sc <- clustexp(sc, clustnr=20, bootnr=50, metric="pearson", do.gap=FALSE, sat=TRUE, SE.method="Tibs2001SEmax", SE.factor=0.25, B.gap=50, cln=0, rseed=17000, FUNcluster="kmedoids")
  • sc <- clustexp(sc) -- runs function with default values
  • findoutliers. Finds outliers.
    Input parameters and default values are:
  1. outminc=5 ()
  2. outlg=2 ()
  3. probthr=1e-3 ()
  4. thr=2**-(1:40) ()
  5. outdistquant=.95 ()
  6. version = 2 (equal to 1 or 2, depending on RaceID version)

Run as:

  • sc <- findoutliers(sc, outminc=5, outlg=2, probthr=1e-3, thr=2**-(1:40), outdistquant=.95, version = 2)
  • sc <- findoutliers(sc) -- runs function with default values
  • comptsne. Computes tSNE map.
    Input parameters and default values are:
  1. rseed=15555 (seed for random numbers)
  2. sammonmap=FALSE ()
  3. initial_cmd=TRUE ()
  4. others ()

Run as:

  • sc <- comptsne(sc, rseed = 15555, sammonmap = FALSE)
  • sc <- comptsne(sc)

Plots

  • clustheatmap.

  • plottsne.

Functions

  • downsample. Downsamples inputdata.
    Transcript data is converted to integer data, random sampling is done dsn times, and the results are averaged. A pseudocount of 0.1 is added to the resulting data.frame. There are two versions (DG and JCB, written by Dominic Grün and Jean-Charles Boisset respectively). By default the function uses the JCB version. To choose the other one, set dsversion in method filterdata.
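The downsampling idea can be sketched in base R as follows. This is a generic illustration and reproduces neither the DG nor the JCB version: the function name downsample_sketch and the toy counts are hypothetical, and the sketch assumes rounding for the integer conversion and sampling of individual transcripts without replacement.

```r
# Hypothetical helper: downsample every cell (column) to n transcripts,
# repeat dsn times, average the runs, and add a pseudocount of 0.1.
downsample_sketch <- function(x, n, dsn = 1, seed = 17000) {
  set.seed(seed)
  x <- round(as.matrix(x))          # assumption: integer conversion by rounding
  storage.mode(x) <- "integer"
  stopifnot(all(colSums(x) >= n))   # every cell needs at least n transcripts
  acc <- matrix(0, nrow(x), ncol(x), dimnames = dimnames(x))
  for (run in seq_len(dsn)) {
    for (j in seq_len(ncol(x))) {
      # One entry per transcript, labelled with the row index of its gene.
      pool   <- rep(seq_len(nrow(x)), times = x[, j])
      picked <- pool[sample.int(length(pool), n)]
      acc[, j] <- acc[, j] + tabulate(picked, nbins = nrow(x))
    }
  }
  as.data.frame(acc / dsn + 0.1)
}

# Toy counts: rows are genes, columns are cells (hypothetical values).
counts <- data.frame(
  cellA = c(300, 150, 50),
  cellB = c(90, 60, 50),
  row.names = c("g1", "g2", "g3")
)
ds <- downsample_sketch(counts, n = 100, dsn = 3)
print(colSums(ds))
```

Each run draws exactly n transcripts per cell, so every column of the averaged result sums to n plus 0.1 times the number of genes.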

  • clustfun. Clusters sc@fdata.
    Version 2, from RaceID2. Computes the distance between cells (using the dist.gen function) with the specified metric. If required, determines the cluster number by gap statistics or the saturation criterion (using the clusGapExt function). Then clusters the data with the specified method (kmedoids, kmeans or hclust).
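Choosing the cluster number by gap statistics and then clustering can be sketched as follows. This is not clustfun itself: the sketch uses the standard clusGap and maxSE functions from the cluster package as stand-ins for clusGapExt, runs plain kmeans on toy coordinates instead of clusterboot on the filtered expression data, and omits the saturation criterion. The toy data and variable names are hypothetical.

```r
library(cluster)  # recommended package shipped with most R installations

set.seed(17000)

# Toy data: 60 "cells" (rows) with 5 features each, drawn around three
# different means (hypothetical values).
cells <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 5),
  matrix(rnorm(100, mean = 6),  ncol = 5),
  matrix(rnorm(100, mean = 12), ncol = 5)
)

# Gap statistics for 1..6 clusters with 50 bootstrap reference sets.
gap <- clusGap(cells, FUNcluster = kmeans, K.max = 6, B = 50,
               verbose = FALSE, nstart = 10, iter.max = 100)

# Cluster number from the gap curve, using the same method name and factor
# as the SE.method and SE.factor defaults listed for clustexp.
k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
           method = "Tibs2001SEmax", SE.factor = 0.25)

# Final clustering with the selected number of clusters.
fit <- kmeans(cells, centers = k, nstart = 10, iter.max = 100)
print(k)
print(table(fit$cluster))
```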

  • dist.gen. Distance between cells.
    Computes and returns the matrix of distances between cells for the specified metric. For metric "spearman", "pearson" or "kendall", the function takes 1 - correlation as the distance; for metric "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski", it uses the distance measure directly.
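The two branches described above can be sketched in base R. The function name cell_distances and the toy data are hypothetical, and the sketch assumes the input has genes in rows and cells in columns, like sc@fdata; the actual dist.gen may take its arguments in a different form.

```r
# Hypothetical helper: distance matrix between cells (columns of x).
cell_distances <- function(x, metric = "pearson") {
  x <- as.matrix(x)
  if (metric %in% c("spearman", "pearson", "kendall")) {
    # Correlation-based metrics: distance is 1 - correlation between cells.
    1 - cor(x, method = metric)
  } else {
    # Other metrics: use the distance measure directly; dist() compares
    # rows, so the matrix is transposed to put cells in rows.
    as.matrix(dist(t(x), method = metric))
  }
}

# Toy expression table: rows are genes, columns are cells (hypothetical values).
fdata <- data.frame(
  c1 = c(1, 2, 3, 4),
  c2 = c(2, 4, 6, 8),
  c3 = c(4, 3, 2, 1)
)

d_pearson   <- cell_distances(fdata, "pearson")
d_euclidean <- cell_distances(fdata, "euclidean")
print(round(d_pearson, 3))
print(round(d_euclidean, 3))
```

With a correlation metric, cells whose profiles differ only by a scale factor (c1 and c2 here) are at distance 0, whereas the Euclidean distance between them is not.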

  • clusGapExt. Gap statistics and saturation criterion.

The following files are provided:

  • StemID/RaceID2 class definition: RaceID2_StemID_class.R
  • StemID/RaceID2 sample code: RaceID2_StemID_sample.R
  • StemID/RaceID2 reference manual: Reference_manual_RaceID2_StemID.pdf
  • StemID/RaceID2 sample data: transcript_counts_intestine_5days_YFP.xls