R package: Clustering of Single Cell Sequencing Copy Number Profiles to Identify Clones
devtools::install_github("jysonganan/SCclust")
See vignettes. Development version.
This is an example of using SCclust package in R. SCclust implements determination of variable-length bin boundaries, computation of bin counts, GC normalization and segmentation on bin counts, determination of breakpoints, derivation of new feature set, hierarchical clustering on the binary matrix and identification of clone structure. This vignette aims to demonstrate the usage of SCclust through some example codes and example data.
Before the implementation of SCclust, there are some preprocessing steps: download reference genome, run four python programs and index/mapping commands.
Download and unzip the reference genome files. The reference genome files are saved in a directory /filepath/chromFa
.
Take hg19 and hg38 as examples:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz
cd filepath/ChromFa
python hg19.chrY.psr.py(hg38.chrY.psr.py)
bash bowtie.build.bash
parameters: read length, e.g. 100; genome, e.g. hg or hgdm
mkdir k100
python generate.reads.py 100 hg| /filepath/bowtie-0.12.7/bowtie -S -t -v 0 -m 1 -f hg - |
python mappable.regions.py > /k100/mappable.regions.txt 2> /k100/mappable.regions.stderr &
python chrom.sizes.py k100 hg
###(3) Prepare SAM files for all cells in one directory
/filepath/cellSAMdir
Directory for SAM files of all single cells, e.g. example.rmdup.sam
Make sure that you have installed the bioconductor package DNAcopy first.
install.packages("SCclust_0.1.4.tar.gz", repos = NULL, type="source")
Specify some work directories. chromFa_dir
: the directory for reference genome; k100_dir
: the directory for mappable regions.
library(SCclust)
chromFa_dir <- "/filepath/ChromFa"
k100_dir <- "/filepath/ChromFa/k100"
The count of bins can be set as any positive integer such as 5000, 10000, 20000.
bin_boundaries(k100_dir, bincount = 20000)
varbinGC(chromFa_dir, k100_dir, Nk = "20k")
SAM_dir <- "/filepath/cellSAMdir"
cellname <-"example"
bin_counts(SAM_dir,k100_dir, cellname, Nk = "20k")
bin_mat <- read.table("/filepath/cellSAMdir/example.varbin.20k.txt", header = T, sep = "\t")
gc <- read.table("/filepath/k100_dir/varbin.gc.content.20k.txt”, header = T, sep = “\t”)
One example dataset is included in the SCclust package.
bin_mat <- example1.varbin.20k
gc <- hg19.varbin.gc.20k
Given the bin count of one cell: bin_mat
and the GC content table gc
, then do GC normalization and segmentation. The package can also generate plots here: the upper plot is GC-normalized and ratio-normalized bin counts; the other plot is integer CN using "multiplier" method.
bin_mat_normalized <- gc_one(bin_mat,gc)
bin_mat_segmented <- cbs.segment_one(bin_mat_normalized, alpha = 0.05, nperm = 1000, undo.SD = 1.0, min.width = 5, method = "multiplier", genome = "hg" , graphic = TRUE)
segfile <- cbs.segment_all(SAM_dir, Nk = "20k", gc, alpha = 0.05, nperm = 1000, undo.SD = 1.0, min.width = 5, method = "multiplier", genome = "hg")
seg.quantal <- segfile$seg.quantal
ratio.quantal <- segfile$ratio.quantal
Determine the position of breakpoints and the sign of CN discontinuity. Output two tables breakpoint_table
and ploidies_table
.
Some bad cells can be deleted. Here we suppose CJA1024 and CJA1025 are the names of bad cells.
res1 <- preprocess_segfile(seg.quantal, gc, eviltwins = c("CJA1024", "CJA1025"), ploidies = TRUE)
breakpoint_table <- res1$breakpoint_table
ploidies_table <- res1$ploidies_table
The breakpoints are extended to an interval spanning 2*smear
+1 bins. The centromere areas are also filtered out. This results in the smear_table
.
The function findpins
implement the procedure to find a minimum set of "piercing points" for the extended intervals as the new feature set and derive the incidence table. The piercing points are a smallest set of points such that each interval contains at least one of them. pins
and pinmat
provide the bin location of new feature set and the incidence table. We extract cell_names
for stages later.
smear_table <- findsmears(breakpoint_table, smear = 1, keepboundaries = FALSE, mask_XY = TRUE)
res2 <- findpins(breakpoint_table, smear_table)
pins <- res2$pins
pinmat <- res2$pinmat
cell_names <- res2$cell_names
We perform Fisher's tests for pairwise dissimilarity on both observed incidence table and permutated incidence tables. This procedure simFisher_parallel
generates two vectors of p-values true_fisherPV
and sim_fisherPV
for observed and permutated data respectively.
res3 <- simFisher_parallel(pins, pinmat, sim_round = 500)
true_fisherPV <- res3$true_fisherPV
sim_fisherPV <- res3$sim_fisherPV
Compute the FDRs mat_fdr
for the observed Fisher's test p-values. Also output the dissimilarity matrix mat_dist
based on pairwise Fisher's test p-values.
res4 <- fdr_fisherPV(true_fisherPV, sim_fisherPV, cell_names, lm_max = 0.001, graphic = TRUE)
mat_fdr <- res4$mat_fdr
mat_dist <- res4$mat_dist
This procedure identify the hard and soft clones and display the hierarchical tree. The clones node information can be output from hc_clone$softclones
.
hc <- hclust_tree(pinmat, mat_fdr, mat_dist, hc_method = "average")
hc_clone <- find_clone(hc, fdr_thresh = -2, share_min = 0.85, n_share = 3, bymax = TRUE,
climb_from_size = 2, climb_to_share = 3, graphic = TRUE)
sub_hc_clone <- find_subclone(hc_clone, pinmat, pins, min_node_size = 6, sim_round = 500,
lm_max = 0.001, hc_method = "average",base_share = 3, fdr_thresh = -2, share_min = 0.90,
bymax = TRUE, climb_from_size = 2, climb_to_share = 3, graphic = TRUE)
Create an output directory /filepath/viewerInput
for the output files. The directory will also be the input directory for the viewer.
output_viewer(output_file_dir = "/filepath/viewerInput", seg.quantal, ratio.quantal, pins,
pinmat, mat_dist, hc_clone, sub_hc_clone, subcloneTooBig = 0.8, smear = 1, study="GL9.2")
Details for arguments and functions can be found by typing e.g. help(package="SCclust")
, ?fdr_fisherPV
.