Li Lin
EpiConv is a novel algorithm to cluster scATAC-seq data and detect differentially accessible (DE) peaks. We will demonstrate how to analyze scATAC-seq data by epiConv in this tutorial.
EpiConv is developed under the following environments:
- CentOS release 6.9 with perl 5.10.1
- bedtools v2.27.1
- MACS2
- R 3.5.1 with:
- Matrix package
- bigmemory package
- biganalytics package
- umap package
- PRIMME package
In order to run the scripts below, R packages densityClust
and heatmap.plus
are also needed. All source codes of epiConv can be found in the folder source/
. Here assume we put them into the folder ~/epiConv/
. Two source files need to be compiled.
g++ ~/epiConv/matrix_sum.c -o ~/epiConv/matrix_sum
g++ ~/epiConv/matrix_sampl.c -o ~/epiConv/matrix_sampl
We will use one dataset of PBMCs from 10X Genomics as example. Two files are required: Peak by cell matrix (filtered)
and Fragments (TSV)
. Here we put them into folder pbmc5k/
, extract the first file and rename the second file as pbmck5k_frag.bed.gz
.
There are two versions of epiConv: epiConv-full and epiConv-simp. EpiConv-full calculates the similarities between cells from raw Tn5 insertion profiles and epiConv-simp calculates the similarities from binary matrix. We first show the analysis pipeline for epiConv-simp. It is an implemention in R. First we read the source file of epiConv and the data:
source("~/epiConv/epiConv_functions.R")
mat<-readMM(file="pbmc5k/matrix.mtx")
barcode<-read.table(file="pbmc5k/barcodes.tsv",colClass="character")[,1]
peak<-read.table(file="pbmc5k/peaks.bed")
colnames(peak)<-c("seqnames","start","end")
rownames(mat)<-paste(peak$seqnames,":",
peak$start,"-",
peak$end,sep="")
colnames(mat)<-barcode
lib_size<-lib.estimate(mat)
freq<-freq.estimate(mat[,lib_size>1000])
res_epiConv<-create.epiconv(meta.features=data.frame(barcode=barcode[lib_size>1000],
lib_size=lib_size[lib_size>1000]))
res_epiConv<-add.mat(obj=res_epiConv,x=mat[freq!=0,lib_size>1000],name="peak")
We create an object res_epiConv
containing the raw data and results. Low quality cells (library size <1000) are removed.
create.epiConv
: create an epiConv object.meta.features
: the data.frame that contains meta features of single cells (e.g. barcodes and library size).- The meta features of cells can be obtained from the object using the form such as
res_epiConv$barcode
orres_epiConv$lib_size
.
add.mat
: add matrix to the epiConv object.obj
: the epiConv object.x
: the matrix.name
: the list name used to store the matrix.
Next, we normalize the matrix by TF-IDF transformation:
mat<-tfidf.norm(mat=res_epiConv@mat[["peak"]],lib_size=res_epiConv$lib_size)
infv<-inf.estimate(mat[,sample(1:ncol(mat),size=500)],
sample_size=0.125,nsim=30)
tfidf.norm
: perform the TF-IDF transformation.mat
: the matrix (row/peak and column/cell). It must be a Matrix object (created by readMM() or Matrix()).lib_size
: library size used in normalization. In the scripts above, we used the total number of accessible regions in each cell as library size.
inf.estimate
: learn a small value to replace infinite value in the analysis below.mat
: the matrix after TF-IDF transformation. We randomly sample a small fraction of cells from the full matrix to save the running time.sample_size
: the fraction of peaks used in each bootstrap.nsim
: the number of bootstraps.- The settings above are suitable for most data.
The similarities between single cells are calculated based on bootstrap approach. We randomly sample some peaks and calculate the similarites between single cells in each replicate and then average the results from replicates:
sample_size<-floor(nrow(mat)/8)
nsim<-30
sample_matrix<-lapply(1:nsim,function(x) sample(1:nrow(mat),size=sample_size))
Smat<-matrix(0,res_epiConv@ncell,res_epiConv@ncell)
for(i in sample_matrix){
Smat<-Smat+epiConv.matrix(mat=mat[i,],inf_replace=infv)
}
Smat<-Smat/nsim
res_epiConv<-add.similarity(obj=res_epiConv,x=Smat,name="sampl_simp")
In the script above, sample_matrix
contains the peaks in each replicate. Smat
contains the calculated similarities between single cells.
epiConv.matrix
: calculate the similarites.mat
: the matrix used to calculate the similarities.inf_replace
: the value used to replace log10(0). We can the value calculated above or an empirical value (e.g. -8).
add.simlarity
: add similarity matrix to the epiConv object.obj
: the epiConv object.x
: the similarity matrix.name
: the list name used to store the similarity matrix.- The similarity matrix can be obtained from the object using the form
res_epiConv[["sampl_simp"]]
.
Next we denoise the similarities between single cells:
Smat<-batch.blur(Smat=res_epiConv[["sampl_simp"]],
batch=NULL,
knn=50)
res_epiConv<-add.similarity(res_epiConv,x=Smat,name="sampl_simp_denoise")
batch.blur
: the function used to denoise the similarity matrix.Smat
: the similarity matrix for denoising.batch
: the function does not only denoise the data but also can remove batch effect. If you want to remove the batch effects, the batch information should be provided through this parameter (a numeric or character vector). If the data only contains cells from single batch, it should be set to NULL.knn
: the number of neighbors for each cell.
Finally we use umap to learn the low-dimensional embedding of the data:
umap_settings<-umap::umap.defaults
umap_settings$input<-"dist"
umap_settings$n_components<-2
umap_res<-umap::umap(max(Smat)-Smat,config=umap_settings)$layout
res_epiConv<-add.embedding(obj=res_epiConv,x=umap_res,name="sampl_simp_denoise")
The distance is calculated by max(Smat)-Smat
.
add.embedding
: add embedding to the epiConv object.obj
: the epiConv object.x
: the embedding matrix.name
: the list name used to store the embedding.
To prepare the input for epiConv-full, we save the list of high-quality barcodes and the epiConv object:
temp<-data.frame(res_epiConv@meta.features$barcode,1)
write.table(temp,file="pbmc5k/pbcm5k_ident.tsv",row.names=F,col.names=F,quote=F,sep="\t")
saveRDS(res_epiConv,file="pbmc5k/res_epiConv.rds")
Sometimes error such as problem too large
will occur when the dataset is large (e.g. >80,000 cells). In order to avoid such problem, Several functions of our algorithm have the bin
parameter with default value of 10,000. When the dataset contains more than 10,000 cells, the function will split the matrix into several parts, each containing 10,000 cells. Generally there is no need to change bin
, but if error still occurs, you can try small values (e.g. epiConv.matrix(mat=mat,inf_replace=infv,bin=5000)
). Large datasets also require more memory. Based on our test, epiConv requires 520GB memory for dataset with 81,173 cells and 436,206 peaks.
In order to accelerate the running speed, epiConv-full runs in bash shell but some common steps are performed in R. Its input is a gzip-compressed bed file named <prefix>_frag.bed.gz
(e.g pbmc5k/pbmc5k_frag.bed.gz
):
1st column | chromsome |
2nd column | starting site of the fragment |
3rd column | ending site of the fragment |
4th column | cell barcode |
An example file:
chr1 10073 10198 CCGCATTGTGTTTCTT-1
chr1 10085 10198 GAGACTTCAGCAACAG-1
chr1 10085 10302 GGGCCATAGAGATTAC-1
chr1 10126 10309 CCACGTTCAGTAGTCT-1
...
Other columns will be ignored. Generally this file will be provided by low level processing tools (e.g. cellranger from 10X Genomics).
Also we need to prepare a Tab-delimited file containing valid barocdes named <prefix>_ident.tsv
(e.g. pbmc5k/pbmc5k_ident.tsv
) as follows:
1st column | cell barcode |
2nd column | custom information (e.g batch, experimental condition, or 1 for all cells if there are not any information. But you can NOT use "NA" as epiConv will ignore all barcodes with "NA" identity.) |
An example file:
AAACGAAAGACACTTC-1 1
AAACGAAAGCATACCT-1 1
AAACGAAAGCGCGTTC-1 1
AAACGAAAGGAAGACA-1 1
AAACGAACAGGCATCC-1 1
...
Only valid barcodes will be processed.
First we use peak_calling.sh
to call high density regions of Tn5 insertions:
peak_calling.sh <prefix> <extsize> <fraction of data retained>
<prefix>
: the prefix of data.<extsize>
: the same parameter inMACS2 pileup
. For example, set<extsize>
to 100 will make MACS2 extend each insertion from both directions by 100bp before pileup insertions.<fraction of data retained>
: the fraction of data for downstream analysis. This parameter need not to be accurately specified but should be close to the fraction of fragments from nucleosome-free regions. For example, you can check the histogram of insertion length from 10X Genomics reporting summary to learn this parameter (the fraction of fragments in first peak in the histogram).
For the PBMC dataset, we use the following command:
~/epiConv/peak_calling.sh pbmc5k/pbmc5k 100 0.7
When it is finished (~4 hours), there will be a new file pbmc5k/pbmc5k_peak.bed
in bed format containing the peaks. In order to run epiConv in parallel, we need to split the job using the linux command split
:
split -a2 -d -l127000 pbmc5k/pbmc5k_peak.bed pbmc5k/pbmc5k_peak.run
Here we split the peak file into 10 jobs, each containing 127000 peaks. Like epiConv-simp, we need to randomly sample some peaks to perform bootstraps. We generate the sampling file using peak_sampl.sh
:
~/epiConv/peak_sampl.sh <peak file> <number of bootstraps> <fraction of peaks in each bootstrap> \
<random seed> > <prefix>_sampl.mtx
<peak file>
: the peak file generated in previous step.<number of bootstraps>
: number of bootstraps.<fraction of peaks in each bootstrap>
: fraction of peaks in each bootstrap.<random seed>
: random seed.peak_sampl.sh
will directly print results to the standard output. We write the results into<prefix>_sampl.mtx
.
In following steps, each job requires approximately (n cells)^2/2*(n bootstraps)*4/2^30 GB RAM (e.g. 1.4 GB RAM for 5,000 cells). In order to reduce the required memory when the dataset is large, we can adjust bootstrap settings. For example, we can set <number of bootstraps>
to 10 and increase <fraction of peaks in each bootstrap>
to 0.25 to make sure that each peak is sampled > 2x times. Actually the bootstrap step is aimed to reduce the noise for shallow-sequencing data (e.g. < 2,000 fragment for most cells). Based on our analysis, the results remain similar even without any bootstrap for most datasets (number of bootstrap=1; fraction of peaks=1). So reducing the number of bootstraps won't affect the results but can reduce the required memory.
For the PBMC dataset, we use the following command:
~/epiConv/peak_sampl.sh pbmc5k/pbmc5k_peak.bed 30 0.125 12345 >pbmc5k/pbmc5k_sampl.mtx
Then we use convolution.sh
to calculate the similarites between single cells:
convolution.sh <prefix> <suffix> <standard deviation of normal distribution>
<prefix>
: the prefix of data.<suffix>
: the suffix of the data, depends on thesplit
command called above (e.g. run00, run01, run02...)<standard deviation of normal distribution>
: affect the interactions between insertions. We think insertions from two cells with distance < 4σ may suggest a shared regulatory element.
For the PBMC dataset,we run convolution.sh
as follows:
~/epiConv/convolution.sh pbmc5k/pbmc5k run00 100
~/epiConv/convolution.sh pbmc5k/pbmc5k run01 100
~/epiConv/convolution.sh pbmc5k/pbmc5k run02 100
...
This step can be run in parallel to save the running time. convolution.sh
will automatically read the inputs. So all input files should be properly named as described above.
After running (each job requires ~4 hours), the script will generate two files: <prefix>_cmat.<suffix>
and <prefix>_sampled.<suffix>
. <prefix>_cmat.<suffix>
is a binary file contains the pairwise similarities bewtween cells for each peak. <prefix>_sampled.<suffix>
is a Tab-delimited file contains (ncells)*(ncells-1)/2 rows and nbootstraps columns, with each element containing the similarity between two cells in one bootstrap. An example for pbmc5k/pbmc5k_sampled.run00
:
10.5113 9.8516 18.8105 6.5908 7.2311 ......
3.9813 5.4448 1.8991 3.7418 4.0857 ......
6.8450 12.2343 1.5356 6.6680 2.7696 ......
0.8218 4.7227 11.9622 1.1415 8.0251 ......
27.5526 15.5030 29.7605 15.5724 20.6794 ......
......
In order to summarize the results from each small job, we need to sum the corressponding elements for each job using paste
and gawk
:
paste -d " " pbmc5k/pbmc5k_sampled.run?? |\
gawk -f ~/epiCOnv/run_merge.gawk ncol=30 >pbmc5k/pbmc5k_sampled.mat
In the script above, ncol
should be equal to number of columns in pbmc5k/pbmc5k_sampled.run??
. If you split the task into many jobs (e.g. 100), this step can also run in parallel as follows (assuming we split it into 100 jobs with suffix from run00 to run99):
paste -d " " pbmc5k/pbmc5k_sampled.run0? |\
gawk -f ~/epiConv/run_merge.gawk ncol=30 >pbmc5k/pbmc5k_sampled0.mat
paste -d " " pbmc5k/pbmc5k_sampled.run1? |\
gawk -f ~/epiConv/run_merge.gawk ncol=30 >pbmc5k/pbmc5k_sampled1.mat
.......
paste -d " " pbmc5k/pbmc5k_sampled.run9? |\
gawk -f ~/epiConv/run_merge.gawk ncol=30 >pbmc5k/pbmc5k_sampled9.mat
paste -d " " pbmc5k/pbmc5k_sampled?.mat |\
gawk -f ~/epiConv/run_merge.gawk ncol=30 >pbmc5k/pbmc5k_sampled.mat
Then, we summarize the results from bootstraps and transform the data into a ncell x ncell square matrix:
gawk -f ~/epiConv/rep_merge.gawk pbmc5k/pbmc5k_sampled.mat \
>pbmc5k/pbmc5k_smat.txt
An example for pbmc5k/pbmc5k_smat.txt
:
1.85311
1.64461,2.03432
1.79543,2.22283,1.82333
1.78524,2.26929,1.97568,2.18579
.......
Note that the first line is blank. Only lower triangle elements are stored. The matrix is not normalized yet. In order to normalize the matrix, we need to obtain the number of insertions falling into the peaks for each single cell:
paste pbmc5k/pbmc5k_lib.run* |\
gawk '{sum=0;for(j=1;j<=NF;j++) sum+=$j;print sum}' |\
paste pbmc5k/pbmc5k_ident.tsv - > pbmc5k/pbmc5k.info
In the script above, we calculated the library sizes of single cells and combined them with the identity file. pbmc5k/pbmc5k.info
shall be like this:
AAACGAAAGACACTTC-1 1 13154
AAACGAAAGCATACCT-1 1 22474
AAACGAAAGCGCGTTC-1 1 11189
AAACGAAAGGAAGACA-1 1 35276
AAACGAACAGGCATCC-1 1 17233
The 3rd column is the library size for single cells. After that, all steps in bash shell are finished. The following steps are performed in R. First we still need to read the scource file and data:
source("~/epiConv/epiConv_functions.R")
res_epiConv<-readRDS(file="pbmc5k/res_epiConv.rds")
cell_info<-read.table(file="pbmc5k/pbmc5k.info")
res_epiConv@meta.features<-cbind(res_epiConv@meta.features,lib_size_full=cell_info[,3])
Smat<-read.csv(file="pbmc5k/pbmc5k_smat.txt",col.names=res_epiConv$barcode,
header=F,fill=T,blank.lines.skip=F)
Smat<-as.matrix(Smat)
Smat[is.na(Smat)]<-0
Smat<-Smat+t(Smat)
Smat<-t(t(Smat)-log10(res_epiConv$lib_size_full))-log10(res_epiConv$lib_size_full)
res_epiConv<-add.similarity(res_epiConv,x=Smat,name="sampl_full")
In the script above, we read the similarity matrix and normalize it by library size of single cells (each element needs to be normalized by the library size of corresponding row and column). Then we blur the similarity matrix and perform low-dimensional embedding:
Smat<-batch.blur(Smat=res_epiConv[["sampl_full"]],
batch=NULL,
knn=50)
res_epiConv<-add.similarity(res_epiConv,x=Smat,name="sampl_full_denoise")
umap_settings<-umap::umap.defaults
umap_settings$input<-"dist"
umap_settings$n_components<-2
umap_res<-umap::umap(max(Smat)-Smat,config=umap_settings)$layout
res_epiConv<-add.embedding(res_epiConv,x=umap_res,name="sampl_full_denoise")
saveRDS(res_epiConv,file="pbmc5k/res_epiConv.rds")
Here we describe another denoising method based on Eigen value decomposition.
Smat<-res_epiConv[["sampl_full"]]
eigs<-dim.reduce(Smat=Smat,neigs=50)
umap_res<-umap::umap(eigs)$layout
res_epiConv<-add.embedding(res_epiConv,x=umap_res,name="sampl_full_decomp")
saveRDS(res_epiConv,file="pbmc5k/res_epiConv.rds")
dim.reduce
: calculate latent features from the similarity matrix.Smat
: the similarity matrix.neigs
: the number of latent features to calculate.
The results eigs
is a matrix with latent features(row/cell and column/feature). Features are orthogonal to each other and sorted by their Eigen values in decreasing order. After obtaining latent features, we calculate the Euclidean distance between cells and embed cells into low-dimension space. The result by Eigen value decompsition is similar with other methods that infer latent features (e.g cisTopic). However, it sometimes does not agree with the first denoising method as the data discarded can be noise or useful information. Generally, it is better to compare the results between two denoising methods when analyzing the data.
All figures plotted below can be found in figures.pdf
.
Before differential analysis, we cluster the cells using densityClust package and compare the results from epiConv-full and epiConv-simp:
source("~/epiConv/epiConv_functions.R")
res_epiConv<-readRDS(file="pbmc5k/res_epiConv.rds")
colsfun<-colorRampPalette(c("red","chocolate","darkorange","gold","darkgreen","green3","darkolivegreen1",
"blue","lightblue","cyan","lightpink","magenta","purple"))
library(densityClust)
ncluster<-10
dclust_obj<-densityClust(res_epiConv@embedding[["sampl_full_denoise"]],gaussian=T)
rho_cut<-quantile(dclust_obj$rho,0.5)
delta_cut<-sort(dclust_obj$delta[dclust_obj$rho>=rho_cut],decreasing=T)[ncluster+1]
clust<-findClusters(dclust_obj,rho=rho_cut,delta=delta_cut)$clusters
plot(res_epiConv@embedding[["sampl_full_denoise"]],pch="+",col=colsfun(ncluster)[clust])
legend("bottomright",legend=1:ncluster,col=colsfun(ncluster),cex=2,pch="+")
plot(res_epiConv@embedding[["sampl_simp_denoise"]],pch="+",col=colsfun(ncluster)[clust])
legend("bottomright",legend=1:ncluster,col=colsfun(ncluster),cex=2,pch="+")
plot(res_epiConv@embedding[["sampl_full_decomp"]],pch="+",col=colsfun(ncluster)[clust])
legend("bottomright",legend=1:ncluster,col=colsfun(ncluster),cex=2,pch="+")
In the case of PBMCs, results by different methods are similar. In the following analyzes, we use the similarity matrix from epiConv-full by knn-based method for differential analysis.
freq<-freq.estimate(res_epiConv@mat[["peak"]])
zsingle<-zscore.single(mat=res_epiConv@mat[["peak"]][freq>0.01,],
Smat=res_epiConv[["sampl_full_denoise"]]+res_epiConv[["sampl_full"]]/100,
qt=0.05,
lib_size=res_epiConv$lib_size_full)
marker<-rownames(zsingle)[apply(zsingle,1,function(x) sum(x>2)/length(x))>0.1]
zscore.single
: calculate the z-scores.mat
: the accessibility matrix. Here we only examine the peaks with frequency > 0.01.Smat
: the similarity matrix. We useres_epiConv[["sampl_full_denoise"]]+res_epiConv[["sampl_full"]]/100
to avoid cells with equal similarities as the similarities in denoised matrix are integer values.qt
: the percent of cells as neighbors. For example, qt = 0.05 means the top 5% cells with highest similarities is the cell's neighbors.lib_size
: the scaling factor of single cells. Here we use the library size inferred in previous steps. Here, we select peaks with z-score > 2 in at least 10% cells as differentially accessible peaks:
Finally we plot the heatmap of zscores:
dis<-1-cor(t(zsingle[marker,]),method="spearman")
umap_settings<-umap::umap.defaults
umap_settings$input<-"dist"
umap_settings$n_components<-1
xcoor<-umap::umap(dis,config=umap_settings)$layout[,1]
peak_odr<-order(xcoor)
umap_settings<-umap::umap.defaults
umap_settings$input<-"dist"
umap_settings$n_components<-1
Smat<-res_epiConv[["sampl_full_denoise"]]
umap_res<-umap::umap(as.matrix(max(Smat)-Smat),config=umap_settings)$layout
res_epiConv<-add.embedding(obj=res_epiConv,x=umap_res,name="pbmc5k1d")
odr<-order(res_epiConv@embedding[["pbmc5k1d"]])
heatmap.plus::heatmap.plus(zsingle[marker[peak_odr],odr],
col=colorRampPalette(c("lightblue","black","yellow"))(32),
labCol=F,
labRow=F,
ColSideColors=cbind(colsfun(ncluster)[clust[odr]],"white"),
Rowv=NA,
Colv=NA)
Cells from cluster 3 and 9 (NK and effector CD8+ T cells) share many common markers. In order to find differentially accessible peaks between them, we remove other cells to prevent selecting their common markers:
retain<-which(clust%in%c(3,9))
freq<-freq.estimate(res_epiConv@mat[["peak"]][,retain])
zsingle2<-zscore.single(mat=res_epiConv@mat[["peak"]][freq>0.01,retain],
Smat=res_epiConv[["sampl_full_denoise"]][retain,retain],
qt=0.1,
lib_size=res_epiConv$lib_size_full[retain])
marker2<-rownames(zsingle2)[apply(zsingle2,1,function(x) sum(x>2)/length(x))>0.1]
dis<-1-cor(t(zsingle2[marker2,]),method="spearman")
umap_settings<-umap::umap.defaults
umap_settings$input<-"dist"
umap_settings$n_components<-1
xcoor<-umap::umap(dis,config=umap_settings)$layout[,1]
peak_odr<-order(xcoor)
odr<-order(res_epiConv@embedding[["pbmc5k1d"]][retain])
heatmap.plus::heatmap.plus(zsingle2[marker2[peak_odr],odr],
col=colorRampPalette(c("lightblue","black","yellow"))(32),
labCol=F,
labRow=F,
ColSideColors=cbind(colsfun(ncluster)[clust[retain][odr]],"white"),
Rowv=NA,
Colv=NA)