Hurdle Model-based Method for Peak-to-Gene Linkage Analysis.
- Accounting for excess zeros in single-nucleus RNA data based on a two-component mixture Hurdle model
- Modeling linkages between peak open chromatin (ATAC) and gene expression (RNA) using regression model with covariates
- Flexible specification of analysis using cells from a given cell type of interest, each cell type or all cells
devtools::install_github("hbliu/Open4Gene")
library(Open4Gene)
- Load example data (./Open4Gene/inst/extdata/Open4Gene.Test.Data)
load(system.file("extdata", "Open4Gene.Test.Data", package = "Open4Gene"))
ls()
# "ATAC.Counts", "RNA.Counts", "Gene.Annotation", "Meta.Data", "Peak.Gene"
head(Peak.Gene)
#Peak Gene
#chr5-39400433-39402082 DAB2
#chr5-39400433-39402082 DAB2
- Preparing the object for Open4Gene analysis
Open4Gene.obj <- CreateOpen4GeneObj(RNA = RNA.Counts,
ATAC = ATAC.Counts,
Meta.data = Meta.Data,
Peak2Gene.Pairs = Peak.Gene,
Covariates = c("lognCount_RNA","percent.mt"),
Celltypes = "Cell_Type")
- Run Open4Gene analysis
Open4Gene.obj <- Open4Gene(object = Open4Gene.obj,
Celltype = "All", # Other options: Cell type name, e.g., "PT"; or "Each" to analyze each cell type
Binary = FALSE, # Binarize ATAC data if binary = TRUE
MinNum.Cells = 5) # Minimal number of cells with both RNA > 0 and ATAC > 0 for association test
- Output Open4Gene result
write.table(Open4Gene.obj@Res, file = "Open4Gene.obj.All.res.txt", sep="\t", col.names=TRUE, row.names=FALSE, quote=FALSE)
On output, Open4Gene.obj@Res provides the following values for each gene~peak pair:
- Peak
- Gene
- Celltype (Cell type name, "All" means all cell in the input data)
- TotalCellNum (Total number of cells used for the analysis)
- ExpressCellNum (Number of cells expressing given gene, RNA read count > 1)
- OpenCellNum (Number of cells with open chromatin in given peak, ATAC read count > 1)
- hurdle.res.zero.beta (beta value of zero hurdle model, binomial with logit link) *
- hurdle.res.zero.se (standard error value of zero hurdle model)
- hurdle.res.zero.z (z value of zero hurdle model)
- hurdle.res.zero.p (p value of zero hurdle model) *
- hurdle.res.count.beta (beta value of count model, truncated negbin with log link) *
- hurdle.res.count.se (standard error value of count model)
- hurdle.res.count.z (z value of count model)
- hurdle.res.count.p (p value of count model) *
- hurdle.AIC (Akaike information criterion of hurdle model)
- hurdle.BIC (Bayesian information criterion of hurdle model)
- spearman.rho (Spearman's rank correlation coefficient between RNA and ATAC)
- spearman.p (Spearman's p value between RNA and ATAC)
You may need columns with (*) for the downstream analysis.
This section describes how to prepare the input files for Open4Gene. Open4Gene needs following input data and parameters:
- RNA [dgCMatrix] Sparse matrix of scRNAseq read count, gene in row and cell in column
- ATAC [dgCMatrix] Sparse matrix of scATACseq read count, gene in row and cell in column
- Meta.data [data.frame] Meta data table with covariates; with cell ID in the rownames
- Covariates [character] Assign covariates that are needed for the analysis
- Celltypes [character] Assign celltype column from Meta.data
- Peak2Gene.Pairs [data.frame] Dataframe that contains Peak~Gene pairs to analyze using Open4Gene
- Gene.Annotation [GRanges] Gene annotation, e.g. EnsDb.Hsapiens.v75
- Peak2Gene.Dis [integer] Distance (peak to gene body), default is 100000 (bp)
1. RNA and ATAC read count matrix
Code for extracting count matrix and cell information from Seurat object with both RNA and ATAC assays:
RNA.Counts <- Seurat.object@assays$RNA@counts
ATAC.Counts <- Seurat.object@assays$ATAC@counts
Meta.Data <- Seurat.object@meta.data
Note here that, the cell IDs from different matrix should match with each other, e.g. columns of RNA matrix, columns of ATAC matrix, and rows of Meta.Data.
2. Meta.Data
This is a dataframe that contains cell information from the single cell RNA and ATAC, as following.
orig.ident | lognCount_RNA | percent.mt | Cell_Type | |
---|---|---|---|---|
HK2888_GTTTAACCAGCTCAAC-1 | HK2888 | 6.61 | 5.70 | PT |
HK2888_GGCTAGTGTCATGCCC-1 | HK2888 | 7.66 | 2.36 | Injured_PT |
HK2888_GGCTAGTGTAAGCTCA-1 | HK2888 | 7.49 | 0.04 | LOH |
Note here that, the cell IDs should be provided in the rownames. Make and check it using
Meta.Data <- Seurat.object@meta.data
head(rownames(Meta.Data))
3. Covariates
The covariates are used for the regression analysis. Features of cells in metadata can be used as covariates, e.g. lognCount_RNA, percent.mt.
4. Peak2Gene.Pairs
This is a dataframe that contains Peak~Gene pairs for Open4Gene, Peak (first column) and Gene (second column), as following.
Peak | Gene |
---|---|
chr5-39400433-39402082 | DAB2 |
chr5-39369336-39370159 | DAB2 |
5. Preparing the object for Open4Gene analysis using Gene.Annotation
Open4Gene can pick up the Peak2Gene.Pairs based on peak and gene distance (Peak2Gene.Dis) based on the input data. Here, the Gene.Annotation is a gene annotation in GRanges object, e.g. EnsDb.Hsapiens.v75. Code for preparing object for Open4Gene analysis using Gene.Annotation of EnsDb.Hsapiens.v75.
library(EnsDb.Hsapiens.v75)
Gene.Annotation <- genes(EnsDb.Hsapiens.v75)
Open4Gene.obj <- CreateOpen4GeneObj(RNA = RNA.Counts, ATAC = ATAC.Counts, Meta.data = Meta.Data,
Gene.Annotation = Gene.Annotation,
Peak2Gene.Dis = 100000,
Covariates = c("lognCount_RNA","percent.mt"),
Celltypes = "Cell_Type")
Then, Open4Gene.obj will be inputed to Run Open4Gene analysis.
1. Run Open4Gene for a given cell type, e.g. PT
Open4Gene.obj <- Open4Gene(object = Open4Gene.obj,
Celltype = "PT",
Binary = FALSE,
MinNum.Cells = 5)
2. Run Open4Gene for each cell type
Open4Gene.obj <- Open4Gene(object = Open4Gene.obj,
Celltype = "Each",
Binary = FALSE,
MinNum.Cells = 5)
3. Run Open4Gene using all cells
Open4Gene.obj <- Open4Gene(object = Open4Gene.obj,
Celltype = "All",
Binary = FALSE,
MinNum.Cells = 5)
It is time-consuming to run genome-wide peak-to-gene linkage analysis using Open4Gene. Open4Gene analysis on 3000 pairs takes about 5 hours. To perform genome-wide analysis, we recommend using Peak2Gene.Pairs to control the number of pairs analyzed in each chunk.
For any question, you are welcome to report your issue in Github or contact us hongbo919@gmail.com and ksusztak@pennmedicine.upenn.edu