/cRegulon

cRegulon is an optimization model to identify combinatorial regulon from single cell expression and chromatin accessibility data.

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

cRegulon

Introduction

This is cRegulon software: an optimization model an optimization model for modeling combinatorial regulation from single-cell multi-omics provides units underpinning cell type landscape.

Requirements:

  1. Homer in your environment PATH then install hg38 and mm10
  2. Python >=3.0 and packages below:
  3. numpy-groupies==0.9.15
  4. numpy==1.26.3
  5. scipy==1.12.0
  6. pandas==1.4.3
  7. ismember==1.0.2
  8. mpmath==1.2.1
  9. pybedtools==0.9.0
  10. scikit-learn==1.4.2

Installing cRegulon with the following command:

wget https://github.com/SUwonglab/cRegulon/archive/master.zip
unzip master.zip
cd cRegulon-master
wget -O cRegulonData.tar.gz https://figshare.com/ndownloader/files/47503895
tar -xzvf cRegulonData.tar.gz

We also provide an example for cRegulon analysis, which can be downloaded from the following command:

wget -O cRegulonExample.tar.gz https://figshare.com/ndownloader/files/49856427
tar -xzvf cRegulonExample.tar.gz

After downloading, there will be 5 folders that store raw data, intermediate results, and final results of example:
./example_data/
./CSI/
./PseudoBulk/
./Networks/
./Results/

Tutorial of cRegulon analysis

Input single cell data of cRegulon

The typic input file of scRNA-seq data is a gene by cell count matrix:

scRNA RNACellID1 RNACellID2 RNACellID3
Gene1 5 0 3
Gene2 0 2 0
Gene3 1 0 0
The typic input file of scATAC-seq data is a peak by cell count matrix:
scATAC ATACellID1 ATACellID2 ATACellID3 ATACellID4
Peak1 1 0 1 0
Peak2 0 1 0 1
Peak3 1 0 0 0
The peaks are in the format of "chr_start_end".

In practice, the file of single cell dataset can be the 10x folder format (example dataset: ./example_data/RA/) or a matrix txt format (example dataset: ./example_data/CL/). In this tutorial, we use RA dataset for demonstration.

Input cell type meta data

The typic cell type meta file (./example_data/RA/RA_scRNA_Cluster.txt) of scRNA-seq data is as follows:

RNACellID1 RAC1
RNACellID2 RAC2
RNACellID3 RAC3
The typic cell type meta file (./example_data/RA/RA_scATAC_Cluster.txt) of scATAC-seq data is as follows:
ATACellID1 RAC2
ATACellID2 RAC3
ATACellID3 RAC1
ATACellID4 RAC2

cRegulon has three main modes, which are three steps of cRegulon model

python3 cRegulon.py {prep,grn,model} ...

prep: Preprocessing mode
grn: GRN mode
model: Model mode

Step 1: preprocessing and pseudo bulk (prep mode)

We run the following script to create pseudo bulk RNA-seq and ATAC-seq data for each cell cluster:

#cRegulon.py prep [-h] [--name NAME] --rna RNA --rna_meta RNA_META --atac ATAC --atac_meta ATAC_META --species SPECIES (human or mouse)
python3 cRegulon.py prep --name RA --rna ./example_data/RA/scRNA/ --rna_meta ./example_data/RA/RA_scRNA_Cluster.txt --atac ./example_data/RA/scATAC/ --atac_meta ./example_data/RA/RA_scATAC_Cluster.txt -g mouse

This process will produce pseudo bulk files (*PS_RNA.txt, *PS_ATAC.txt, *CellType.txt) for each cell cluster in the PseudoBulk folder.

Step 2: GRN construction (grn mode)

We run the following script to construct regulatory network for each cell cluster (current we support hg38 and mm10):

#python3 cRegulon.py grn [-h] --name NAME --celltype CELLTYPE --genome GENOME --cores CORES
for c in `cat ./PseudoBulk/RA_CellType.txt`
do
    python3 cRegulon.py grn -n RA -ct ${c} -g mm10 -p 20
done

This process will produce GRN files (*network.txt, TFTG_regulationScore.txt, TFName.txt, TGName.txt) for each cell cluster in the Networks folder (The GRN construction is independent for each cell cluster, we can do it parallelly).

Step 3: Running cRegulon model (model mode)

We run the following script of cRegulon model: If we already know or have some expection of the cRegulon number, we can provide this number to cRegulon. For example, we have 9 cRegulons for RA, then we run this script:

#python3 cRegulon.py model [-h] --name NAME --module_number MODULE_NUMBER
python3 cRegulon.py model -n RA -mn 9

If we don't know the cRegulon number, we can provide a range of numbers and cRegulon will use elbow rule to select an optimal number. For example, we guess there may be 4-20 cRegulons for RA, then we run this script:

#python3 cRegulon.py model [-h] --name NAME --module_max MODULE_MAX --module_min MODULE_MIN
python3 cRegulon.py model -n RA -mmin 4 -mmax 20

This will output a folder in "Results" with name you specify: "./RA/"

  1. TF combinatorial effects in each cRegulon: ./Results/RA/X.txt
  2. Association matrix between cell clusters and cRegulons: ./Results/RA/A.txt
  3. TF module of each cRegulon: ./Results/RA/*TFModule.txt
  4. Annotation of each cell cluster with cRegulons: ./Results/RA/Annotation/*subnetwork.txt

Annotation mode of cRegulon (annot mode)

If you only have scRNA-seq data, We run the following script to annotate cells with our pre-computed cRegulons from atlas-level dataset:

#python3 cRegulon.py annot [-h] --name NAME --path_rna PATH_RNA --module_number MODULE_NUMBER
python3 cRegulon.py annot --name PBMC --path_rna ./example_data/PBMC/PBMC_scRNA.txt --module_number 12

Citation:

If you use cRegulon software or cRegulon associated concepts, please cite:

Zhanying Feng, et al. Modeling combinatorial regulation from single-cell multi-omics provides units underpinning cell type landscape. 2024.