simCAS is an embedding-based method for simulating single-cell chromatin accessibility sequencing data. simCAS is a comprehensive and flexible simulator which provides three simulation modes: pseudo-cell-type mode, discrete mode and continuous mode. In pseudo-cell-type mode simCAS simulates data with high resemblance to real data. In discrete or continuous mode, simCAS simualtes cells with discrete or continuous states. simCAS also provides functions to simulate data with batch effects and interactive peaks. The synthetic scCAS data generated by simCAS can benifit the benchmarking of various computational methods.
- If Anaconda (or miniconda) is already installed with Python3, skip to 2) otherwise please download and install Python3 Anaconda from here: https://www.anaconda.com/download/
- Create a new conda environment to run simCAS:
$ conda create -n simCAS python=3.8.10 r-base
$ conda activate simCAS
- In our package, we use rpy2 package to use an R package named ape. The package needs to be installed in the conda environment:
$ R
> install.packages("ape")
- The required Python packages are shown as below:
Requirements:
1. Python 3.8.13 or greater version
2. Packages for simCAS
numpy (>=1.21.0)
pandas (>=1.3.5)
scipy (>=1.4.1)
rpy2 (>=3.5.5)
scikit-learn (>=1.2.0)
scanpy (>=1.9.1)
Bio (>=1.5.2)
anndata (>=0.8.0)
statsmodels (>=0.13.2)
episcanpy (==0.3.2)
3. Packages for tutorials
matplotlib (>=3.5.1)
seaborn (>=0.11.2)
umap-learn (>=0.5.2)
- Install the packages and use simCAS with the following commands:
Package installation:
$ git clone https://github.com/Chen-Li-17/simCAS
$ cd simCAS
$ pip install -r requirements.txt
- For the three basic simulation modes, we provide detailed tutorials:
- We also provide bash commands for users to more conveniently use simCAS:
$ cd simCAS/code
Pesudo-cell-type mode:
$ python simCAS.py --simu_type cell_type --activation sigmod --adata_dir ../data/adata_forsimulation.h5ad
Discrete mode
$ python simCAS.py --simu_type discrete --n_cell_total 1500 --activation exp_linear --tree_text '(((A:0.2,B:0.2):0.2,C:0.4):0.5,(D:1,E:1):1);' --pops_name A B C D E
Continuous mode
$ python simCAS.py --simu_type continuous --n_cell_total 1500 --activation exp_linear --tree_text '((((A:1, B:1)C:1,D:2)E:0.5, F:2)R)'
The details of the parameters are in the Parameter setting part.
To evaluate computational methods more than identify cell states, we also provide the simulation of synthetic scCAS data with user-defined batch effects and peak interactions. Detailed tutorials are shown as below:
The simulation function is simCAS_generate, the details of the parameters are shown as below:
simCAS_generate(
peak_mean=None,
lib_size=None,
nozero=None,
n_peak=100000.0,
n_cell_total=1500,
rand_seed=2022,
zero_prob=0.5,
zero_set='all',
effect_mean=0,
effect_sd=1,
min_popsize=300,
min_pop=None,
tree_text=None,
pops_name=None,
pops_size=None,
embed_mean_same=1,
embed_sd_same=0.5,
embed_mean_diff=1,
embed_sd_diff=0.5,
len_cell_embed=12,
n_embed_diff=10,
n_embed_same=2,
simu_type='discrete',
correct_iter=2,
activation='exp_linear',
two_embeds=True,
adata_dir=None,
lib_simu='estimate',
distribution='Poisson',
bw_pm=0.0001,
bw_lib=0.05,
bw_nozero=0.05,
real_param=False,
K=None,
A=None,
stat_estimation='one_logser',
cell_scale=1.0
)
Description
----------
generate scCAS data with three modes: pseudo-cell-type mode, discrete mode, continuous mode.
Parameter
----------
peak_mean: 1D numpy array, default=None
Real peak mean of scCAS data. Used for statistical estimation.
lib_size: 1D numpy array, default=None
Real library size of scCAS data. Used for statistical estimation.
nozero: 1D numpy array, default=None
Real non-zeros of scCAS data. Used for statistical estimation.
n_peak: int, default=1e5
Number of peaks in the synthetic data.
n_cell_total: int, default=1500
Number of simulating cells.
rand_seed: int, default=2022
Random seed for generation.
zero_prob: float, default=0.5
The probability of zeros in PEM.
zero_set: str, default='all'
How to set the PEM values to zero.
1.'all': set the PEM values to zero with probability zero_prob for the whole PEM.
2.'by_row': set the PEM values to zero with probability zero_prob for each row (peak) of PEM.
effect_mean: float, default=0.0
Mean of the Gaussian distribution, from which the PEM values are sampled.
effect_sd: float, default=1.0
Standard deviation of the Gaussian distribution, from which the PEM values are sampled.
min_popsize: int, default=300
The cell number of the minimal population set in the discrete mode. The number should be less than n_cell_total.
min_pop: str, default=None
The name of the minimal population. This should be contained pops_name
tree_text: str, default=None
A string of Newick format. In discrete mode this is used to define the covariance matrix of different populations. In continuous mode this is used to provide the differentiation trajectory.
pops_name: list, default=None
A list of defined names of populations.
pops_size: list, default=None
The number of cells of corresponding populations in the pops_name.
embed_mean_same: float, default=1.0
Mean of the Gaussian distritbution, from which the homogeneous CEM values are sampled.
embed_sd_same: float, default=0.5
Standard deviation of the Gaussian distritbution, from which the homogeneous CEM values are sampled.
embed_mean_diff: float, default=1.0
Mean of the Gaussian distritbution, with which the heterogeneous CEM values are generated.
embed_sd_diff: float, default=0.5
Standard deviation of the Gaussian distritbution, with which the heterogeneous CEM values are generated.
len_cell_embed: int, default=12
The number of the total embedding dimensions.
n_embed_diff: int, default=10
The number of the heterogeneous embedding dimensions.
simu_type: str, default='cell_type'
1.'cell_type': Pseudo-cell-type mode. Simulate data resembling real data.
2.'discrete': Simulate cells with discrete populations.
3.'continuous': Simulate cells with continuous trajectories.
correct_iter: int, default=2
The iterations of correction.
activation: str, default='exp_linear'
Choose the activation function to convert parameter matrix values to positive.
1.'exp_linear': Used in the discrete mode or continuous mode.
2.'sigmod': Used in the pseudo-cell-type mode.
3.'exp': Used for biological batch effect generation.
adata_dir: str, default=None
The directory of real scCAS anndata.
distribution: str, default='Poisson'
Choose the distribution of data.
1.'Poisson': for count data.
2.'Bernoulli': for binary data.
bw_pm: float, default=1e-4
Band width of the kernel density estimation for peak mean.
bw_lib: float, default=0.05
Band width of the kernel density estimation for library size.
bw_nozero: float, default=0.05
Band width of the kernel density estimation for non-zero.
K: float, default=None
Adjust the slope of the activation function.
A: float, default=None
Adjust the slope of the activation function.
stat_estimation: str, default='one_logser'
Different discrete distributions to fit peak summation.
1. 'zero_logser': a variant of logarithmic distribution.
2. 'one_logser': a variant of logarithmic distribution.
3. 'ZIP' : zero-inflated Poisson distribution.
4. 'NB': Negative Binomial distribution.
4. 'zero_NB': a variant of NB distribution.
5. 'ZINB': zero-inflated Negative Binomial distribution.
cell_scale: float, default=1.0
when conduct pseudo-celltype simulating mode, cell_scale is the magnification of the original cell number.
Return
----------
adata_final: anndata
The simulated scCAS data with anndata format. The cell type information is in the observation.