How to improve same-cell scRNA/scATAC-seq integration results?
Closed this issue · 2 comments
Greetings developers!
We are using Portal for single-cell same-cell RNA/ATAC-seq integration. However, the output UMAP clusters seem not sufficiently integrated (cells from either assay do not overlap).
Our test data: https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-10-k-1-standard-2-0-0.
For RNA assay, we filtered the cells using Seurat (no normalization applied), and extracted the raw count matrix as an input for Portal.
For ATAC assay, we used ArchR to filter the cells and generate a raw count matrix (GeneScoreMatrix: https://www.archrproject.com/bookdown/calculating-gene-scores-in-archr.html).
Our final input included 10338 cells from the RNA assay and 9939 cells from the ATAC assay, where 9213 cell barcodes intersected. Portal identified 4000 highly variable genes in each assay, of which 267 intersected.
We followed the multiomic integration protocol: https://htmlpreview.github.io/?https://github.com/jiazhao97/Portal-reproducibility/blob/main/Reproduce-PBMC-ATACseq.html, with no deviation except for the input. Since cells from RNA and ATAC assays were concatenated, we expected same cells to overlap, but in the plot they didn't.
We would appreciate your insight!
Thank you,
Yuying
Hi Yuying,
Thank you for using Portal.
The default setting for integrating scRNA-seq and scATAC-seq is running Portal with lambdacos=10.0
, as shown in the tutorial. Alternatively, you may try tuning parameter lambdacos
(link) with the following code:
adata_list = [adata_rna, adata_atac]
lowdim_list = portal.utils.preprocess_datasets(adata_list)
integrated_data = portal.utils.integrate_datasets(lowdim_list, search_cos=True)
Sometimes, integration with tuning parameter lambdacos
yields better result.
Also, you may try changing the number of highly variable genes.
Best,
Jia
Hello Jia,
Thank you for your suggestions!
The optimal lambdacos
calculated from portal.utils.integrate_datasets
is 50.0. This lambdacos
decreased overlap of same cells.
hvg_num
can be set as the third parameter in model.preprocess(adata_rna, adata_atac, hvg_num=4000)
as described in model.py. Increasing hvg_num
to 12000 increased the number of shared hvg between layers from 267 to 2797. More shared hvg improved cluster separation but decreased overlap of same cells.
Setting lambdacos=50.0
and hvg_num=12000
simultaneously resulted in the least overlap.
It seems that optimizing lambdacos
and hvg_num
revealed more distinction between sequencing methods than improved integration of same-cells.
⬆️ nPC30_nLatent20_lamdacos10_hvg4000 (tutorial default)
⬆️ nPC30_nLatent20_lamdacos50_hvg4000
⬆️ nPC30_nLatent20_lamdacos10_hvg12000
⬆️ nPC30_nLatent20_lambdacos50_hvg12000
We speculated that optimizing lambdacos
had the opposite effect, because our test multiome dataset has less cells (10k) than the multiome dataset (60k) used in the benchmarking study that used Portal. What would be your insight on this? :)
Since our dataset is smaller, we tweaked the other two parameters of portal.model.Model(npcs=30, n_latent=20, lambdacos=10.0,)
, described in model.py. At npcs=20
, n_latent=30
, and hvg_num=12000
, we saw the best overlap.
⬆️ nPC15_nLatent10_lamdacos10_hvg4000
⬆️ nPC15_nLatent20_lamdacos10_hvg4000
⬆️ nPC15_nLatent30_lamdacos10_hvg4000
⬆️ nPC20_nLatent10_lamdacos10_hvg4000
⬆️ nPC20_nLatent20_lamdacos10_hvg4000
⬆️ nPC20_nLatent30_lamdacos10_hvg4000
More hvg:
⬆️ nPC15_nLatent10_lamdacos10_hvg12000
⬆️ nPC15_nLatent20_lamdacos10_hvg12000
⬆️ nPC15_nLatent30_lamdacos10_hvg12000
⬆️ nPC20_nLatent10_lamdacos10_hvg12000
⬆️ nPC20_nLatent20_lamdacos10_hvg12000
⬆️ nPC20_nLatent30_lamdacos10_hvg12000 (best)
⬆️ nPC20_nLatent40_lamdacos10_hvg12000