How to improve same-cell scRNA/scATAC-seq integration results?

Question

How to improve same-cell scRNA/scATAC-seq integration results?

Closed this issue 2 years ago · 2 comments

Greetings developers!

We are using Portal for single-cell same-cell RNA/ATAC-seq integration. However, the output UMAP clusters seem not sufficiently integrated (cells from either assay do not overlap).

Our test data: https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-10-k-1-standard-2-0-0.

For RNA assay, we filtered the cells using Seurat (no normalization applied), and extracted the raw count matrix as an input for Portal.

For ATAC assay, we used ArchR to filter the cells and generate a raw count matrix (GeneScoreMatrix: https://www.archrproject.com/bookdown/calculating-gene-scores-in-archr.html).

Our final input included 10338 cells from the RNA assay and 9939 cells from the ATAC assay, where 9213 cell barcodes intersected. Portal identified 4000 highly variable genes in each assay, of which 267 intersected.

We followed the multiomic integration protocol: https://htmlpreview.github.io/?https://github.com/jiazhao97/Portal-reproducibility/blob/main/Reproduce-PBMC-ATACseq.html, with no deviation except for the input. Since cells from RNA and ATAC assays were concatenated, we expected same cells to overlap, but in the plot they didn't.

We would appreciate your insight!

Thank you,
Yuying

Answer 1 · 2023-03-22T02:03:55.000Z

Hi Yuying,

Thank you for using Portal.

The default setting for integrating scRNA-seq and scATAC-seq is running Portal with lambdacos=10.0, as shown in the tutorial. Alternatively, you may try tuning parameter lambdacos (link) with the following code:

adata_list = [adata_rna, adata_atac]
lowdim_list = portal.utils.preprocess_datasets(adata_list)
integrated_data = portal.utils.integrate_datasets(lowdim_list, search_cos=True)

Sometimes, integration with tuning parameter lambdacos yields better result.

Also, you may try changing the number of highly variable genes.

Best,
Jia

Answer 2 · 2023-03-29T14:52:00.000Z

Hello Jia,

Thank you for your suggestions!

The optimal lambdacos calculated from portal.utils.integrate_datasets is 50.0. This lambdacos decreased overlap of same cells.

hvg_num can be set as the third parameter in model.preprocess(adata_rna, adata_atac, hvg_num=4000) as described in model.py. Increasing hvg_num to 12000 increased the number of shared hvg between layers from 267 to 2797. More shared hvg improved cluster separation but decreased overlap of same cells.

Setting lambdacos=50.0 and hvg_num=12000 simultaneously resulted in the least overlap.

It seems that optimizing lambdacos and hvg_num revealed more distinction between sequencing methods than improved integration of same-cells.

⬆️ nPC30_nLatent20_lamdacos10_hvg4000 (tutorial default)

⬆️ nPC30_nLatent20_lamdacos50_hvg4000

⬆️ nPC30_nLatent20_lamdacos10_hvg12000

⬆️ nPC30_nLatent20_lambdacos50_hvg12000

We speculated that optimizing lambdacos had the opposite effect, because our test multiome dataset has less cells (10k) than the multiome dataset (60k) used in the benchmarking study that used Portal. What would be your insight on this? :)

Since our dataset is smaller, we tweaked the other two parameters of portal.model.Model(npcs=30, n_latent=20, lambdacos=10.0,), described in model.py. At npcs=20, n_latent=30, and hvg_num=12000, we saw the best overlap.

⬆️ nPC15_nLatent10_lamdacos10_hvg4000