Starlitnightly/omicverse

A question about the existence of randomness in OV computed metacells

Closed this issue · 6 comments

Hi Dr. Zeng,

Thanks for the excellent OV tutorials; OV has become my favorite Python package. However, I'm running into problems with randomness when using OV to infer metacells. Following the code in the tutorial 单细胞测序最好的课程(十一) ("The Best Course on Single-Cell Sequencing, Part 11"), I noticed that the UMAP plots I computed don't match the ones in the tutorial. I should emphasize that I did not change the tutorial code at all.

For example (the left plot is from the tutorial, the right is from my run):
image

or
image

The automatically calculated fold-change threshold also differs:

In the tutorial: Fold change threshold: 1.4815978396550413
In my results: Fold change threshold: 2.487487769064068

Even when I manually set fc = 1.48, the difference is still fairly large.
image

Surprisingly, though, there wasn't much difference when visualizing a particular gene.
image

May I ask how this can be explained, and how the tutorial can be reproduced exactly?

As a former R user, I've only recently started using Python for single-cell analysis, so I don't really understand the stochastic behavior of deep learning. Is it possible to fix these random effects in OV?

By the way, would it be possible to add parameters that let users easily add statistics to a barplot? I find labeling p-values on a plot by hand a bit confusing as a user.

Thank you very much for reading! I hope I have articulated the problem clearly -.-

One more question: given the heterogeneity between subgroups or samples, should metacells be calculated independently for each sample? I don't know much about the algorithms; could you explain why? Thanks!

Thanks for your support of omicverse.

1. The UMAP computation depends on a random seed, so the layout can vary between runs; it is enough that the overall visualization is consistent. There is no way to guarantee that the UMAP is exactly the same every time (see how random seeds work in machine learning).
2. Metacell aggregation is also a training process affected by the random seed, so the metacells produced by each run will differ, and so will the fold change. You only need to make sure that the most significantly different genes are consistent.
3. Adding p-values is a valid concern. I will add a plotting function for this in a future version; in the meantime, you can use the following code to plot p-values manually if you need to.

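import matplotlib.pyplot as plt

# Draw a horizontal line from x=0 to x=1 at height y=2
plt.plot((0, 1), (2, 2), c='#000000')
# Label the p-value, centered above the line
plt.text((1 + 0) / 2, 2.5,
         '$p=0.01$', horizontalalignment='center', fontsize=10)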

This code draws a line between x = 0 and x = 1 at a height of 2, with the p-value labelled above it.

Sincerely,
Zehua
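
To illustrate the seed point in items 1 and 2 of the reply above, here is a minimal sketch using only Python's standard library. `noisy_layout` is a hypothetical stand-in for any stochastic step (it is not UMAP and not the metacell model): called with the same seed it returns the same result, while a different seed gives a different one. In a real pipeline one would typically set seeds wherever the libraries expose them, for example `numpy.random.seed`, `torch.manual_seed`, or the `random_state` argument of `scanpy.tl.umap`. Whether that makes an OV metacell run fully repeatable is not confirmed in this thread, and results can still vary across machines and library versions.

```python
import random


def noisy_layout(n_points, seed=None):
    """Hypothetical stand-in for a stochastic embedding step.

    Returns n_points random 2D coordinates. With seed=None the
    generator is seeded from system entropy, so runs differ.
    """
    rng = random.Random(seed)
    return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_points)]


# Same seed: the two "runs" are identical.
seeded_a = noisy_layout(5, seed=0)
seeded_b = noisy_layout(5, seed=0)

# Different seed: the coordinates change, much like a tutorial plot
# produced elsewhere can differ from a local run.
other_seed = noisy_layout(5, seed=1)

print(seeded_a == seeded_b)
print(seeded_a == other_seed)
```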

One more question: given the heterogeneity between subgroups or samples, should metacells be calculated independently for each sample? I don't know much about the algorithms; could you explain why? Thanks!

Yes, you can calculate metacells independently for each sample. In our tutorial, for convenience, I calculate the metacells on the X_scVI embedding (batch effect removed) or on the X_pca embedding (batch effect not removed). I simply join the cell type and sample labels in tandem as celltype-sample.

Zehua
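
The tandem naming described above can be sketched as follows. This is only an illustration under stated assumptions: `tandem_labels` is a hypothetical helper, and the example cell types and sample IDs are made up; nothing here is taken from the omicverse API.

```python
def tandem_labels(celltypes, samples, sep="-"):
    """Join per-cell cell type and sample labels into 'celltype-sample'.

    Hypothetical helper: both inputs are sequences with one entry per cell.
    """
    if len(celltypes) != len(samples):
        raise ValueError("celltypes and samples must have the same length")
    return [f"{c}{sep}{s}" for c, s in zip(celltypes, samples)]


# Made-up example: three cells from two samples.
celltypes = ["T cell", "T cell", "B cell"]
samples = ["S1", "S2", "S1"]
labels = tandem_labels(celltypes, samples)
print(labels)  # ['T cell-S1', 'T cell-S2', 'B cell-S1']
```

With an AnnData object, the same concatenation would typically be applied to two `adata.obs` columns, and the combined column used as the cell type label so that each metacell keeps track of both cell type and sample. The column names are assumptions and depend on your data.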

Thanks for your patient answer. Combining celltype and sample in tandem is a good idea, and I am trying to apply it in my own research. I'm looking forward to omicverse's continued updates.

Hi Dr. Zeng,

One more small question: even though my UMAP plot and metacells show some randomness (I've repeated the run to confirm), why is the number of metacells 250 every time? Is this also a coincidence? If I have a larger dataset, say 100000 cells, will it still be only 250?

Here are the latest results:

ad=meta_obj.predicted(method='soft',celltype_label='celltype-label',
                     summarize_layer='lognorm')
# 100%|██████████| 250/250 [01:06<00:00,  3.74it/s]

ad
# AnnData object with n_obs × n_vars = 250 × 2000
#    obs: 'Pseudo-sizes', 'celltype', 'celltype_purity'

In fact, the number of metacells can be specified. If it is left unset, it defaults to the total number of cells divided by 75 (integer division), that is, one metacell for every 75 cells (this is a recommendation from the authors).

class MetaCell(object):

    def __init__(self, adata, use_rep,
                 n_metacells=None,
                 use_gpu: bool = False,
                 verbose: bool = True,
                 n_waypoint_eigs: int = 10,
                 n_neighbors: int = 15,
                 convergence_epsilon: float = 1e-3,
                 l2_penalty: float = 0,
                 max_franke_wolfe_iters: int = 50,
                 use_sparse: bool = False,) -> None:
        
        if n_metacells is None:
            n_metacells=adata.shape[0]//75
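
The default in the excerpt above can be worked through with a small sketch. `default_n_metacells` is a hypothetical helper that only mirrors the `adata.shape[0]//75` fallback; it is not part of omicverse. It shows why the count does not change between runs (it depends only on the number of cells, not on the random seed) and what the default would be for 100000 cells.

```python
def default_n_metacells(n_cells: int, cells_per_metacell: int = 75) -> int:
    """Mirror the fallback shown above: one metacell per 75 cells."""
    return n_cells // cells_per_metacell


print(default_n_metacells(100_000))  # 1333
print(default_n_metacells(18_750))   # 250
```

Because of the integer division, any dataset with 18750 to 18824 cells gives 250 metacells by default. To get a different number, pass `n_metacells` explicitly when constructing `MetaCell`.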