gao-lab/Cell_BLAST

Building models with subsets

Closed this issue · 1 comments

Hi!
When I try to build a model with a subset of the HDF5 files provided by Cell Blast, I find that DIRECTI seems to be stuck. I think both Campbell and Campbell_subset can be used to build models.

What I have to say include:

  1. I will use list or convert the list to array.ndarray (Campbell_subset.uns['expressed_genes1']), and later building the model for campbell_subset is unsuccessful.
  2. But the same method can successfully model Campbell.

Here is my code:
Campbell = cb.data.ExprDataSet.read_dataset("Campbell.h5")
list=['Ace2','Adora2a','Aldh1l1','Amigo2','Ano3','Aqp4'] #These genes are all contained in the expressed genes, and I've only listed a few
Campbell_subset = Campbell[:, list]

%%capture
startblast=time.time()
models = []
for i in range(4):
models.append(cb.directi.fit_DIRECTi(
Campbell_subset,Campbell_subset.uns['expressed_genes1'], latent_dim=10, cat_dim=20,
random_seed=i, path="Campbell_subset_blast_models2/model_%d" % i
))

%%capture
start_time=time.time()
models = []
for i in range(4):
models.append(cb.directi.fit_DIRECTi(
Campbell,Campbell.uns['expressed_genes1'], latent_dim=10, cat_dim=20,
random_seed=i, path="Campbell_blast_models2/model_%d" % i
))

Sincerely waiting for your reply!
Thanks!

Thanks for the report!

After some digging, I found that this is caused by cells containing no reads in the gene subset. Internally, the DIRECTi encoder normalizes cells by their total read count (row sum). So, cells with no reads would produce 0 / 0 = NaN after the normalization. These NaN values causes tensorflow to block instead of throwing an error...

Would you please try removing cells with 0 total count in Campbell_subset, e.g.,

Campbell_subset = Campbell_subset[Campbell_subset.exprs.sum(axis=1) != 0, :]

And see if the model is no longer stuck?

Meanwhile, I recommend doing gene selection via the genes argument of fit_DIRECTi:

models = []
for i in range(4):
    models.append(cb.directi.fit_DIRECTi(
        Campbell_subset, genes=list, latent_dim=10, cat_dim=20,
        random_seed=i, path="Campbell_subset_blast_models2/model_%d" % i
    ))

So that fit_DIRECTi will use the total count in all genes when doing the normalization and avoid the 0 / 0 = NaN problem, while the model is indeed fitted on genes specified in the gene list.

Thanks again for the report! Let me know if any further issues.