broadinstitute/CellBender

Mismatch between summary for algorithm convergence and learning curve

KaczorowskiLab opened this issue · 0 comments

Hi @sjfleming,

Thank you for developing and maintaining this very useful tool. I have two questions stemming from use of the tool on our snRNA-seq dataset. All samples in the set were run at a learning rate (LR) of 1e-5, because more than 50% of the samples produced warnings when run at the default LR.

Question 1: This is related to the assessment of algorithm convergence. Two samples in my set had the "slightly unusual behavior" warning in the summary. However, examining the learning curves, the shape and values don't seem far off from other samples where the summary reported the curve as "normal". The learning curves also look slightly better at the lower learning rate (where the warning appears) than at the default value. Performance on the remaining metrics is also comparable. Is this a false warning?

Sample 1: LR = 1e-5 (summary gives warning to try lower LR)

Screenshot 2024-08-06 at 11 15 22 AM

Similar looking learning curve from a different sample at the same LR (summary says curve looks normal)

Screenshot 2024-08-06 at 11 16 55 AM

Sample 1: LR = 1e-4 (summary says curve looks normal)

Screenshot 2024-08-06 at 11 18 37 AM

The shape of the learning curve for Sample 2 is similar and follows the same trend as Sample 1.

Question 2: This is related to the genes removed and the accompanying warnings. All of the samples in the set show both the top 10 genes in the table and a huge list of warnings for genes not included in the table. The warnings in #342 are related to the genes shown in the table. Based on your comment in #292, it seems about 80-90% of the counts associated with these genes are also being removed from the cells (fraction_removed_cells). However, this information is not shown for all the genes included in the warnings. Is this normal behavior? Since the lists of genes are also not consistent between samples, is there a way to extract this information to cross-check some of the individual genes (as in your comment in #342)? Here are two example outputs (warning list truncated in the image):

Screenshot 2024-08-06 at 11 34 42 AM

Not all genes are mitochondrial in all samples:

Screenshot 2024-08-06 at 11 43 30 AM
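
To make the cross-check in Question 2 concrete, here is a minimal sketch of the per-gene quantity I would like to extract for the genes listed in the warnings. It assumes the raw and CellBender-corrected count matrices and the cell calls have already been loaded as arrays. The function name, variable names, and toy data are hypothetical and not part of CellBender, and my reading of fraction_removed_cells as "counts removed from cells divided by raw counts in cells" is an assumption.

```python
import numpy as np


def fraction_removed_per_gene(raw_counts, denoised_counts, cell_mask):
    """Per-gene fraction of counts removed, restricted to called cells.

    raw_counts, denoised_counts: arrays of shape (n_barcodes, n_genes)
    cell_mask: boolean array of shape (n_barcodes,), True for called cells
    (Hypothetical helper; this definition of the fraction is an assumption.)
    """
    mask = np.asarray(cell_mask, dtype=bool)
    # Sum counts per gene over cell barcodes only
    raw = np.asarray(raw_counts, dtype=float)[mask].sum(axis=0)
    out = np.asarray(denoised_counts, dtype=float)[mask].sum(axis=0)
    removed = raw - out
    # Genes with zero raw counts in cells get NaN instead of a divide-by-zero
    return np.divide(removed, raw, out=np.full_like(raw, np.nan), where=raw > 0)


# Toy data: 3 barcodes x 3 genes; the last barcode is an empty droplet
raw = np.array([[10, 0, 4], [10, 0, 6], [5, 0, 1]])
denoised = np.array([[2, 0, 4], [0, 0, 5], [0, 0, 0]])
is_cell = np.array([True, True, False])

frac = fraction_removed_per_gene(raw, denoised, is_cell)
print(frac)
```

With something like this, the genes in the warning list could be looked up by column index and compared across samples, if the same quantity is not already available in the CellBender outputs.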