statOmics/tradeSeq

BPPARAM with fitGAM

Opened this issue · 8 comments

Hi,

I've been trying to get this to work for days and hope people here can offer a solution. Currently, I set 6 knots and I have 7 lineages.

My code is:

Slingshot <- slingshot(SingleCellExperiment, clusterLabels = colData(SingleCellExperiment)$ident, reducedDim = "UMAP", start.clus = StartCluster, approx_point = 150)

library(BiocParallel)
BPPARAM <- BiocParallel::bpparam()
BPPARAM$workers <- 2
counts <- Slingshot@assays@data@listData$counts
slingPseudotime <- slingPseudotime(SlingshotDataSet(Slingshot), na = FALSE)
slingCurveWeights <- slingCurveWeights(SlingshotDataSet(Slingshot))
fitGAM_BPPARAM <- fitGAM(counts = counts, pseudotime = slingPseudotime, cellWeights = slingCurveWeights, conditions = factor(colData(Slingshot)$Exp), nknots = 6, verbose = TRUE, parallel = TRUE, BPPARAM = BPPARAM)

Here is the error message:

Error in reducer$value.cache[[as.character(idx)]] <- values :
  wrong args for environment subassignment
Calls: fitGAM ... .bploop_impl -> .collect_result -> .reducer_add -> .reducer_add
In addition: Warning messages:
1: In asMethod(object) :
  sparse->dense coercion: allocating vector of size 28.6 GiB
2: In .findKnots(nknots, pseudotime, wSamp) :
  Impossible to place a knot at all endpoints. Increase the number of knots to avoid this issue.
3: In parallel::mccollect(wait = FALSE, timeout = 1) :
  1 parallel job did not deliver a result
Execution halted

Many thanks!

Not from the dev team, but here are a few points:

  1. Never give a variable the exact same name as a function; this can lead to weird behavior. So avoid slingPseudotime and slingCurveWeights, and even Slingshot isn't ideal.
  2. Related to 1., you don't even need to provide the pseudotime and cellWeights parameters; simply use sds = SlingshotDataSet(Slingshot) and fitGAM will find those values automatically.
  3. counts can simply be counts = assays(Slingshot)$counts

This would simply give something like this :

fitGAM_BPPARAM <- fitGAM(counts = assays(Slingshot)$counts, sds = SlingshotDataSet(Slingshot), conditions = factor(colData(Slingshot)$Exp), nknots = 6, verbose = TRUE, parallel = TRUE, BPPARAM = BPPARAM)

Let's first see whether you get the same error with the simplified code.

Hi,

Thank you for those tips. Unfortunately, I got the same error message. I came across another post of yours (#259) and wonder if you could show me how to use "SnowParam()".

Many thanks

Andrea
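For reference, a SnowParam backend can be constructed roughly as follows. This is a minimal sketch, not from the thread; the worker count is an assumption and should be adjusted to your machine.

```r
# Sketch only: a socket-cluster backend as an alternative to the
# forked multicore default. Worker count here is an assumption.
library(BiocParallel)

BPPARAM <- SnowParam(workers = 2, type = "SOCK", progressbar = TRUE)
# then pass it along: fitGAM(..., parallel = TRUE, BPPARAM = BPPARAM)
```

SnowParam starts fresh R worker processes over sockets rather than forking, which sometimes avoids the mccollect failures seen with the multicore backend.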

Hi everyone,

I've been trying to get fitGAM() to run for a long, long time. Beyond the suggestions from @Alexis-Varin, using a partition equipped with very large memory helped a lot. In my case, with a dataset of ~30,000 genes, 6 samples (2 conditions), ~54,000 cells, and 3 lineages, the job finished on a 6 TB memory partition in 10 days. The maximum memory used was ~200 GB.

Now I'm trying to run a dataset with 14 samples, ~110000 cells, and 7 lineages.

Hi @AndreaYCT, I will hopefully have time soon to update my own issue with what I have found so far. To sum up, I managed to make BatchtoolsParam work on a Slurm partition by modifying the source code of fitGAM: a dataframe is fed to the bplapply call that calculates the GAM for each gene inside fitGAM, which is apparently what throws an error in batchtools; I changed it to a list and everything works. I now compute a 40k cells by 14k genes dataset on 7 lineages in 4 hours on 48 cores and 300 GB instead of 8+ days and 1 TB.
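For anyone trying the Slurm route described above, a BatchtoolsParam can be sketched like this. The template file name and resource values are assumptions for illustration; they depend entirely on your cluster's batchtools template.

```r
# Sketch only: a Slurm-backed BiocParallel backend via batchtools.
# "slurm_template.tmpl", walltime and memory values are hypothetical;
# adapt them to your own cluster configuration.
library(BiocParallel)

BPPARAM <- BatchtoolsParam(
  workers   = 48,                     # number of Slurm jobs
  cluster   = "slurm",
  template  = "slurm_template.tmpl",  # hypothetical batchtools template file
  resources = list(walltime = 14400, memory = "8G", ncpus = 1),
  log       = TRUE                    # per-iteration logs, useful for spotting slow genes
)
```

The per-job logs are what made it possible to identify the handful of genes that dominate the runtime, as described below in the thread.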

I also found that the days-long computation is not due to an overall long computation for each gene (i.e. an equal amount of time spent fitting the GAM for every gene) but to only a handful of genes that can take days to fit. I was able to see this because BatchtoolsParam keeps a log for each iteration of bplapply inside fitGAM: >95% of genes were fitted in a few seconds, but some took hours. Looking at these genes, I found they are all very lowly represented (expressed in 3, 5 cells, etc.) and with counts > 1. The higher the counts per cell across very few cells, the longer it takes (for example, 3 cells with counts of 2, 4, and 1 in a 40k cells dataset leads to an incredibly long calculation, dozens of hours). One gene in particular was expressed in a single cell with a count of 52; fitting the GAM on it never finished after days.

So I filtered my matrix to only keep genes expressed in at least 10 cells, which is what brought me down to 14k genes. This removes all the problematic genes without losing information, since a gene expressed in fewer than 10 cells is unlikely to contribute to fate choice or to anything in particular; it is just noise.

Finally, I discovered that using FindVariableFeatures is very much not recommended, for the reason explained above: it includes quite a few very lowly expressed genes (my gene expressed in a single cell with a count of 52 was among the top 3000 selected) and leads to very long calculations. On the other hand, using deviant genes (as calculated by devianceFeatureSelection from the scry package) leads to fast computation because all selected genes are well expressed in the dataset.

Conclusion / TLDR:
If computational resources are limited, using the top 3000-5000 deviant genes captures the vast majority of the information (genes contributing to pseudotime) with very fast fitGAM computation; avoiding variable genes is very important. If you have access to cluster computing, filtering the counts matrix to genes expressed in more than 10 cells gives the best result. Either way, avoid keeping all genes.

Matrix filtering:

mat <- assays(sds)$counts
mat <- mat[rowSums(mat > 0) > 10, ]
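And for the deviance-based alternative, a minimal sketch, assuming a SingleCellExperiment object sce with a "counts" assay (the name sce and the cutoff of 3000 genes are illustrative):

```r
# Sketch only: selecting top deviant genes with scry
# (assumes a SingleCellExperiment 'sce' with a "counts" assay).
library(scry)

sce <- devianceFeatureSelection(sce, assay = "counts", sorted = TRUE)
# rows are now ordered by decreasing binomial deviance
# (stored in rowData(sce)$binomial_deviance)
top_genes <- head(rownames(sce), 3000)
counts <- assays(sce)$counts[top_genes, ]
```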

Hi, @Alexis-Varin

Thanks for all this testing and explaining; your experience is very useful. I will happily start over.

Andrea

Hi, @Alexis-Varin

I'd like to make sure I understand the filter code:

(1) Is sds the result of slingshot()?
(2) Is mat the "counts" for fitGAM()?
(3) Do I feed the non-filtered sds (slingshot result) to fitGAM()?

Many thanks!

Andrea

Hi @AndreaYCT, yes sds is the result of slingshot. Here is what, based on your code above, I would do:

Slingshot <- slingshot(SingleCellExperiment, clusterLabels = colData(SingleCellExperiment)$ident, reducedDim="UMAP", start.clus=StartCluster, approx_point = 150)

library(BiocParallel)
BPPARAM <- BiocParallel::bpparam()

counts <- assays(Slingshot)$counts
counts <- counts[rowSums(counts > 0) > 10, ]

fitGAM_BPPARAM <- fitGAM(counts = counts, sds = SlingshotDataSet(Slingshot), conditions = factor(colData(Slingshot)$Exp), nknots = 6, verbose = TRUE, parallel = TRUE, BPPARAM = BPPARAM)

Hi, @Alexis-Varin

Thank you so much!

Andrea
