memory shortage after finalizing the model fitting
Abeer-hes opened this issue · 6 comments
Hello @tradeSeq,
I am running fitGAM on a dataset of ~18k genes, 27k cells, and 3 lineages using 6 knots. After a long processing time, the function fails at the end of the run with "cannot allocate vector of size 3.7 Gb", even when that much memory is free, just not as a contiguous block. Since there was no issue during processing and the error only occurs when writing the output, I tried to pre-allocate the memory using a mock sce object occupying the expected size, to be overwritten with the real model. This works for small subsets of genes that failed before (as long as the final sce is around 1 Gb), but not for the output of all genes. What would you recommend to work around the output issue?
At the moment I can't run the function on the HPC terminal (OS CentOS 7) or manage to overcome the incompatibility issues with the R build available on the cluster (I get "illegal instruction" or "illegal operand" errors with evaluateK and fitGAM), so unfortunately that alternative is not available yet.
Apologies in advance for the simplistic questions, and much appreciation for any helpful input.
Best,
Hi @Abeer-hes, did you use parallelization in the fitting? I think that memory issues can happen when combining the results from all the workers.
As you are already trying, the best solution would be to increase the memory to a value large enough to store the results; you are likely to need considerably more than the mentioned 3.7 Gb. If this does not work you could, e.g., fit half of the genes at a time by running fitGAM twice with the genes argument (using the genes argument ensures that the same normalization is used for all genes), and then create a new SCE that combines the results of both runs. The fitting results are stored in rowData(sce)$tradeSeq, and there are also results stored in metadata(sce) and colData(sce).
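For orientation, a minimal sketch of where those components live; sce stands for the object returned by fitGAM, beta, dm, and X are the accessors used later in this thread, and Sigma is, to my understanding, stored alongside beta (names may differ across tradeSeq versions):
library(SingleCellExperiment)
# assuming sce is the object returned by fitGAM
beta <- rowData(sce)$tradeSeq$beta    # per-gene smoother coefficients
Sigma <- rowData(sce)$tradeSeq$Sigma  # per-gene coefficient covariances
dm <- colData(sce)$tradeSeq$dm        # per-cell design data, shared by all genes
X <- colData(sce)$tradeSeq$X          # per-cell design matrix
str(metadata(sce))                    # remaining fit-level information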
Thank you so much @koenvandenberge for the informative response.
I think the fastest way to go is to try to combine the final sce by binding the different components.
Best,
Hello @koenvandenberge
Sorry for asking again, but I tried to combine the parts (gene-subsetted) of the full data using SingleCellExperiment::rbind. I am facing an issue combining colData(sce)$tradeSeq, since the different parts have slightly different models.
The code that works so far:
# save tradeSeq from colData
tradeSeq_colData1 <- sce1$tradeSeq
tradeSeq_colData2 <- sce2$tradeSeq
# neutralize tradeSeq in colData so rbind does not complain
sce1$tradeSeq <- 0
sce2$tradeSeq <- 0
sce <- SingleCellExperiment::rbind(sce1, sce2)
# naively restore the tradeSeq colData of only one part
sce$tradeSeq <- tradeSeq_colData1
Otherwise I get the error below, since the values are not the same across the parts:
Error in FUN(X[[i]], ...) :
column(s) 'tradeSeq' in ‘colData’ are duplicated and the data do not match
However, the values in dm and X are different between the parts, as expected. Any advice on how to move forward?
Best,
Hi @koenvandenberge,
I'm also running into memory shortages, especially when including conditions in the model fitting (20,000 cells).
I like the approach of running fitGAM() sequentially over parts of the count matrix, but I am also not too confident putting the results back together. It would be great if you could share a quick example of how to join them correctly.
Also, since I am using the conditions parameter, would it be possible to subset each condition prior to running fitGAM() on each condition separately, thus reducing the memory per run, and then put the results back together?
Best,
Florian
Hi all,
It is crucial to set a seed prior to running fitGAM to ensure the same assignment of cells to lineages. Once you do that, a simple rbind does the trick. Below you can find code using the example data in the package.
library(SingleCellExperiment)
library(tradeSeq)
## all data
data(crv, package="tradeSeq")
data(countMatrix, package="tradeSeq")
set.seed(3)
sceGAM <- fitGAM(counts = as.matrix(countMatrix),
                 sds = crv,
                 nknots = 5)
## note setting seed prior to running fitGAM
## is needed to ensure same assignment of cells to lineages
set.seed(3)
sceGAM1 <- fitGAM(counts = as.matrix(countMatrix),
                  sds = crv,
                  nknots = 5,
                  genes = 1:120)
set.seed(3)
sceGAM2 <- fitGAM(counts = as.matrix(countMatrix),
                  sds = crv,
                  nknots = 5,
                  genes = 121:240)
## the per-cell design data and design matrices should match across runs
dm1 <- sceGAM1$tradeSeq$dm
dm2 <- sceGAM2$tradeSeq$dm
all.equal(dm1, dm2)
X1 <- sceGAM1$tradeSeq$X
X2 <- sceGAM2$tradeSeq$X
all.equal(X1, X2)
sceGAMAll <- rbind(sceGAM1, sceGAM2)
all.equal(unname(as.matrix(rowData(sceGAMAll)$tradeSeq$beta)), unname(as.matrix(rowData(sceGAM)$tradeSeq$beta)))
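To go beyond two pieces, a minimal sketch under the same assumptions (package example data, the same seed set before every call; the chunk size of 120 genes is arbitrary):
## split genes into chunks and fit each chunk with the same seed
geneChunks <- split(seq_len(nrow(countMatrix)),
                    ceiling(seq_len(nrow(countMatrix)) / 120))
fits <- lapply(geneChunks, function(idx) {
  set.seed(3)  # same seed before every call, as above
  fitGAM(counts = as.matrix(countMatrix), sds = crv,
         nknots = 5, genes = idx)
})
## rbind accepts multiple SCEs, so all chunks can be combined at once
sceGAMAll <- do.call(rbind, fits)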
Many thanks @koenvandenberge for the reply, it solved the issue with running the fitting on the large dataset.
I think I was previously calling set.seed once for the whole R session. Calling it before each fitGAM run unifies the assignment of cells to lineages, and that seems to be the needed trick.
There may be scenarios where a random seed is useful, but for this use case (combining fits over different subsets, e.g. of genes, of the same original data or sce), I was wondering whether setting the seed could become a parameter of the function (similar to the genes parameter).
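In the meantime, a minimal sketch of what such a wrapper could look like on the user side (fitGAMSeeded is hypothetical, not part of tradeSeq):
# hypothetical helper, not part of tradeSeq: fixes the seed per call
fitGAMSeeded <- function(..., seed = 3) {
  set.seed(seed)
  fitGAM(...)
}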
Thanks again.
Best,