Reducing dataset size to improve run time

Question

fluentin44 opened this issue 2 years ago · 4 comments

Hi,

I have a dataset of ~25k cells and 130 samples, so the computation time and memory needed to run fitGAM are going to be an issue for me. I have seen recommendations to restrict the genes passed to the function to the top 2k variable features. Can I clarify: does that mean reducing the whole counts matrix to those 2k features, or keeping the whole counts matrix and passing the names of the top 2k variable features to the genes argument?

Thanks,
Matt

Answer 1 · 2022-12-08T17:23:38.000Z

Hi @fluentin44

We generally recommend supplying the entire count matrix (if memory allows) and then specifying the genes you would like to fit through the genes argument.
This way, the entire count matrix is still used for normalization.

Hope this helps.

Answer 2 · 2022-12-09T07:54:02.000Z

Ok much appreciated!

Thanks,
Matt

Answer 3 · 2023-03-07T13:00:21.000Z

Hi,
Great tool!
I have a couple of related questions.

- Focusing on the 2k genes, it seems that subsetting the counts before running fitGAM and providing the full counts with genes=2kgenes give different results. Is that possible? Is anything happening aside from normalization that uses information from other genes during the fitting?

- If the highly variable genes are selected precisely because they capture differences among lineages, wouldn't that be a source of bias during normalization?

Thanks a lot

Answer 4 · 2023-12-04T15:12:46.000Z

Hi @castaway1990

> Focusing on the 2k genes, it seems that subsetting the counts before running fitGAM and providing the full counts with genes=2kgenes give different results. Is that possible? Is anything happening aside from normalization that uses information from other genes during the fitting?

Yes, that is possible. If you first subset to the 2k genes and then run fitGAM, the normalization factors are estimated from those 2k genes alone. If you instead provide the full count matrix and use the genes argument to identify the 2k genes you would like to fit, the normalization still uses all genes to calculate the normalization factors. This should be the only difference.
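To make the difference concrete, here is a minimal sketch in plain Python. It does not call tradeSeq and does not implement TMM. As a stand-in it uses simple library-size scaling, and the gene names and counts are made up. It only illustrates that per-cell factors estimated from a subset of genes need not match those estimated from the full matrix.

```python
# Toy counts: keys are genes, each list holds one value per cell (made-up numbers).
counts = {
    "geneA": [10, 20, 5],
    "geneB": [0, 40, 5],
    "geneC": [90, 140, 40],
    "geneD": [100, 200, 50],
}
selected = ["geneA", "geneB"]  # hypothetical stand-in for the 2k variable genes


def size_factors(matrix):
    """Per-cell library size divided by the mean library size.

    Assumption: a simplified stand-in for TMM, used only for illustration.
    """
    n_cells = len(next(iter(matrix.values())))
    lib = [sum(row[j] for row in matrix.values()) for j in range(n_cells)]
    mean_lib = sum(lib) / n_cells
    return [x / mean_lib for x in lib]


# Factors estimated from every gene (full matrix, genes chosen separately).
full_factors = size_factors(counts)

# Factors estimated after subsetting the matrix to the selected genes first.
subset_factors = size_factors({g: counts[g] for g in selected})

print("full matrix:", [round(f, 3) for f in full_factors])
print("subset only:", [round(f, 3) for f in subset_factors])
```

The two sets of factors differ, so the same gene would be fitted against different offsets depending on which matrix was supplied.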

> If the highly variable genes are selected precisely because they capture differences among lineages, wouldn't that be a source of bias during normalization?

If there are large systematic differences between the groups you are comparing, this can indeed be an issue in normalization. In tradeSeq, we rely on TMM normalization as described here. One of its main assumptions is that the majority of genes are not differentially expressed. I would advise against providing only the subsetted count matrix to fitGAM, and would instead recommend providing the full count matrix and using the genes argument to specify the genes you're interested in.
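The role of the "most genes are not differentially expressed" assumption can be sketched with a toy example. The code below is plain Python and uses a median of per-gene ratios as a deliberately simplified stand-in for a robust scaling factor. It is not the TMM procedure, and all gene names and values are invented for illustration.

```python
from statistics import median

# Two toy expression profiles with the same true sequencing depth.
# Genes g1..g8 are unchanged; hv1 and hv2 are 4x higher in profile_b.
profile_a = {f"g{i}": 50 for i in range(1, 9)}
profile_a.update({"hv1": 20, "hv2": 30})
profile_b = dict(profile_a, hv1=80, hv2=120)


def median_ratio(a, b, genes):
    """Scaling factor of b relative to a: the median per-gene ratio.

    Assumption: a simplified robust estimator, not TMM itself.
    """
    return median(b[g] / a[g] for g in genes)


all_genes = list(profile_a)
variable_only = ["hv1", "hv2"]  # hypothetical highly variable genes

# With all genes, the unchanged majority anchors the factor at 1.
factor_full = median_ratio(profile_a, profile_b, all_genes)

# With only the variable genes, the real biological change is
# mistaken for a depth difference.
factor_subset = median_ratio(profile_a, profile_b, variable_only)

print(factor_full, factor_subset)  # prints: 1.0 4.0
```

In this toy case, normalizing with the subset-derived factor would scale away the very difference between the profiles that the variable genes were selected for.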