buenrostrolab/FigR

Determine pairs through optimized bipartite matching ..

Opened this issue · 16 comments

Hello, I encountered this problem while using the FigR::pairCells function to match cells between a scRNA dataset of ~5000 cells and a scATAC dataset of ~5000 cells. RStudio has given no feedback for several days after printing 'Determing pairs through optimized bipartite matching ..'. I cannot tell whether it will eventually succeed. Is this step really that time-consuming? What can I do to speed it up or check its running state?

I have the same problem and have not found a solution yet. Did you solve it?

I have tried reinstalling, but nothing changes...

Hey bro, I think something may be wrong with this package. I have tried different PCs and a Linux system, and it stops at the same place you described: 'Determing pairs through optimized bipartite matching'

Hi there - apologies for the delayed response on this issue. I'll look into this shortly. This is a bit strange, and I haven't encountered it in the past (others have been able to run the same function to completion, but there are some constraints on the parameters relative to the size of the input dataset, which maybe we can figure out here). A few questions: do all the logs print OK up until that point (e.g. number of cells being paired, subgraph size, search threshold params etc.)? Also, can you try running this on only the first 100 or 500 cells of your 5,000 cell dataset (a toy test)? That would confirm that you can get from start to end, and that it is the match step that gets stuck or takes too long when applied to your full dataset (a scaling issue).
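For anyone wanting to try the toy test suggested above, here is a minimal sketch of taking the first N cells. It assumes the pairCells inputs are cells-by-components matrices with cells in rows; `take_first_cells`, `atac_cc` and `rna_cc` are hypothetical names used for illustration, and the matrices are synthetic stand-ins rather than real data.

```r
# Hypothetical helper (not part of FigR): keep the first n cells (rows).
take_first_cells <- function(mat, n) {
  stopifnot(is.matrix(mat), n >= 1)
  mat[seq_len(min(n, nrow(mat))), , drop = FALSE]
}

# Synthetic stand-ins for the real cells x components inputs.
set.seed(1)
atac_cc <- matrix(rnorm(5000 * 10), nrow = 5000,
                  dimnames = list(paste0("ATAC_", 1:5000), NULL))
rna_cc  <- matrix(rnorm(5000 * 10), nrow = 5000,
                  dimnames = list(paste0("RNA_", 1:5000), NULL))

# Toy subsets of 500 cells each.
atac_toy <- take_first_cells(atac_cc, 500)
rna_toy  <- take_first_cells(rna_cc, 500)
dim(atac_toy)

# With FigR installed and real inputs, the toy run could then be timed, e.g.:
# system.time(
#   toy_pairing <- FigR::pairCells(ATAC = atac_toy, RNA = rna_toy, keepUnique = TRUE)
# )
```

If the first rows of a dataset all come from one sample or cluster, a random subset (for example via `sample(nrow(mat), n)`) may be a more representative toy test.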

The logs printed before that point are as follows:
Constructing KNN graph for computing geodesic distance ..
Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
Also defined by ‘spam’
Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
Also defined by ‘spam’
Computing graph-based geodesic distance ..

KNN subgraphs detected:

1
Skipping subgraphs with either ATAC/RNA cells fewer than: 50 ..
Pairing cells for subgraph No. 1
Total ATAC cells in subgraph: 4904
Total RNA cells in subgraph: 4904
Subgraph size: 4904
Search threshold being used: 1962
[1] "Constructing KNN based on geodesic distance to reduce search pairing search space"
[1] "Number of cells being paired: 4904 ATAC and 4904 RNA cells"
I did try running this function with 1000k scRNA-seq cells and 1000k scATAC-seq cells, but nothing changed. My operating system is Ubuntu 22.04 LTS. Looking forward to your further response.

Can you add the following parameter to your FigR::pairCells call?

search_range=0.05

And try again? Essentially I think it's taking a very long time because the number of possible options it has to iterate over is very large. Reducing that search space (default 0.2*number of cells) might help here, so I suggested trying with a smaller fraction. Let me know what you see
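As a rough worked example of what a smaller search_range does: the logged threshold of 1962 is consistent with 0.2 × (4904 + 4904) = 1961.6, rounded. The formula below is only inferred from that log line, not taken from FigR's source, and `search_threshold` is a hypothetical helper for illustration.

```r
# Assumption inferred from the log above, NOT confirmed from FigR's source:
# threshold = round(search_range * (number of ATAC cells + number of RNA cells))
search_threshold <- function(search_range, n_atac, n_rna) {
  round(search_range * (n_atac + n_rna))
}

n_atac <- 4904
n_rna  <- 4904

# Threshold for the default (0.2) and for the smaller fractions suggested here.
sapply(c(0.2, 0.05, 0.01), search_threshold, n_atac = n_atac, n_rna = n_rna)
# 1962 490 98
```

Under that assumption, going from 0.2 to 0.05 shrinks the per-cell search threshold roughly fourfold, which is why the matching step may have far fewer candidate pairs to evaluate.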

How long does this step ordinarily take with fewer than ten thousand cells? Did you change that parameter when running the pipeline? Thanks!!!

I guess the problem is caused by optmatch, specifically the optmatch::fullmatch step in

cell_matches <- suppressWarnings(optmatch::fullmatch(optmatch::as.InfinitySparseMatrix(as.matrix(geodist_knn)),

I ran the code step by step with your toy data and saw the same behavior as before.

Have you resolved it? I have been stuck in this step for half a month.

Hi there - were you able to try my above suggestion of changing the default search_range parameter?

You can specify this directly in the pairCells call, in addition to your ATAC/RNA input, something like:


cellPairing <- pairCells(ATAC = myATAC_CC,
                         RNA = myRNA_CC,
                         search_range = 0.05,
                         keepUnique = TRUE)

The reason that fullmatch step is taking long is because you're evaluating over a very large number of possible pairs, so I had suggested reducing that using the search_range parameter (default is 0.2*num cells, so try reducing to 0.05 or 0.01 and see if it helps speed it up).

By increasing search_range, the fullmatch function can partly run successfully.
But there is still some trouble with the toy data. With the same settings as before (tol = 0.0001, max_multimatch = 5), the fullmatch result for the second subgraph of chunk 1 is all NA, and get_pair_list then throws a new error because length(cell_matches) = 0.

I tried decreasing min.controls (equivalent to increasing max_multimatch), and that solved it. Is this the right approach?
Or should I modify the code to skip a subgraph when the match list is all NA?
My sessionInfo is below:

R version 4.2.2 (2022-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /xxxxx/lib/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] FigR_0.1.0                  motifmatchr_1.20.0
 [3] chromVAR_1.20.2             cowplot_1.1.1
 [5] dplyr_1.1.0                 SummarizedExperiment_1.28.0
 [7] Biobase_2.58.0              GenomicRanges_1.50.0
 [9] GenomeInfoDb_1.34.8         IRanges_2.32.0
[11] S4Vectors_0.36.0            BiocGenerics_0.44.0
[13] MatrixGenerics_1.10.0       matrixStats_0.63.0
[15] BuenColors_0.5.6            ggplot2_3.4.1
[17] MASS_7.3-58.2               doParallel_1.0.17
[19] iterators_1.0.14            foreach_1.5.2
[21] uwot_0.1.14                 Matrix_1.5-3
[23] pracma_2.4.2                FNN_1.1.3.1
[25] igraph_1.4.1
..........
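On the question above about skipping a subgraph whose match result is all NA, a guard could look roughly like the sketch below. This is not FigR's code: `has_usable_matches` is a hypothetical helper, and the toy factors only stand in for what fullmatch might return per subgraph. Whether skipping is the right behavior is a question for the maintainers.

```r
# Hypothetical guard (not part of FigR): TRUE only when a match result
# is non-empty and contains at least one non-NA assignment.
has_usable_matches <- function(cell_matches) {
  length(cell_matches) > 0 && !all(is.na(cell_matches))
}

# Toy stand-ins for per-subgraph match results (factors of match-group labels).
match_results <- list(
  subgraph1 = factor(c("1.1", "1.1", "1.2", "1.2")),
  subgraph2 = factor(c(NA, NA, NA, NA)),   # every cell unmatched
  subgraph3 = factor(character(0))         # empty result
)

usable <- vapply(match_results, has_usable_matches, logical(1))

for (sg in names(match_results)) {
  if (!usable[[sg]]) {
    message("Skipping ", sg, ": no usable matches")
    next
  }
  # Downstream pair extraction for this subgraph would go here.
  message("Processing ", sg)
}
```

Skipping keeps the run going but silently drops the cells in that subgraph, so it is worth logging which subgraphs were skipped.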

Additionally, when I increase max_multimatch to 6, subgraph 8 also hits the same error, caused by fullmatch returning all NA.

Is the problem caused by running out of memory? (By the way, I run with 200 GB.) If I increase search_range, the system definitely crashes! So I would like to know the memory and CPU requirements for your runs.

I have already tried reducing the search_range parameter just as you suggested, but it still doesn't work~😭

Have you fixed that problem, bro?

Hi @vkartha, I have encountered the same issue. I first tried the search_range = 0.05 argument in pairCells, and it got to the third chunk (the RNA-seq and ATAC-seq samples were downsampled to ~5000 cells each), with about half of the cells processed in the first and second chunks. However, after reaching the third chunk, it stayed at 'Determing pairs through optimized bipartite matching' indefinitely. I stopped my session and used 0.01 instead, but this did not help, and the function stalled at the first chunk.

Thanks so much for looking at this and providing suggestions.