[Q] Why are pile-ups of interactions between a set of regions so time-consuming?
jiangshan529 opened this issue · 21 comments
Hi, I am trying to run pile-ups of interactions between a set of regions from a bed file. I am using 16 CPUs and 60 GB of memory; however, for thousands of peaks it has been running for 3 days and I still haven't gotten a result. Is there a way to improve the efficiency?
The code I am using is this:
coolpup.py aa.cool bb.bed --nshifts 10 --mindist 100000 --outname cc.txt --flank 30000 --n_proc 16 --clr_weight_name ""
Hi,
Sorry you're having trouble. Without more details it's hard to know. Are you getting any output at all? Thousands of peaks is certainly not too much and shouldn't take long. Which version are you using?
Hi, Elias. I am using 1.0.0. The cool file is at 5 kb resolution, and the bed file looks like this:
chr1 629947 629948
chr1 634029 634030
chr1 869978 869979
chr1 904778 904779
chr1 921225 921226
Another weird thing: when plotting local regions with this dataset, I get some unexpected diagonal lines (circled in red in the attached screenshot).
How should I deal with this? Thanks!
For the stalled command, try running it with --n_proc 1 and see if that gives you any errors. You can also try using --subset to use fewer regions, to see whether it's a speed issue or something else.
Regarding the lines in your plot, that's most likely coming from the data itself. So have a look at some regions you piled up in your cooler and see if you can spot anything weird.
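For example, a minimal test run based on your original command (cc_subset100.txt is just a placeholder output name):
coolpup.py aa.cool bb.bed --nshifts 10 --mindist 100000 --flank 30000 --n_proc 1 --subset 100 --clr_weight_name "" --outname cc_subset100.txt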
I am using 1 core and it has been running overnight, but I still didn't get a result.
coolpup.py --features_format bed aa.cool bb.bed --outname cc_250pad.clpy --flank 100000 --n_proc 1 --clr_weight_name "" plotpup.py --input_pups cc_250pad.clpy --not_symmetric --output dd_dot.pdf
It should never run for that long. In your command it looks like you are running coolpup.py and plotpup.py in the same script, which won't work (but maybe it's just a paste error). Otherwise, try the coolpup.py command but adding e.g. --subset 100 and see what happens. It should give results in minutes.
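If it wasn't just a paste error, those two steps should be separate invocations, with the same arguments you pasted:
coolpup.py --features_format bed aa.cool bb.bed --outname cc_250pad.clpy --flank 100000 --n_proc 1 --clr_weight_name ""
plotpup.py --input_pups cc_250pad.clpy --not_symmetric --output dd_dot.pdf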
Hi, I used --subset 100. Now it is very fast, but the result looks weird. I am piling up well-characterized Hi-C data on a CTCF-centered bed file, so there should be enrichment in the center. However, the heatmap is very noisy.
With a subset of 100 it's expected to be noisy. How many regions do you have in your dataset?
5600 peaks.
5600 should not take that long, totally reasonable size. Should be on the order of minutes, not days. Maybe try increasing subset gradually and see how it changes the time for you?
Hi, when I run --subset 1000, it takes 3 min; --subset 2000, 20 min; --subset 3000, 3 h. I am using 16 cores. It's so weird; do you have any idea how I can solve it?
Well, to be honest, you probably want to set --maxdist to something around 1 Mb if these are CTCF sites... That will speed it up a lot.
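For example, your original command with a distance cap added (1 Mb here; adjust as needed):
coolpup.py aa.cool bb.bed --nshifts 10 --mindist 100000 --maxdist 1000000 --flank 30000 --n_proc 16 --clr_weight_name "" --outname cc.txt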
Hi, I tried "--mindist 1000 --maxdist 1000000 --n_proc 8" on CTCF sites (60,000 peaks). It's been over one day and I still didn't get a result.
When I run with --local, it really does finish in minutes. However, when I run without --local, I cannot get a result within several days.
Can you try using expected instead of shifts? Or just 1 shift instead of 10?
And 60_000 peaks is a lot, so it would take time, but not days...
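For example, the capped command from above with a single shift:
coolpup.py aa.cool bb.bed --nshifts 1 --mindist 100000 --maxdist 1000000 --flank 30000 --n_proc 16 --clr_weight_name "" --outname cc.txt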
--maxdist 3000000 --nshifts 1 really works (16 cores take 9 h to compute 60,000 peaks)! And I think setting maxdist to 3 Mb makes sense; there's no need to compute pairs at all distances.
I think for such large sets it's better to normalize by expected (see the tutorial) instead of nshifts. Good that it works though!
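If I remember the flags right, that would look something like this, where expected.tsv is an expected table computed with cooltools and view.bed is the matching view of regions (both names are placeholders):
coolpup.py aa.cool bb.bed --expected expected.tsv --view view.bed --mindist 100000 --maxdist 1000000 --flank 30000 --n_proc 16 --outname cc.txt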
Hi, is the expected matrix calculated by cooltools? Like this: cooltools expected-cis --view hg38_arms.bed -p 2 -o test_expected_cis.tsv test.mcool::resolutions/5000
By the way, what is hg38_arms.bed?
https://cooltools.readthedocs.io/en/latest/notebooks/contacts_vs_distance.html#
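For reference, hg38_arms.bed is a "view" file listing chromosome arms (columns: chrom, start, end, name). The linked tutorial builds it with bioframe, roughly like this:

# Sketch following the linked cooltools tutorial: build a chromosome-arm
# view for hg38 and save it in the 4-column format used with --view.
import bioframe

chromsizes = bioframe.fetch_chromsizes("hg38")    # chromosome lengths (UCSC)
cens = bioframe.fetch_centromeres("hg38")         # centromere positions
arms = bioframe.make_chromarms(chromsizes, cens)  # split chromosomes into p/q arms

# Optionally keep only the chromosomes present in your cooler, then save.
arms.to_csv("hg38_arms.bed", sep="\t", header=False, index=False)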
Thanks for your prompt response!
Hope this is resolved now, feel free to reopen.