[Q] Why are pile-ups of interactions between a set of regions so time-consuming?
jiangshan529 opened this issue · 21 comments
Hi, I am trying to run pile-ups of interactions between a set of regions from a bed file. I am using 16 CPUs and 60 GB of memory; however, for thousands of peaks it has been running for 3 days and I still haven't gotten a result. Is there a way to improve the efficiency?
The code I am using is this:
coolpup.py aa.cool bb.bed --nshifts 10 --mindist 100000 --outname cc.txt --flank 30000 --n_proc 16 --clr_weight_name ""
Hi,
Sorry you're having trouble. Without more details it's hard to know. Are you getting any output at all? Thousands of peaks is certainly not too much and shouldn't take long. Which version are you using?
Hi, Elias. I am using 1.0.0. The cool file is at 5 kb resolution, and the bed file looks like this:
chr1 629947 629948
chr1 634029 634030
chr1 869978 869979
chr1 904778 904779
chr1 921225 921226
Another weird thing: when plotting local regions with this dataset, I get some unexpected diagonal lines (circled in red in the attached screenshot).
How should I deal with this? Thanks!
For the stalled command, try running it with --n_proc 1 and see if that gives you any errors. You can also try using --subset to use fewer regions, to see whether it's a speed issue or something else.
Regarding the lines in your plot, that's most likely coming from the data itself. So have a look at some regions you piled up in your cooler and see if you can spot anything weird.
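For example, a minimal test run based on your original command (cc_subset100.txt is just a placeholder output name):
coolpup.py aa.cool bb.bed --nshifts 10 --mindist 100000 --flank 30000 --n_proc 1 --subset 100 --clr_weight_name "" --outname cc_subset100.txt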
I am using 1 core and it has been running overnight, but I still didn't get a result.
coolpup.py --features_format bed aa.cool bb.bed --outname cc_250pad.clpy --flank 100000 --n_proc 1 --clr_weight_name "" plotpup.py --input_pups cc_250pad.clpy --not_symmetric --output dd_dot.pdf
It should never run for that long. In your command it looks like you are running coolpup.py and plotpup.py in the same script, which won't work (but maybe it's just a paste error). Otherwise, try the coolpup.py command but adding e.g. --subset 100 and see what happens. It should give results in minutes.
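If it wasn't just a paste error, those two steps should be separate invocations, with the same arguments you pasted:
coolpup.py --features_format bed aa.cool bb.bed --outname cc_250pad.clpy --flank 100000 --n_proc 1 --clr_weight_name ""
plotpup.py --input_pups cc_250pad.clpy --not_symmetric --output dd_dot.pdf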
Hi, I used --subset 100. Now it is very fast, but the result looks weird. I am piling up well-characterized Hi-C data on a CTCF-centered bed file, so there should be enrichment in the center. However, the heatmap is very noisy.
With a subset of 100 it's expected to be noisy. How many regions do you have in your dataset?
5600 peaks.
5600 should not take that long, totally reasonable size. Should be on the order of minutes, not days. Maybe try increasing subset gradually and see how it changes the time for you?
Hi, when I run --subset 1000, it takes 3 min; --subset 2000, 20 min; --subset 3000, 3 h. I am using 16 cores. It's so weird; do you have any idea how I can solve it?
Well, to be honest, you probably want to set --maxdist to something around 1 Mb if these are CTCF sites... That will speed it up a lot.
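For example, your original command with a distance cap added (1 Mb here; adjust as needed):
coolpup.py aa.cool bb.bed --nshifts 10 --mindist 100000 --maxdist 1000000 --flank 30000 --n_proc 16 --clr_weight_name "" --outname cc.txt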
Hi, I tried "--mindist 1000 --maxdist 1000000 --n_proc 8" on CTCF sites (60,000 peaks). It's been over one day and I still didn't get a result.
When I run with --local, it really does finish in minutes. However, when I run without --local, I cannot get a result within several days.
Can you try using expected instead of shifts? Or just 1 shift instead of 10?
And 60_000 peaks is a lot, so it would take time, but not days...
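For example, the capped command from above with a single shift:
coolpup.py aa.cool bb.bed --nshifts 1 --mindist 100000 --maxdist 1000000 --flank 30000 --n_proc 16 --clr_weight_name "" --outname cc.txt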
--maxdist 3000000 --nshifts 1 really works (16 cores take 9 h to compute 60,000 peaks)! And I think setting maxdist to 3 Mb makes sense; there's no need to compute pairs at all distances.
I think for such large sets it's better to normalize by expected (see the tutorial) instead of nshifts. Good that it works though!
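If I remember the flags right, that would look something like this, where expected.tsv is an expected table computed with cooltools and view.bed is the matching view of regions (both names are placeholders):
coolpup.py aa.cool bb.bed --expected expected.tsv --view view.bed --mindist 100000 --maxdist 1000000 --flank 30000 --n_proc 16 --outname cc.txt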
Hi, is the expected matrix calculated by cooltools? Like this: cooltools expected-cis --view hg38_arms.bed -p 2 -o test_expected_cis.tsv test.mcool::resolutions/5000
By the way, what is hg38_arms.bed?
https://cooltools.readthedocs.io/en/latest/notebooks/contacts_vs_distance.html#
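For reference, hg38_arms.bed is a "view" file listing chromosome arms (columns: chrom, start, end, name). The linked tutorial builds it with bioframe, roughly like this:

# Sketch following the linked cooltools tutorial: build a chromosome-arm
# view for hg38 and save it in the 4-column format used with --view.
import bioframe

chromsizes = bioframe.fetch_chromsizes("hg38")    # chromosome lengths (UCSC)
cens = bioframe.fetch_centromeres("hg38")         # centromere positions
arms = bioframe.make_chromarms(chromsizes, cens)  # split chromosomes into p/q arms

# Optionally keep only the chromosomes present in your cooler, then save.
arms.to_csv("hg38_arms.bed", sep="\t", header=False, index=False)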
Thanks for your prompt response!
Hope this is resolved now, feel free to reopen.