File locking blocks indefinitely in `writePileUps`
a-ludi opened this issue · 15 comments
Hi Arne,
[...] The job seems to be stuck in “dentist collect”. The file “workdir/pile-ups.db” was created but it is empty. The node it is running on shows 71G of memory used and 52G available.
Here are the last entries in the “collect.log” file.
{"thread":140737354013504,"timestamp":637255135504311334,"numPileUps":260,"numAlignmentChains":3186}
{"thread":140737354013504,"timestamp":637255135504323016,"state":"exit","function":"dentist.commands.collectPileUps.PileUpCollector.buildPileUps","timeElapsed":25742682}
{"thread":140737354013504,"timestamp":637255135504332559,"state":"enter","function":"dentist.commands.collectPileUps.PileUpCollector.writePileUps"}Do you have any ideas? Could there be a problem with the pipeline before it hit dentist collect? Watching it run is a thing of beauty!
Regards,
Randy
Originally posted by @BradleyRan in #3 (comment)
Hi Randy,
I don't have a clue so far. It looks like some kind of memory leak in the writer for the binary pile-ups format. Here are some questions to shed some light on the issue:
- Is the memory consumption rising over time?
- How big are the input `.las` files? `numAlignmentChains` is just 3186, so I guess it's rather small.
- Is it possible to share the input files? If yes, please drop them into my ownCloud so I can try reproducing the bug.
- Can you please share the full command that is being executed (you may remove sensitive names from the paths)? You can make snakemake report the commands by running it with the options `-np`; see the example below.
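For illustration, a dry run looks like this (a hedged example, not from the original thread; `-n` is snakemake's dry-run flag and `-p` prints the shell command of each job):

```sh
# dry run (-n) that prints each job's shell command (-p) without executing it
snakemake -np
```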
Sure, perfect!
Hey Randy,
got the files and they are OK. I will write to you once I have news.
So, in my setup it worked. It took 8:55 hours with 2 CPUs and consumed a max. RSS of 86.5 GB. Most of the time (6.7h) was spent reading the large alignment file. The routine is not very optimized, I have to admit.
Now that you know how many resources are required, can you try again? I ran the job with a 24h time limit and a max allowed RSS of 128G.
Hi Randy,
the write-protection is likely not the cause of your issue because it is applied by snakemake after dentist has finished successfully. You may double-check this by verifying whether dentist is still active.
I guess it might be file locking: dentist tries to lock files it reads or writes via `flockfile`. If an error occurs (like it does on our cluster because file locking is not implemented), it will just open the file without locking and continue. In contrast, it will just get stuck if the file lock cannot be acquired for some reason.
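To illustrate the failure mode, here is a minimal C sketch of the pattern described above (using POSIX `fcntl` byte-range locks rather than dentist's actual D implementation; the function name is made up):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

/* Hypothetical sketch: blocking lock with an error fallback.
 * - If the lock call fails outright (e.g. ENOLCK on a file system
 *   that does not implement locking), we continue without a lock.
 * - If another process already holds the lock, F_SETLKW blocks
 *   until it is released -- indefinitely, if that never happens. */
void lock_or_fall_back(int fd)
{
    struct flock fl = {
        .l_type   = F_WRLCK,  /* exclusive lock for writing */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,        /* 0 = lock the whole file */
    };

    if (fcntl(fd, F_SETLKW, &fl) == -1)
        /* hard error: log it and carry on without the lock */
        fprintf(stderr, "locking failed (errno %d); continuing unlocked\n", errno);
}
```

The key point is that a hard error from the lock call can be caught and worked around, while a lock that is merely held by someone else makes the blocking call wait forever.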
I created another version of dentist that allows skipping the locking step entirely. I hope it just works because I did not test it at all.
Sorry, I forgot to mention: `SKIP_FILE_LOCKING=1` should do the trick.
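For example (untested, and assuming the variable only needs to be visible in the environment of the dentist processes; on a cluster you may have to export it inside the job script):

```sh
export SKIP_FILE_LOCKING=1   # hypothetical usage: skip dentist's file locking entirely
snakemake                    # then re-run the workflow as before
```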