desihub/nightwatch

investigate if/how multiprocessing.Pool is blocking on desi-7

sybenzvi opened this issue · 2 comments

At the start of May, following a suggestion by @daqiii, we allowed the Nightwatch monitor to use 30 processes on desi-7 in an attempt to speed up the processing of exposures. It worked, but we noticed that many more exposures now take >10 minutes to process.

[Figure: nw_desi7_timeseries]

@jose-bermejo investigated and found that 75% of these delays occurred in the first zero of the night, which suggests the call to multiprocessing.Pool.map in nightwatch/run.py may have trouble allocating the 30 processes at startup and block. If that is right, there may be a couple of solutions:
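To check the hypothesis before changing anything, one option is to time pool creation separately from the `map` call. The following is only a minimal sketch, not the code in nightwatch/run.py: the helper name `time_pool_phases` and its `pool_factory` parameter are hypothetical and introduced here for illustration.

```python
import time
from multiprocessing import Pool


def time_pool_phases(func, items, nproc, pool_factory=Pool):
    """Time worker startup and the map call separately.

    Hypothetical helper: returns (startup_seconds, map_seconds, results).
    pool_factory is any callable with the Pool(processes) interface.
    """
    t0 = time.perf_counter()
    pool = pool_factory(nproc)           # worker processes are started here
    t1 = time.perf_counter()
    try:
        results = pool.map(func, items)  # blocks until every item is done
    finally:
        pool.close()                     # no more tasks will be submitted
        pool.join()                      # wait for the workers to exit
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, results
```

Calling this with the real worker function and `nproc=30` for the first exposure of a night, and again for a later one, would show whether the extra time is spent starting workers or inside `map` itself. The worker function must be defined at module level so it can be pickled.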

Thanks for this detailed study, including paying attention to the bad-side outliers and not just the improvement to the best-case performance. If the red line in the plot indicates when the 15 -> 30 processes change was made, it appears that the best-case performance was improved by some other change during the end-of-April outage, rather than by increasing the process count. In other words, consider whether additional development to use 30 processes on 24 physical cores is worth the human time. Also consider switching back to 15 processes for a few days to see whether the median- and best-case timing moves higher again, or whether we are just chasing the wrong cause.

FTR, an original motivation for 15 processes was to also leave some cores available for non-nightwatch work (like the webserver or interactive work); that may be less of an issue nowadays.
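For the 15 vs. 30 process comparison, the same three numbers could be computed for each period: best case, median, and the fraction of exposures over 10 minutes. This is a sketch only; the function name `summarize` and the input format (a list of per-exposure processing times in seconds) are assumptions, not something Nightwatch provides.

```python
from statistics import median


def summarize(times_s, slow_threshold_s=600.0):
    """Summarize per-exposure processing times given in seconds.

    Hypothetical helper: the 600 s default matches the >10 minute
    outliers discussed in this issue.
    """
    if not times_s:
        raise ValueError("no processing times given")
    slow = sum(1 for t in times_s if t > slow_threshold_s)
    return {
        "best": min(times_s),
        "median": median(times_s),
        "slow_fraction": slow / len(times_s),
    }
```

Running it once on the nights with 15 processes and once on the nights with 30 would separate a shift in the median from a change in the outlier rate.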

Commenting further on this issue: the move to desi-8 in July 2023 and our ability to run on 30 cores at once has cured most of the slowdowns.

However, @jose-bermejo discovered that the first exposure of each night tends to take a significant amount of time to process. The bottleneck appears to occur when Nightwatch regenerates the exposure table by globbing the full list of previously processed exposures in the filesystem. Since there are hundreds of thousands of exposure folders, this is extremely slow, sometimes taking more than 10 minutes to complete. We will close this ticket and open a new one to address that specific issue.
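As a starting point for the new ticket, one possible direction is to scan only the current night's directory instead of globbing every exposure folder. The sketch below is not Nightwatch code: it assumes a `basedir/NIGHT/EXPID/` layout with all-digit exposure folder names, and the function names `list_exposures` and `new_exposures` are hypothetical.

```python
import os


def list_exposures(night_dir):
    """Return sorted exposure folder names under one night directory.

    Assumes exposure folders have all-digit names; other entries are skipped.
    """
    try:
        with os.scandir(night_dir) as entries:
            return sorted(
                e.name for e in entries if e.is_dir() and e.name.isdigit()
            )
    except FileNotFoundError:
        return []  # night directory not created yet


def new_exposures(basedir, night, known):
    """Return exposures on disk for `night` that are not in the `known` set."""
    found = list_exposures(os.path.join(basedir, str(night)))
    return [name for name in found if name not in known]
```

The cost of this scan depends on the number of exposures in a single night rather than on the full archive, at the price of keeping the `known` set (for example, the existing exposure table) up to date between runs.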