bomb out with lots of complaints if I/O worker dies
If the I/O worker dies, this is a little hard for the end user to diagnose, as the solver workers carry on and fill up the log with messages. The error message is then buried somewhere mid-log and the whole process hangs waiting on I/O, instead of exiting with an error.
Surely a subprocess error is catchable at the main process level. #319 is related.
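Something like the following is the kind of behaviour I have in mind (a minimal sketch around concurrent.futures, not the actual CubiCal worker loop; the names and structure are illustrative):

import sys
from concurrent.futures import ProcessPoolExecutor, wait, FIRST_EXCEPTION
from concurrent.futures.process import BrokenProcessPool

def run_all(jobs, nworkers=4):
    # Illustrative only: fail fast in the main process when any worker
    # (including a hypothetical I/O worker) dies, instead of letting the
    # remaining futures keep filling the log.
    with ProcessPoolExecutor(max_workers=nworkers) as pool:
        futures = [pool.submit(job) for job in jobs]
        # Return as soon as any future raises, rather than waiting for all.
        done, pending = wait(futures, return_when=FIRST_EXCEPTION)
        try:
            for f in done:
                f.result()  # re-raises any worker exception here
        except BrokenProcessPool:
            for f in pending:
                f.cancel()
            sys.exit("FATAL: a worker process died unexpectedly; aborting.")

In other words: as soon as any future fails, cancel the rest and exit loudly, rather than leaving the solvers to carry on against a dead I/O process.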
@o-smirnov Any fix/workaround for this yet? It's gotten me twice this weekend. I tried reducing --dist-ncpu and --dist-min-chunks from 7 to 4, to no avail.
INFO 19:42:07 - main [4.0/85.0 18.2/131.8 247.6Gb] Exiting with exception: BrokenProcessPool(A process in the process pool was terminated abruptly while the future was running or pending.)
Traceback (most recent call last):
File "/home/CubiCal/cubical/main.py", line 582, in main
stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts, out_opts)
File "/home/CubiCal/cubical/workers.py", line 226, in run_process_loop
return _run_multi_process_loop(ms, load_model, solver_type, solver_opts, debug_opts, out_opts)
File "/home/CubiCal/cubical/workers.py", line 312, in _run_multi_process_loop
stats = future.result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I'm running into this BrokenProcessPool error with an oom-kill notice at the end of the log file - taking that to mean the system thinks I'll run out of RAM at some point, so it kills the job. What I can't understand is that earlier in the log, when it's calculating all the memory requirements, it says my max memory requirement will be ~57 GB - the system I'm running on has a max of 62 GB available, so I don't know why things are being killed.
I'm using --data-freq-chunk=256 (reduced down from 1024), --data-time-chunk=36, --dist-max-chunks=2, and ncpus=20 (the max available on the node). What other memory-related knobs can I twiddle to try to solve this? It's only 2 hours of data, but I'm running into the same issue with even smaller MSs as well.
The memory estimation is just that - a guess based on some empirical experiments I did. So take it with a pinch of salt. If it is an option, I would really suggest taking a look at QuartiCal. It is much less memory hungry, and has fewer knobs to boot. I am only too happy to help you out on that front.
That said, could you please post your log and config. That will help identify what is going wrong.
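In the meantime, for intuition, here is a rough back-of-envelope of how the chunking knobs scale the raw visibility volume. This is not CubiCal's actual estimator: the antenna count and the copies factor below are assumptions, and solver workspaces, flags, weights and general Python overhead all come on top, which is part of why the built-in estimate can be optimistic.

# Rough back-of-envelope only -- NOT CubiCal's actual memory estimator.
# Shows how the chunking knobs scale the raw visibility volume; the
# antenna count and the "copies" factor are assumptions.
n_ant   = 64                        # assumed MeerKAT-like array
n_bl    = n_ant * (n_ant - 1) // 2  # baselines
n_time  = 36                        # --data-time-chunk
n_freq  = 256                       # --data-freq-chunk
n_corr  = 4                         # correlations
bytes_per_vis = 8                   # complex64
copies  = 4                         # data + model + residual + scratch (a guess)

per_chunk_gb = n_time * n_freq * n_bl * n_corr * bytes_per_vis * copies / 1e9
concurrent   = 2                    # --dist-max-chunks
print(f"~{per_chunk_gb:.1f} GB per chunk, ~{per_chunk_gb * concurrent:.1f} GB "
      f"for {concurrent} concurrent chunks (visibilities only)")

With the settings you quote this comes out at only a few GB for the visibilities themselves, which suggests most of the real footprint (and the shortfall against the ~57 GB estimate) lives in the per-worker solver machinery and buffers rather than the raw data volume.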
@JSKenyon running it as part of oxkat - guess we can have a chat about incorporating QuartiCal on an ad hoc basis. I'll take a look at it. But for now, here's the log and the parset
CL2GC912_cubical.zip
and the code run was
gocubical /data/knowles/mkatot/reruns/data/cubical/2GC_delaycal.parset --data-ms=1563148862_sdp_l0_1024ch_J0046.4-3912.ms --out-dir /data/knowles/mkatot/reruns/GAINTABLES/delaycal_J0046.4-3912_2022-03-01-10-17-13.cc/ --out-name delaycal_J0046.4-3912_2022-03-01-10-17-13 --k-save-to delaycal_J0046.4-3912.parmdb --data-freq-chunk=256
OK, in this instance I suspect it is just the fact that the memory footprint is underestimated. I think the easiest solution is to set --dist-ncpu=3. Simply put, the memory footprint of each worker is just too large to use all the cores (or even 5 solvers + 1 for I/O, as in the log you sent). This is unfortunate and will make things slower. On a positive note, hopefully people will start onboarding QuartiCal, which does much better in this regard. Apologies for not having a better solution for you.
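For concreteness (this is my reading of the worker scheme in your log, so treat the numbers as illustrative): --dist-ncpu=3 leaves 2 solver processes plus the I/O process, so the peak usage should be roughly two or three worker footprints instead of the 5 + 1 your log shows. In practice it is just your gocubical call above with --dist-ncpu=3 appended.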
Ok thanks, I'll give that a go.
The oxkat defaults are tuned so they work on standard worker nodes at IDIA and CHPC for standard MeerKAT continuum processing (assuming 1024 channel data). The settings should actually leave a fair bit of overhead to account for things like differing numbers of antennas, and the slurm / PBS controllers being quite trigger happy when jobs step out of line in terms of memory usage. But if you have a node with 64 GB of RAM then the defaults will certainly be too ambitious.
Is this running on hippo?
Also, I'm not sure whether moving from a single solution for the entire band (--data-freq-chunk=1024) to four solutions across the band (--data-freq-chunk=256) will reduce the quality of your delay solutions, particularly for those quarter-band chunks that have high RFI occupancy. You might want to check whether reverting to a 1024-channel solution gives better results. You could drop --dist-ncpu further, and/or reduce --data-time-chunk, to accommodate this. Note that the latter is 36 by default, but that encompasses 9 individual solution intervals (--k-time-int 4).
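For example (illustrative numbers only, not a tested configuration): reverting to --data-freq-chunk=1024 while dropping --data-time-chunk to 8 keeps each chunk roughly the same size as your current 256 x 36 setup (1024 x 8 = 8192 versus 256 x 36 = 9216 samples per baseline and correlation), and 8 is still a whole multiple of --k-time-int 4, giving 2 solution intervals per chunk.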
Cheers.
PS: @JSKenyon swapping to QuartiCal remains on my to-do list!