bomb out with lots of complaints if I/O worker dies
If the I/O worker dies, this is a little hard for the end user to diagnose, as the solver workers carry on and fill up the log with messages. The error message is then buried somewhere mid-log and the whole process hangs waiting on I/O, instead of exiting with an error.
Surely a subprocess error is catchable at the main process level. #319 is related.
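Something like the following is the kind of behaviour I have in mind (a minimal sketch around concurrent.futures, not the actual CubiCal worker loop; the names and structure are illustrative):

import sys
from concurrent.futures import ProcessPoolExecutor, wait, FIRST_EXCEPTION
from concurrent.futures.process import BrokenProcessPool

def run_all(jobs, nworkers=4):
    # Illustrative only: fail fast in the main process when any worker
    # (including a hypothetical I/O worker) dies, instead of letting the
    # remaining futures keep filling the log.
    with ProcessPoolExecutor(max_workers=nworkers) as pool:
        futures = [pool.submit(job) for job in jobs]
        # Return as soon as any future raises, rather than waiting for all.
        done, pending = wait(futures, return_when=FIRST_EXCEPTION)
        try:
            for f in done:
                f.result()  # re-raises any worker exception here
        except BrokenProcessPool:
            for f in pending:
                f.cancel()
            sys.exit("FATAL: a worker process died unexpectedly; aborting.")

In other words: as soon as any future fails, cancel the rest and exit loudly, rather than leaving the solvers to carry on against a dead I/O process.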
@o-smirnov Any fix/workaround for this yet? It's gotten me twice this weekend. I tried reducing --dist-ncpu and --dist-min-chunks from 7 to 4, to no avail.
INFO 19:42:07 - main [4.0/85.0 18.2/131.8 247.6Gb] Exiting with exception: BrokenProcessPool(A process in the process pool was terminated abruptly while the future was running or pending.)
Traceback (most recent call last):
File "/home/CubiCal/cubical/main.py", line 582, in main
stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts, out_opts)
File "/home/CubiCal/cubical/workers.py", line 226, in run_process_loop
return _run_multi_process_loop(ms, load_model, solver_type, solver_opts, debug_opts, out_opts)
File "/home/CubiCal/cubical/workers.py", line 312, in _run_multi_process_loop
stats = future.result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I'm running into this BrokenProcessPool error with an oom-kill notice at the end of the log file - taking that to mean the system thinks I'll run out of RAM at some point, so it kills the job. What I can't understand is that earlier in the log, when it's calculating all the memory requirements, it says my max memory requirement will be ~57 GB - the system I'm running on has a max of 62 GB available, so I don't know why things are being killed.
I'm using --data-freq-chunk=256 (reduced down from 1024), --data-time-chunk=36, --dist-max-chunks=2, and ncpus=20 (the max available on the node). What other memory-related knobs can I twiddle to try to solve this? It's only 2 hours of data, but I'm running into the same issue with even smaller MSs as well.
The memory estimation is just that - a guess based on some empirical experiments I did. So take it with a pinch of salt. If it is an option, I would really suggest taking a look at QuartiCal. It is much less memory hungry, and has fewer knobs to boot. I am only too happy to help you out on that front.
That said, could you please post your log and config. That will help identify what is going wrong.
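In the meantime, for intuition, here is a rough back-of-envelope of how the chunking knobs scale the raw visibility volume. This is not CubiCal's actual estimator: the antenna count and the copies factor below are assumptions, and solver workspaces, flags, weights and general Python overhead all come on top, which is part of why the built-in estimate can be optimistic.

# Rough back-of-envelope only -- NOT CubiCal's actual memory estimator.
# Shows how the chunking knobs scale the raw visibility volume; the
# antenna count and the "copies" factor are assumptions.
n_ant   = 64                        # assumed MeerKAT-like array
n_bl    = n_ant * (n_ant - 1) // 2  # baselines
n_time  = 36                        # --data-time-chunk
n_freq  = 256                       # --data-freq-chunk
n_corr  = 4                         # correlations
bytes_per_vis = 8                   # complex64
copies  = 4                         # data + model + residual + scratch (a guess)

per_chunk_gb = n_time * n_freq * n_bl * n_corr * bytes_per_vis * copies / 1e9
concurrent   = 2                    # --dist-max-chunks
print(f"~{per_chunk_gb:.1f} GB per chunk, ~{per_chunk_gb * concurrent:.1f} GB "
      f"for {concurrent} concurrent chunks (visibilities only)")

With the settings you quote this comes out at only a few GB for the visibilities themselves, which suggests most of the real footprint (and the shortfall against the ~57 GB estimate) lives in the per-worker solver machinery and buffers rather than the raw data volume.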
@JSKenyon running it as part of oxkat - guess we can have a chat about incorporating QuartiCal on an ad hoc basis. I'll take a look at it. But for now, here's the log and the parset
CL2GC912_cubical.zip
and the code run was
gocubical /data/knowles/mkatot/reruns/data/cubical/2GC_delaycal.parset --data-ms=1563148862_sdp_l0_1024ch_J0046.4-3912.ms --out-dir /data/knowles/mkatot/reruns/GAINTABLES/delaycal_J0046.4-3912_2022-03-01-10-17-13.cc/ --out-name delaycal_J0046.4-3912_2022-03-01-10-17-13 --k-save-to delaycal_J0046.4-3912.parmdb --data-freq-chunk=256
OK, in this instance I suspect it is just the fact that the memory footprint is underestimated. I think the easiest solution is to set --dist-ncpu=3. Simply put, the memory footprint of each worker is just too large to use all the cores (or even 5 solvers + 1 for I/O, as in the log you sent). This is unfortunate and will make things slower. On a positive note, hopefully people will start onboarding QuartiCal, which does much better in this regard. Apologies for not having a better solution for you.
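For concreteness (this is my reading of the worker scheme in your log, so treat the numbers as illustrative): --dist-ncpu=3 leaves 2 solver processes plus the I/O process, so the peak usage should be roughly two or three worker footprints instead of the 5 + 1 your log shows. In practice it is just your gocubical call above with --dist-ncpu=3 appended.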
Ok thanks, I'll give that a go.
The oxkat defaults are tuned so they work on standard worker nodes at IDIA and CHPC for standard MeerKAT continuum processing (assuming 1024 channel data). The settings should actually leave a fair bit of overhead to account for things like differing numbers of antennas, and the slurm / PBS controllers being quite trigger happy when jobs step out of line in terms of memory usage. But if you have a node with 64 GB of RAM then the defaults will certainly be too ambitious.
Is this running on hippo?
Also, I'm not sure whether moving from a single solution for the entire band (--data-freq-chunk=1024) to four solutions across the band (--data-freq-chunk=256) will reduce the quality of your delay solutions, particularly for those quarter-band chunks that have high RFI occupancy. You might want to check whether reverting to a 1024-channel solution gives better results. You could drop --dist-ncpu further, and/or reduce --data-time-chunk, to accommodate this. Note that the latter is 36 by default, but that encompasses 9 individual solution intervals (--k-time-int 4).
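For example (illustrative numbers only, not a tested configuration): reverting to --data-freq-chunk=1024 while dropping --data-time-chunk to 8 keeps each chunk roughly the same size as your current 256 x 36 setup (1024 x 8 = 8192 versus 256 x 36 = 9216 samples per baseline and correlation), and 8 is still a whole multiple of --k-time-int 4, giving 2 solution intervals per chunk.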
Cheers.
PS: @JSKenyon swapping to QuartiCal remains on my to-do list!