ratt-ru/QuartiCal

Segmentation fault (core dumped)

Closed this issue · 5 comments

Hi,

I'm in the process of switching from CubiCal (CC) to QuartiCal (QC). Right now, I'm trying to reproduce a past CC self-calibration to check that I'm getting similar image fidelity/improvement.

I'm getting repeated segmentation faults that kill my QC runs. Oddly, these seem to be stochastic; i.e., sometimes the command executes fully, but most of the time it crashes and kills the script. I wouldn't expect QC to have problems with this setup, as I could run the equivalent in CC (on the same machine) without issues (the only difference is that I've changed the f-slope solver to delay_and_offset).

I'm not exactly sure what information/documents would help debug this issue, but if you let me know what you need to reproduce the error, I can provide it.

(Virtual) Machine specs:
64 GB, 8 cores

Data:
S-band VLA data (i.e., 2 x 8-SPW basebands, each with 512 channels of 2 MHz)

Hi @AKHughes1994! Sorry that you seem to have run into a bug - if it seems stochastic, it may be thread-safety related. Could you please share both your log file and your QuartiCal config file/command line?

Hi @JSKenyon, I've attached the log file and the .yaml file.

The command I run is:

goquartical ../quartical_parsets/DI_bb.yaml input_ms.path=ms.ms input_ms.select_ddids=[8,9,10,11,12,13,14,15] input_ms.freq_chunk=512 K.freq_interval=512

DI_bb.txt
20230829_194411.log.qc.txt

OK, I can reproduce this on an arbitrary dataset, which suggests it is a bug in the code and not some peculiarity of the data. I will drill down and find it.

I believe I have found the problem - could you please unset output.subtract_directions? Please let me know if that works for you, as it seems to resolve the segfaults (caused by an out-of-bounds access) for me.

Thanks for the bug report - I will put in a check to ensure this doesn't trouble anyone else.

Edit: Just to clarify, the problem is that the corrected residual code is attempting to subtract direction 1, which doesn't actually exist in this case. This leads to an out-of-bounds access, which may or may not cause a segfault. The solution is to check that all values in output.subtract_directions correspond to real directions. This can be done in the dask layer of the residual computation.
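
For illustration only, the kind of check described above could look roughly like the sketch below. This is not QuartiCal's actual code: the function name validate_subtract_directions and the n_dir argument (the number of directions actually present in the model) are hypothetical, and rejecting negative indices is an assumption made here for safety.

```python
from typing import Iterable, List


def validate_subtract_directions(
    subtract_directions: Iterable[int], n_dir: int
) -> List[int]:
    """Return the requested directions as a list, or raise if any is invalid.

    Hypothetical helper: `n_dir` is assumed to be the number of directions
    that really exist in the model (1 for a direction-independent solve).
    """
    requested = list(subtract_directions)

    # A direction is valid only if it indexes an existing direction.
    # Negative values are treated as invalid here (an assumption).
    invalid = [d for d in requested if not 0 <= d < n_dir]

    if invalid:
        raise ValueError(
            f"output.subtract_directions contains {invalid}, but only "
            f"directions 0 to {n_dir - 1} exist in the model."
        )

    return requested
```

With a single direction, calling this with [0] would pass, while [1] (the situation in this issue) would raise a ValueError up front instead of reaching the compiled residual code and reading out of bounds.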

Ahhhhh!

I converted a DD (direction-dependent) yaml file into a DI (direction-independent) one and absolutely should have caught that issue. Apologies.

Thanks for finding it,
Andrew