artis-mcrt/artis

GSL integration error handling

Closed this issue · 6 comments

Hi everyone,

Unfortunately I've hit some issues with the GSL integration error handling. I was running the simulation as Luke suggested, on the nebular branch with the ICs up to date as of last Thursday. I believe we were intending to get to the 35th timestep, but it crashed after the 27th because of a GSL error. The error message I get is:

gsl: qag.c:248: ERROR: roundoff error prevents tolerance from being achieved
Default GSL error handler invoked.

sn3d:381335 terminated with signal 6 at PC=2b3b599df207 SP=7ffd412491d8. Backtrace:
/lib64/libc.so.6(gsignal+0x37)[0x2b3b599df207]
/lib64/libc.so.6(abort+0x148)[0x2b3b599e08f8]
/usr/local/Cluster-Apps/gsl/2.4/lib/libgsl.so.23(+0x815ce)[0x2b3b5538f5ce]
/usr/local/Cluster-Apps/gsl/2.4/lib/libgsl.so.23(gsl_integration_qag+0xb8c)[0x2b3b553abddc]
sn3d[0x41fb60]
sn3d[0x456b8f]
sn3d[0x40477e]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b3b599cb3d5]
sn3d[0x402c69]

I see that some invocations of gsl_integration_qag are surrounded by error handlers, while others aren't. I guess this tripped one of the ones that isn't, and the default handler caused an abort. I'm not sure whether any of you have seen this crash before. From previous correspondence with Luke, I assume it is acceptable to use an error handling approach similar to the one near line 1239 of ratecoeff.c. Is this the case, and if so, should I change the other error handlers?
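One way to list the invocations that lack a handler is a quick scan of the C sources. The sketch below is a heuristic written in Python and is not part of ARTIS: it flags each `gsl_integration_qag(` call that has no `gsl_set_error_handler_off()` or `gsl_set_error_handler()` call in the few lines before it. The function name `unguarded_qag_calls` and the default `lookback` window of 10 lines are assumptions made for illustration, and a handler switched off further away (for example in a calling function) would be missed.

```python
import re

# Matches gsl_integration_qag( but not variants such as gsl_integration_qags(.
CALL = re.compile(r"\bgsl_integration_qag\s*\(")
# Matches either gsl_set_error_handler_off( or gsl_set_error_handler(.
GUARD = re.compile(r"\bgsl_set_error_handler(_off)?\s*\(")


def unguarded_qag_calls(source, lookback=10):
    """Return 1-based line numbers of gsl_integration_qag calls that have
    no error-handler call in the preceding `lookback` lines (heuristic)."""
    lines = source.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if not CALL.search(line):
            continue
        window = lines[max(0, i - lookback):i]
        if not any(GUARD.search(prev) for prev in window):
            hits.append(i + 1)
    return hits


# Usage sketch (the file name is taken from the comment above):
#   with open("ratecoeff.c") as f:
#       print(unguarded_qag_calls(f.read()))
```

Any line numbers it reports are only candidates to inspect by hand, since the scan knows nothing about control flow.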

ssim commented

Hello!

@lukeshingles have you seen this particular one before? Maybe there's an issue with statistics or something here?

Can we tell exactly where in the cycle this is happening - what was the tail of the output?.txt file when it happened?

It looks like nne was going infinite - at least on my own test runs. I believe the same problem came up with Christine's runs at early times?
I'm modifying the input files until I can get a working simulation on raijin.

Sorry, I didn't see Stuart's comment this morning. Maybe GitHub was acting up for some reason? I did get an email with Luke's update though.

Here are the last lines from each of the output files if it helps. Is this likely to be something to do with the input files?

tail -n 1 output_0-0.txt
corrphotoioncoeff gsl integrator warning 18. modelgridindex 62 Z=28 ionstage 2 lower 231 phixstargetindex 2 gamma 1.35845e+06 error 329.377
tail -n 1 output_10-0.txt
corrphotoioncoeff gsl integrator warning 18. modelgridindex 59 Z=28 ionstage 2 lower 121 phixstargetindex 1 gamma 1.33617e+07 error 3446.32
tail -n 1 output_1-0.txt
corrphotoioncoeff gsl integrator warning 18. modelgridindex 63 Z=28 ionstage 2 lower 322 phixstargetindex 1 gamma 1.26663e+07 error 3072.2
tail -n 1 output_11-0.txt
corrphotoioncoeff gsl integrator warning 18. modelgridindex 52 Z=28 ionstage 2 lower 429 phixstargetindex 2 gamma 7.39784e+07 error 16298.2
tail -n 1 output_12-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_13-0.txt
NTLEPTON packet selected in cell 40 excitation of Z=28 ionstage 3 level 0 upperlevel 64
tail -n 1 output_14-0.txt
NTLEPTON packet selected in cell 54 excitation of Z=28 ionstage 3 level 1 upperlevel 85
tail -n 1 output_15-0.txt
NTLEPTON packet selected in cell 37 excitation of Z=28 ionstage 3 level 2 upperlevel 113
tail -n 1 output_16-0.txt
NTLEPTON packet selected in cell 45 excitation of Z=28 ionstage 3 level 1 upperlevel 109
tail -n 1 output_17-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_18-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_19-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_20-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_2-0.txt
corrphotoioncoeff gsl integrator warning 18. modelgridindex 66 Z=28 ionstage 2 lower 177 phixstargetindex 2 gamma 322850 error 200.29
tail -n 1 output_21-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_22-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_23-0.txt
NTLEPTON packet selected in cell 47 excitation of Z=28 ionstage 3 level 1 upperlevel 109
tail -n 1 output_24-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_25-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_26-0.txt
NTLEPTON packet selected in cell 48 excitation of Z=28 ionstage 3 level 0 upperlevel 104
tail -n 1 output_27-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_28-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_29-0.txt
NTLEPTON packet selected in cell 46 excitation of Z=28 ionstage 3 level 1 upperlevel 109
tail -n 1 output_30-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_3-0.txt
corrphotoioncoeff gsl integrator warning 18. modelgridindex 59 Z=28 ionstage 2 lower 130 phixstargetindex 2 gamma 9.59045e+06 error 2320.48
tail -n 1 output_31-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_32-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_33-0.txt
NTLEPTON packet selected in cell 57 excitation of Z=28 ionstage 3 level 1 upperlevel 109
tail -n 1 output_34-0.txt
NTLEPTON packet selected in cell 32 excitation of Z=28 ionstage 3 level 3 upperlevel 123
tail -n 1 output_35-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_36-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_37-0.txt
NTLEPTON packet selected in cell 60 excitation of Z=28 ionstage 3 level 0 upperlevel 84
tail -n 1 output_38-0.txt
NTLEPTON packet selected in cell 52 excitation of Z=28 ionstage 3 level 2 upperlevel 91
tail -n 1 output_39-0.txt
NTLEPTON packet selected in cell 40 excitation of Z=28 ionstage 3 level 1 upperlevel 188
tail -n 1 output_40-0.txt
NTLEPTON packet selected in cell 33 excitation of Z=28 ionstage 3 level 1 upperlevel 109
tail -n 1 output_4-0.txt
corrphotoioncoeff gsl integrator warning 18. modelgridindex 60 Z=28 ionstage 2 lower 116 phixstargetindex 1 gamma 5.90903e+06 error 1375.97
tail -n 1 output_41-0.txt
NTLEPTON packet selected in cell 51 excitation of Z=28 ionstage 3 level 2 upperlevel 63
tail -n 1 output_42-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_43-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_44-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_45-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_46-0.txt
NTLEPTON packet selected in cell 33 excitation of Z=28 ionstage 3 level 0 upperlevel 59
tail -n 1 output_47-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_48-0.txt
NTLEPTON packet selected in cell 40 excitation of Z=28 ionstage 3 level 1 upperlevel 59
tail -n 1 output_49-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_50-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_5-0.txt
corrphotoioncoeff gsl integrator warning 18. modelgridindex 66 Z=28 ionstage 2 lower 118 phixstargetindex 0 gamma 5.0732e+06 error 1203.71
tail -n 1 output_51-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_52-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_53-0.txt
NTLEPTON packet selected in cell 49 excitation of Z=28 ionstage 3 level 1 upperlevel 112
tail -n 1 output_54-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_55-0.txt
NTLEPTON packet selected in cell 39 excitation of Z=28 ionstage 3 level 3 upperlevel 100
tail -n 1 output_56-0.txt
NTLEPTON packet selected in cell 28 excitation of Z=28 ionstage 3 level 0 upperlevel 103
tail -n 1 output_57-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_58-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_59-0.txt
NTLEPTON packet selected in cell 42 excitation of Z=28 ionstage 3 level 0 upperlevel 79
tail -n 1 output_60-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_6-0.txt
corrphotoioncoeff gsl integrator warning 18. modelgridindex 66 Z=28 ionstage 2 lower 118 phixstargetindex 0 gamma 5.0732e+06 error 1203.71
tail -n 1 output_61-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_62-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_63-0.txt
[debug] update_packets: updating packet 0 for timestep 27 at time 1557478784...
tail -n 1 output_7-0.txt
corrphotoioncoeff gsl integrator warning 18. modelgridindex 56 Z=28 ionstage 2 lower 138 phixstargetindex 2 gamma 8.08228e+06 error 2942.88
tail -n 1 output_8-0.txt
corrphotoioncoeff gsl integrator warning 18. modelgridindex 62 Z=28 ionstage 2 lower 211 phixstargetindex 1 gamma 2.38586e+06 error 645.065
tail -n 1 output_9-0.txt
corrphotoioncoeff gsl integrator warning 18. modelgridindex 56 Z=28 ionstage 2 lower 231 phixstargetindex 2 gamma 1.38745e+06 error 362.591
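With this many ranks, the tails are easier to compare once they are grouped by kind. The following is a minimal Python sketch that labels each last line by the three prefixes visible in the listing above and counts them. The helper names `classify_tail` and `tally_tails` are hypothetical, and any line with a different prefix is simply reported as "other".

```python
from collections import Counter


def classify_tail(line):
    """Label the last log line of one rank by its prefix."""
    if line.startswith("corrphotoioncoeff gsl integrator warning"):
        return "integrator warning"
    if line.startswith("NTLEPTON packet selected"):
        return "NTLEPTON diagnostic"
    if line.startswith("[debug] update_packets"):
        return "packet update"
    return "other"


def tally_tails(last_lines):
    """Count how many ranks ended on each kind of line."""
    return Counter(classify_tail(line.strip()) for line in last_lines)


# Usage sketch (assumes the output_*-0.txt files are in the current directory):
#   from pathlib import Path
#   tails = [p.read_text().splitlines()[-1] for p in Path(".").glob("output_*-0.txt")]
#   print(tally_tails(tails))
```

A rank that ends on "other" would be the first place to look for a message that differs from the rest.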

ssim commented

@aborissov Thanks - looking through these tails it doesn't look like there's anything wrong there. The warnings here are not really an issue - this is just rounding error in an integration, which shouldn't cause a crash. Many of the threads have just terminated at the end of a packet update (which is fine), and some have just given diagnostic information on NTLEPTON status - which is also normal.
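To put a number on the rounding error, the warning lines can be reduced to a relative error. The warning code 18 is the value of GSL_EROUND in gsl_errno.h, the same roundoff condition named in the crash message. The Python sketch below assumes that `error` in each line is the integrator's absolute error estimate for `gamma`; that reading of the log format is an assumption, and `relative_error` is a hypothetical helper name.

```python
import re

# Pulls the warning code, gamma and error fields out of one warning line.
WARNING = re.compile(
    r"gsl integrator warning (?P<code>\d+)\. .*?"
    r"gamma (?P<gamma>\S+) error (?P<error>\S+)"
)


def relative_error(line):
    """Return error / gamma for a warning line, or None if it does not match."""
    m = WARNING.search(line)
    if m is None:
        return None
    return float(m["error"]) / float(m["gamma"])


line = (
    "corrphotoioncoeff gsl integrator warning 18. modelgridindex 62 Z=28 "
    "ionstage 2 lower 231 phixstargetindex 2 gamma 1.35845e+06 error 329.377"
)
print(f"{relative_error(line):.2e}")
```

Under that assumption, the warnings quoted above all come out at a few times 1e-4 of gamma.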

So it's not quite clear what went wrong here. Christine has suggested it might be associated with an issue in the handling of the coupling of atomic energy levels by collisions - an issue she encountered before in another context but that may apply here. Luke has tried to fix this to see if it helps. He is trying his own run, which has currently got to timestep 17 and seems to be doing ok so far, but we'll leave it running to see if it really works for us. This version is on Dropbox (same place as the previous model). If you'd like to give it a go, that's great; otherwise Luke will update this thread on whether the run we are trying works ok.

Was there any other evidence of something going wrong in the run you did? E.g. a segmentation fault, or a complaint from the queue system? ...or indeed anything written to standard output by the code?

Just thought I'd let everyone know the progress with running the new versions. The good news is that neither version (regular or alternative) has crashed so far. The not so good news is that the regular version is taking much longer to run: it's been going for almost 24 hours and has only gotten to timestep 21, which is well before the crash last time, so it doesn't necessarily tell us much. On the other hand, the alternative version has gotten to timestep 41 (looking back at the input file, it looks like it was set up to run to timestep 100). I assume I can kill this job and restart with some profiling? Also, any ideas on why there might be a large difference in runtimes? Both were running on 72 cores, the regular for 23 hours, the alternative for 12.
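As a rough comparison, the figures above can be turned into wall-clock hours per timestep. This Python sketch assumes that both runs started from timestep 0 and that the timestep number equals the number of steps completed; neither is confirmed here, and timesteps in the two runs need not cover the same physical time.

```python
def hours_per_timestep(wall_hours, timesteps_done):
    """Average wall-clock hours spent per completed timestep."""
    return wall_hours / timesteps_done


regular = hours_per_timestep(23, 21)      # regular version: 23 h, timestep 21
alternative = hours_per_timestep(12, 41)  # alternative version: 12 h, timestep 41

print(f"regular:     {regular:.2f} h per timestep")
print(f"alternative: {alternative:.2f} h per timestep")
print(f"ratio:       {regular / alternative:.1f}x")
```

Under those assumptions the regular version is spending somewhat under four times as long on each timestep as the alternative.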

ssim commented

@aborissov That's good to hear!

Comparing the two current simulations, the difference in run times is (broadly) to be expected. This is because the first calculation is running for early phases of the supernova explosion, when the densities are higher and thus the photon packets interact much more often. This is an interesting regime to understand the profiling for because it can be the most expensive part of a full simulation, but it does pose the challenge that simply running it may take a long time.

I don't quite know why the "new" version of the simulation that starts at early times is much slower than the one you were running before though (i.e. the one that crashed but that also started at quite early phases) - maybe @lukeshingles can speculate on that based on the change made to the code, but it might be difficult to be sure.

Yes, you should be able to kill the "alternative" one - the one that's got to timestep 41, and then start profiling on a restart of that one - that is probably the best strategy to follow for now.