SWIFTSIM/SWIFT

Gasoline and Anarchy-PU crashing with additional physics

FHusko opened this issue · 7 comments

Hi SWIFT team,

I have been running simulations with the Gasoline and Anarchy-PU hydro schemes. The setup is a spherically symmetric gas halo, initially in hydrostatic equilibrium in an external NFW potential. I have tested this setup thoroughly with SPHENIX at several resolution levels, for up to 3 Gyr. Gasoline and Anarchy-PU both crash at around 100 Myr. I don't remember the exact error I got with Anarchy-PU, but this is what I get with Gasoline:

[m7124:86540:0:86720] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2af38c120440)
[m7127:123101:0:123272] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b31b42e27a0)
==== backtrace (tid:  86720) ====
0 0x00000000004ea4d2 space_parts_get_cell_index_mapper()  ???:0
1 0x000000000049b618 threadpool_runner()  threadpool.c:0
2 0x0000000000007ea5 start_thread()  pthread_create.c:0
3 0x00000000000fe9fd __clone()  ???:0
=================================
==== backtrace (tid: 123272) ====
0 0x00000000004ea4d2 space_parts_get_cell_index_mapper()  ???:0
1 0x000000000049b618 threadpool_runner()  threadpool.c:0
2 0x0000000000007ea5 start_thread()  pthread_create.c:0
3 0x00000000000fe9fd __clone()  ???:0
=================================
[m7125:70337:0:70525] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ac87ce8cb90)
==== backtrace (tid:  70525) ====
0 0x00000000004ea4d2 space_parts_get_cell_index_mapper()  ???:0
1 0x000000000049b618 threadpool_runner()  threadpool.c:0
2 0x0000000000007ea5 start_thread()  pthread_create.c:0
3 0x00000000000fe9fd __clone()  ???:0 

The run command is

mpirun -np 8 /cosma/home/durham/dc-husk1/SWIFT_SPH/swiftsim/examples/swift_mpi --external-gravity --self-gravity --hydro --temperature --threads=14 --limiter --sync --pin params.yml

This is with 4 nodes of cosma7. I have also tried running without MPI and on cosma6, and I get errors regardless. The configure options are

--with-cooling=COLIBRE --with-chemistry=EAGLE --enable-boundary-particles=10000000 --with-hydro=gasoline --with-gravity=with-multi-softening --with-tracers=EAGLE --with-ext-potential=nfw

The parameter file contains the following:

metaData:
  run_name:   IsolatedGalaxy-EAGLE-Ref

# Define the system of units to use internally.
InternalUnitSystem:
  UnitMass_in_cgs:     1.98848e43    # 10^10 M_sun in grams
  UnitLength_in_cgs:   3.08566e21 # 1 kpc in cm
  UnitVelocity_in_cgs: 1e5           # 1 km/s in cm/s
  UnitCurrent_in_cgs:  1             # Amperes
  UnitTemp_in_cgs:     1             # Kelvin

# Parameters for the self-gravity scheme
Gravity:
  eta:          0.025                 # Constant dimensionless multiplier for time integration.
  MAC:          geometric
  theta_cr:     0.7                   # Opening angle (Multipole acceptance criterion).
  use_tree_below_softening:  0
  max_physical_DM_softening:     0.3 # Physical softening length (in internal units).
  max_physical_baryon_softening: 0.3 # Physical softening length (in internal units).
  mesh_side_length:              256

# Parameters governing the time integration (Set dt_min and dt_max to the same value for a fixed time-step run.)
TimeIntegration:
  time_begin:        0.    # The starting time of the simulation (in internal units).
  time_end:          2   # The end time of the simulation (in internal units).
  dt_min:            1e-14  # The minimal time-step size of the simulation (in internal units).
  dt_max:            1e-2  # The maximal time-step size of the simulation (in internal units).

# Parameters governing the snapshots
Snapshots:
  basename:              output      # Common part of the name of output files
  time_first:            0.          # Time of the first output if non-cosmological time-integration (in internal units)
  delta_time:            0.0125       # Time difference between consecutive outputs (in internal units)
  compression:           7           # Compress the snapshots
  select_output_on:      1
  select_output:         param_list.yml
  output_list_on:        1
  output_list:           output_list.txt

Restarts:
  delta_hours:           1

Scheduler:
  max_top_level_cells:   20

# Parameters governing the conserved quantities statistics
Statistics:
  delta_time:           1e-1     # Time between statistics output
  time_first:              0     # (Optional) Time of the first stats output if non-cosmological time-integration (in internal units)

# Parameters related to the initial conditions
InitialConditions:
  file_name:               ICs.hdf5 # The file to read
  periodic:                0            # Are we running with periodic ICs?
#  stars_smoothing_length:  0.6

# Parameters for the hydrodynamics scheme
SPH:
  resolution_eta:        1.2348   # Target smoothing length in units of the mean inter-particle separation (1.2348 == 48Ngbs with the cubic spline kernel).
  CFL_condition:         0.2      # Courant-Friedrichs-Lewy condition for time integration.
  h_min_ratio:           0.1      # Minimal smoothing in units of softening.
  h_max:                 10.
  minimal_temperature:   100.

# Standard EAGLE cooling options
EAGLECooling:
  dir_name:                /cosma6/data/dp004/dc-husk1/SWIFT/IsolatedGalaxy/IsolatedGalaxy_feedback/coolingtables/  # Location of the Wiersma+09 cooling tables
  H_reion_z:               7.5               # Redshift of Hydrogen re-ionization
  H_reion_eV_p_H:          2.0               # Energy injected by Hydrogen re-ionization in electron-volt per Hydrogen atom
  He_reion_z_centre:       3.5               # Redshift of the centre of the Helium re-ionization Gaussian
  He_reion_z_sigma:        0.5               # Spread in redshift of the  Helium re-ionization Gaussian
  He_reion_eV_p_H:         2.0               # Energy injected by Helium re-ionization in electron-volt per Hydrogen atom

# COLIBRE cooling parameters
COLIBRECooling:
  dir_name:                /cosma6/data/dp004/dc-husk1/SWIFT/IsolatedGalaxy/IsolatedGalaxy_feedback/UV_dust1_CR1_G1_shield1.hdf5 # Location of the Ploeckinger+20 cooling tables
  H_reion_z:               7.5               # Redshift of Hydrogen re-ionization (Planck 2018)
  H_reion_eV_p_H:          2.0
  He_reion_z_centre:       3.5               # Redshift of the centre of the Helium re-ionization Gaussian
  He_reion_z_sigma:        0.5               # Spread in redshift of the  Helium re-ionization Gaussian
  He_reion_eV_p_H:         2.0               # Energy inject by Helium re-ionization in electron-volt per Hydrogen atom
  delta_logTEOS_subgrid_properties: 0.3      # delta log T above the EOS below which the subgrid properties use Teq assumption
  rapid_cooling_threshold:          0.333333 # Switch to rapid cooling regime for dt / t_cool above this threshold.

# Use solar abundances
EAGLEChemistry:
  init_abundance_metal:     0.0129
  init_abundance_Hydrogen:  0.7065
  init_abundance_Helium:    0.2806
  init_abundance_Carbon:    0.00207
  init_abundance_Nitrogen:  0.000836
  init_abundance_Oxygen:    0.00549
  init_abundance_Neon:      0.00141
  init_abundance_Magnesium: 0.000591
  init_abundance_Silicon:   0.000683
  init_abundance_Iron:      0.0011

# NFW potential parameters
NFWPotential:
  useabspos:          0             # 0 -> positions based on centre, 1 -> absolute positions
  position:           [0.0,0.0,0.0] # Location of centre of the NFW potential with respect to centre of the box (internal units) if useabspos=0 otherwise with respect to the 0,0,0, coordinates.
  concentration:      5.6            # Concentration of the halo
  M_200:              10000.         # Mass of the halo (M_200 in internal units)
  critical_density:   1.36e-8       # Critical density (internal units).
  timestep_mult:      0.025          # Dimensionless pre-factor for the time-step condition, basically determines fraction of orbital time we need to do an integration step
  epsilon:            0.3
  h:                  0.7
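For reference when reading the times quoted in this thread: with the InternalUnitSystem above, the internal time unit is the unit length divided by the unit velocity. A minimal sketch of that conversion follows. The function names are made up for illustration and are not part of SWIFT, and the Julian year is assumed for the Gyr definition.

```c
#include <assert.h>

/* Seconds in one gigayear, assuming the Julian year (365.25 days). */
#define SECONDS_PER_GYR (1.0e9 * 365.25 * 86400.0)

/* Internal time unit in seconds: unit length / unit velocity (both in cgs).
 * Hypothetical helper, not a SWIFT function. */
double unit_time_in_cgs(double unit_length_cgs, double unit_velocity_cgs) {
  assert(unit_length_cgs > 0.0 && unit_velocity_cgs > 0.0);
  return unit_length_cgs / unit_velocity_cgs;
}

/* Convert a time expressed in internal units to Gyr. */
double internal_time_to_gyr(double t_internal, double unit_length_cgs,
                            double unit_velocity_cgs) {
  return t_internal * unit_time_in_cgs(unit_length_cgs, unit_velocity_cgs) /
         SECONDS_PER_GYR;
}
```

Calling internal_time_to_gyr(1.0, 3.08566e21, 1e5) gives roughly 0.98, so one internal time unit is close to 1 Gyr. The snapshot spacing delta_time: 0.0125 is then about 12 Myr, and dt_min: 1e-14 is about 1e-14 Gyr.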

I have tried configuring with debug, debugging checks and the sanitizer, but these didn't yield any additional error-related information that I could see. I am running with these again and will share the new code outputs if you think that will help.

Thanks for the help in advance!

Edit: Here's the output with debugging turned on.

[m7031:259876:0:260037] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b14b40008c0)
[m7028:86884:0:87046] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ab4d45beb50)
[m7029:265468:0:265468] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe07982770)

/cosma/home/durham/dc-husk1/SWIFT_SPH/swiftsim/src/space_cell_index.c: [ space_parts_get_cell_index_mapper() ]
      ...
      145       /* Is this a place-holder for on-the-fly creation? */
      146       ind[k] = index;
      147       cell_counts[index]++;
==>   148       ++count_extra_part;
      149
      150     } else {
      151       /* Normal case: list its top-level cell index */

==== backtrace (tid:  87046) ====
 0 0x00000000004ea5fc space_parts_get_cell_index_mapper()  /cosma/home/durham/dc-husk1/SWIFT_SPH/swiftsim/src/space_cell_index.c:148
 1 0x000000000049b24a threadpool_chomp()  /cosma/home/durham/dc-husk1/SWIFT_SPH/swiftsim/src/threadpool.c:164
 2 0x000000000049b24a threadpool_runner()  /cosma/home/durham/dc-husk1/SWIFT_SPH/swiftsim/src/threadpool.c:191
 3 0x0000000000007ea5 start_thread()  pthread_create.c:0
 4 0x00000000000fe9fd __clone()  ???:0
=================================

Hi Filip,

That is the ultimate symptom of a run where some particle got an invalid position; probably after receiving an invalid acceleration. I don't know where the problem originates exactly, but I would think that some of the subgrid schemes may not be coupled properly to these hydro schemes.
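To illustrate why an invalid position ends up as a crash in the cell-index step, here is a minimal sketch of a guarded position-to-cell mapping. It is not SWIFT's actual code: the function name, the cubic non-periodic box and the flat index layout are assumptions made for the example. The point is that converting a NaN or out-of-range coordinate to an integer gives an unusable index (the conversion itself is undefined behaviour in C), and using that index to update a per-cell counter then writes outside the array.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper, not SWIFT's implementation: map a position inside a
 * non-periodic cubic box [0, box_size) to a flat top-level cell index on a
 * cdim x cdim x cdim grid. Returns -1 if any coordinate is NaN, infinite or
 * outside the box, instead of producing a garbage index. */
long checked_cell_index(const double pos[3], double box_size, int cdim) {
  assert(box_size > 0.0 && cdim > 0);
  long idx[3];
  for (int k = 0; k < 3; ++k) {
    /* Reject invalid coordinates before any float-to-integer conversion. */
    if (!isfinite(pos[k]) || pos[k] < 0.0 || pos[k] >= box_size) return -1;
    idx[k] = (long)(pos[k] / box_size * cdim);
    if (idx[k] >= cdim) idx[k] = cdim - 1; /* guard against round-up */
  }
  return (idx[0] * cdim + idx[1]) * cdim + idx[2];
}
```

A check of this kind run over all particles just before the rebuild would point at the offending particle rather than at the out-of-bounds write.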

You could try configuring with --enable-debugging-checks --enable-debug and then run in a debugger to see whether any of the internal consistency checks trigger. You can also add -e at runtime and run in a debugger to get the code to stop exactly on the first invalid math operation.

Sorry I don't have a direct solution but hopefully more diagnostics will help.

Hi Matthieu,

Thanks for the reply! Earlier I tried restarting with some of the physics turned off (external gravity, self-gravity, no boundary particles), but the issue still popped up. I'll run with -e and see if that helps.

Any news on this? Is it still an issue?

I ended up having problems running with debugging all the way to 100 Myr (where the error happens). When I start with debugging from the beginning, I get unrelated error messages during the launching of the jets. This happens with sphenix too.

I recently changed the setup slightly for unrelated reasons. I'm running with gasoline/anarchy_pu in the updated setup. This is now hydrodynamics with temperature (no cooling) and external gravity (nfw), no other physics. I'll post the errors if I get them.

Edit: anarchy_pu now runs up until 500 Myr, at which point a single particle starts demanding successively smaller time steps (down to the smallest, 1e-14 Gyr). Not sure what's up with that. I noticed that happening earlier as well, but choosing a smaller grid in the scheduler portion of the parameter file seemed to help. This now happens with 12x12x12 top-level cells.
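As a rough picture of how a single particle can drag the time-step down, here is a sketch of a generic CFL-limited time-step and of the number of halvings needed to get from dt_max to a requested step in a power-of-two hierarchy. The formula dt = C_CFL * h / v_sig is the textbook form and the function names are made up for the example. The exact expression SWIFT uses (kernel factors, other limits) is not shown in this thread.

```c
#include <assert.h>

/* Generic CFL-limited time-step: dt = C_CFL * h / v_sig.
 * Textbook form, assumed for illustration only. */
double cfl_timestep(double cfl, double h, double v_sig) {
  assert(cfl > 0.0 && h > 0.0 && v_sig > 0.0);
  return cfl * h / v_sig;
}

/* Number of times dt_max must be halved until it is no larger than dt,
 * i.e. the depth in a power-of-two time-step hierarchy. */
int timebin_depth(double dt, double dt_max) {
  assert(dt > 0.0 && dt_max > 0.0);
  int depth = 0;
  while (dt_max > dt) {
    dt_max *= 0.5;
    ++depth;
  }
  return depth;
}
```

With dt_max = 1e-2 and dt_min = 1e-14 from the parameter file, timebin_depth(1e-14, 1e-2) is 40. A particle whose signal velocity keeps growing, or whose smoothing length keeps shrinking, therefore has to walk down about 40 levels before it hits the floor.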

Gasoline runs until 150 Myr and then fails with a segmentation fault. Running with debugging, sanitizer and -e from the start ends with a floating point exception right away:

[00531.4] space_rebuild: (re)building space
[01666.1] engine_init_particles: Converting internal energy variable.
[01672.0] engine_init_particles: Running initial fake time-step.
[01672.0] space_rebuild: (re)building space
./run.sh: line 31: 174975 Floating point exception /cosma/home/durham/dc-husk1/SWIFT_SPH/swiftsim/examples/swift -e --external-gravity --hydro --temperature --threads=28 --limiter --sync --pin isolated_galaxy.yml

This makes me think it's to do with the initial conditions, perhaps a particle too close to the centre of the potential. However, I made sure that the one exactly at (0,0,0) is not present.

Made some changes to the setup again. Gasoline now complains during the jet launching phase, so this may be more related to issues with the physics I added. The specific error I get is

[01686.5] space.c:space_check_limiter_mapper():2176: Synchronized particle not treated! id=0 synchronized=1

This happens immediately after I launch a pair of particles as part of the jet kicking.

That would indicate an issue with the time-step limiter not working properly on the particle you added. Which, if true, would likely lead to issues down the line.

Now, that should not depend on the hydro flavour, as the limiter is the same in all hydro schemes.

Okay, that must mean I am doing something else in Gasoline in terms of the limiter, because anarchy_pu, sphenix and (to be checked) minimal sph all run fine. Will check that out.

Edit: Oddly enough, the code is identical for Gasoline and the other schemes. In particular, I call timestep_sync_part(p) at the end of the kicking event for the particle.