SWIFTSIM/SWIFT

Interacting unsorted cells

rennehan opened this issue · 15 comments

Hello,

I am adding a black hole model into Swift and encountered the following fatal error:

1707 4.251840e-04 0.0830958 11.0343041 9.438521e-08 41 41 2 2 0 0 0 3.139 0 1.375
[03640.3] runner_doiact_functions_hydro.h:runner_dopair_subset_branch_density():883: Interacting unsorted cells.

My commits/branch are public and located here:

https://github.com/romeeld/swiftsim/commits/rennehan_feature_blackholes

We are running the EAGLE s25n188 volumes with SPHENIX hydrodynamics, COLIBRE cooling, and EAGLE stars. We are using SIMBA feedback and SIMBA black holes.

My branch is based on the "Swimba" branch. In Swimba, we added stellar winds that are decoupled from the hydrodynamics scheme, and we have not encountered this error with that decoupled-winds scheme. After I added the black hole model, I receive this error very early (z ~ 11). I implemented a kinetic kick for black hole feedback, and I also decouple the black hole winds for a short period of time.

All decoupled winds skip the cooling step completely (in the runner_others.c file in my branch, line 157). Decoupled winds are also not considered for accretion/feedback in the black hole modules. We do allow decoupled winds to contribute to the hydro density and to all other density loops.
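For concreteness, the cooling skip amounts to a guard like the following (a minimal, stand-alone sketch rather than the actual runner_others.c code; only the decoupling_delay_time field name is from my branch, everything else is simplified):

/* Minimal sketch: decoupled winds skip the cooling step entirely. */
struct part_sketch {
  float u; /* internal energy */
  struct {
    float decoupling_delay_time; /* > 0 while hydrodynamically decoupled */
  } feedback_data;
};

static void cooling_sketch(struct part_sketch *p, const double dt) {
  /* Decoupled winds do not cool at all until they recouple. */
  if (p->feedback_data.decoupling_delay_time > 0.f) return;

  /* ... the usual cooling of p->u over the time-step dt goes here ... */
  (void)dt;
}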

At this time, there has been no black hole feedback, but there has been black hole accretion. Our black hole accretion model relies on a new loop over the star-particle neighbours of black hole particles (a BH loop over star particles). I added 3 additional loops in:

runner_doiact_functions_black_holes.h

They can be found in that file by searching for the "bh_stars_loop_is_active" function. After the stellar properties are calculated, the black holes nibble on gas particles and take a small fraction of their mass (about 10%). The gas particles are temporarily marked as "to be swallowed" until the black hole feedback step, when they are marked "not swallowed" again since we are only nibbling.
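In pseudo-form, the nibbling is along these lines (a simplified, stand-alone sketch; the struct and function names are only illustrative, and the 10% fraction is a parameter of our model):

/* Illustrative sketch of the nibbling logic (not the actual code). */
struct gas_sketch {
  float mass;
  long long swallow_id; /* id of the nibbling BH, or -1 if not marked */
};

struct bh_sketch {
  long long id;
  float mass;
};

/* Accretion step: take a small fraction of the gas mass and temporarily
 * mark the particle as "to be swallowed" by this BH. */
static void bh_nibble_sketch(struct bh_sketch *bh, struct gas_sketch *p,
                             const float f_nibble /* ~0.1 */) {
  const float dm = f_nibble * p->mass;
  p->mass -= dm;
  bh->mass += dm;
  p->swallow_id = bh->id;
}

/* Feedback step: we only nibble, so the particle survives and the mark
 * is simply cleared again ("not swallowed"). */
static void bh_unmark_sketch(struct gas_sketch *p) { p->swallow_id = -1; }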

If there is any other information I can give, please let me know.

Thank you!

Could you describe how the decoupling is implemented?

Also, how did you guarantee that the BH action takes place after the stellar properties are computed?

In hydro_iact.h for SPHENIX we have:

if (pi->feedback_data.decoupling_delay_time <= 0.f &&
    pj->feedback_data.decoupling_delay_time <= 0.f) {
  pi->a_hydro[0] -= mj * acc * dx[0];
  pi->a_hydro[1] -= mj * acc * dx[1];
  pi->a_hydro[2] -= mj * acc * dx[2];

  pj->a_hydro[0] += mi * acc * dx[0];
  pj->a_hydro[1] += mi * acc * dx[1];
  pj->a_hydro[2] += mi * acc * dx[2];
}

The decoupling_delay_time parameter is set to some value during a stellar feedback/black hole feedback kick event. In the following function, we decrease the delay time until it reaches zero:

__attribute__((always_inline)) INLINE static void feedback_update_part(
    struct part* p, struct xpart* xp, const struct engine* e, 
    const int with_cosmology) {

  /* No reason to do this if the decoupling time is zero */
  if (p->feedback_data.decoupling_delay_time > 0.f) {
    const integertime_t ti_step = get_integer_timestep(p->time_bin);
    const integertime_t ti_begin =
        get_integer_time_begin(e->ti_current - 1, p->time_bin);

    /* Get particle time-step */
    double dt_part;
    if (with_cosmology) {
      dt_part = cosmology_get_delta_time(e->cosmology, ti_begin,
                                          ti_begin + ti_step);
    } else {
      dt_part = get_timestep(p->time_bin, e->time_base);
    }

    p->feedback_data.decoupling_delay_time -= dt_part;
    if (p->feedback_data.decoupling_delay_time <= 0.f) {
      p->feedback_data.decoupling_delay_time = 0.f;
    }
  } else {
    p->feedback_data.decoupling_delay_time = 0.f;
  }
}
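For completeness, the other half of the mechanism, setting the delay at kick time, looks roughly like this (a sketch with placeholder names and values; only the decoupling_delay_time field mirrors my branch):

/* Sketch of the kick-time side: when a particle receives a stellar or BH
 * feedback kick, it is flagged as decoupled for a fixed interval. */
struct wind_part_sketch {
  float v[3];
  struct { float decoupling_delay_time; } feedback_data;
};

static void wind_kick_sketch(struct wind_part_sketch *p,
                             const float v_kick[3],
                             const float decouple_time) {
  for (int k = 0; k < 3; k++) p->v[k] += v_kick[k];

  /* The particle now ignores hydro forces until this timer runs out. */
  p->feedback_data.decoupling_delay_time = decouple_time;
}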

feedback_update_part is called right before drifting. The loop over star particles for each black hole is done in the black hole density loop (I hope?) by using the following:

#if (FUNCTION_TASK_LOOP == TASK_LOOP_DENSITY)
      /* Only do bh-stars loop if necessary for the model. */
      if (bh_stars_loop_is_active(bi, e)) {
        for (int sjd = 0; sjd < scount; sjd++) {
          struct spart *restrict sj = &sparts[sjd];

          /* Compute the pairwise distance. */
          const float sjx[3] = {(float)(sj->x[0] - c->loc[0]),
                                (float)(sj->x[1] - c->loc[1]),
                                (float)(sj->x[2] - c->loc[2])};
          const float dx[3] = {bix[0] - sjx[0], bix[1] - sjx[1], bix[2] - sjx[2]};
          const float r2 = dx[0] * dx[0] + dx[1] * dx[1] + dx[2] * dx[2];

          /* Star is within smoothing length of black hole */
          if (r2 < hig2) {
            runner_iact_nonsym_bh_stars_density(r2, dx, bi, sj);
          }
        }
        /* TODO: One possible speed-up is just to flag the BH id
         * that each star is associated with in the previous loop,
         * and then just use that to loop instead of doing the distance
         * calculation every time.
         */

        /* Now that we have the angular momentum, find the bulge mass */
        for (int sjd = 0; sjd < scount; sjd++) {
          struct spart *restrict sj = &sparts[sjd];

          /* Compute the pairwise distance. */
          const float sjx[3] = {(float)(sj->x[0] - c->loc[0]),
                                (float)(sj->x[1] - c->loc[1]),
                                (float)(sj->x[2] - c->loc[2])};
          const float dx[3] = {bix[0] - sjx[0], bix[1] - sjx[1], bix[2] - sjx[2]};
          const float r2 = dx[0] * dx[0] + dx[1] * dx[1] + dx[2] * dx[2];

          /* Star is within smoothing length of black hole */
          if (r2 < hig2) {
            runner_iact_nonsym_bh_stars_bulge(r2, dx, bi, sj);
          }
        }
      }
#endif

That loop is done right before the loop over all gas neighbours. The same code is run in 3 different functions in the black hole runner:

DOSELF1_BH
DO_NONSYM_PAIR1_BH_NAIVE
DOSELF1_SUBSET_BH

The BH bits seem sensible.

Why not also change DOPAIR1_SUBSET_BH_NAIVE?

The decoupling is not doing what is usually understood by decoupling in the literature, I think. All you are doing here is not adding hydro accelerations to the particle. It will still participate in all the other loops and pieces of code, for instance in the density estimate or in feedback.
I don't know whether this leads to issues, however. Likely not, but it means the interactions are not consistent between the loops.

You could also try switching on all the internal consistency checks by configuring with --enable-debugging-checks.

Thank you for noticing that, I will also add it into that function.

I looked deeper into the decoupling and noticed that it was NOT applied in runner_iact_nonsym_force for non-symmetric interactions. I changed it so that the hydro accelerations and the energy transfer are also skipped there. I do not get the error now, so hopefully that was the solution. I also made sure that all of the other sub-grid physics loops always ignore decoupled particles, just in case.
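In essence the change applies the same guard as in the symmetric interaction above, roughly like this (a simplified sketch, not the actual SPHENIX kernel; acc and du_dt_i stand in for the quantities computed earlier in the interaction):

/* Sketch of the guard in the non-symmetric force interaction: pi receives
 * no hydro acceleration and no energy exchange if either particle of the
 * pair is still decoupled (simplified fields, illustrative only). */
struct force_part_sketch {
  float a_hydro[3];
  float u_dt;
  struct { float decoupling_delay_time; } feedback_data;
};

static void iact_nonsym_force_sketch(struct force_part_sketch *pi,
                                     const struct force_part_sketch *pj,
                                     const float dx[3], const float mj,
                                     const float acc, const float du_dt_i) {
  if (pi->feedback_data.decoupling_delay_time > 0.f ||
      pj->feedback_data.decoupling_delay_time > 0.f)
    return;

  for (int k = 0; k < 3; k++) pi->a_hydro[k] -= mj * acc * dx[k];
  pi->u_dt += du_dt_i;
}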

I will leave this open in case it pops up again.

It does seem like the code is sensitive to decoupling (as it should be!). One question came out of digging deeper:

What about the time derivative of h?

p->force.h_dt

In the default decoupling implementation that I began with, it was still being computed and updated even for decoupled particles. I ran a test where I did not update h_dt for particles interacting with a wind, and the simulation did not crash. That should be the correct thing to do, right?

Thanks.

dh/dt is not super important as it is mainly used to get a good guess at h for the next step.

If you do ignore the decoupled particles in the density loop (as I think you should), then I would also not update dh/dt, such that the estimate of h is (more) consistent with the density field.

Note that you should also not update du/dt in the force loops for the decoupled particles.
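Schematically, that means something like the following in the loops (a sketch with simplified fields; only p->force.h_dt and decoupling_delay_time mirror the real field names):

/* Sketch: skip decoupled neighbours in the density loop and do not
 * accumulate dh/dt (or du/dt) for pairs involving a decoupled particle,
 * so that h stays consistent with the density the particle actually saw. */
struct decoupled_part_sketch {
  float rho;
  struct { float h_dt; } force;
  struct { float decoupling_delay_time; } feedback_data;
};

static int pair_is_decoupled(const struct decoupled_part_sketch *pi,
                             const struct decoupled_part_sketch *pj) {
  return pi->feedback_data.decoupling_delay_time > 0.f ||
         pj->feedback_data.decoupling_delay_time > 0.f;
}

static void density_accumulate_sketch(struct decoupled_part_sketch *pi,
                                      const struct decoupled_part_sketch *pj,
                                      const float mj, const float wi) {
  if (pair_is_decoupled(pi, pj)) return; /* neighbour ignored entirely */
  pi->rho += mj * wi;
}

static void h_dt_accumulate_sketch(struct decoupled_part_sketch *pi,
                                   const struct decoupled_part_sketch *pj,
                                   const float dh_dt_i) {
  if (pair_is_decoupled(pi, pj)) return; /* keep dh/dt consistent too */
  pi->force.h_dt += dh_dt_i;
}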

Right, thanks for the info. Indeed we are not computing du/dt for decoupled particles.

I forgot to mention that I tried decoupling particles from the main hydro density loop. However, that led to serious issues with particles having zero neighbours and maximum softening lengths, and then having a time-step of 0 and crashing SWIFT.

It is possible for me to go through and simply ignore these errors until recoupling though, so I might test that out.

The tests I ran made it to z = 0 and nothing looks amiss, so for now I will leave it as a TODO.

Ok. I can imagine that it will go wrong indeed. As mentioned above, there is a lot more that should be done to properly define decoupled particles in the way usually assumed.

But if this now works for you, then that's the main bit. Glad it's resolved. :)

This could possibly be a compiler problem. I was seeing this with gcc/11.x compilers on some machines in Edinburgh, but I have not seen the problem since I moved to our Canadian cluster and have been using the latest Intel compilers. I haven't tested extensively, but will update if I get a chance. Definitely not fixed yet!

That sounds very suspicious. If that is really what is going on, then I would rather think that something somewhere relies on ill-defined behaviour, or on some form of thread race condition that different compilers handle differently. But you are then relying on bad coding to get things to work in the one compiler that seemingly does not crash the code.

Indeed, this is just conjecture at this point. I will note I have not seen the problem on any system but the one in Edinburgh. I did run the EAGLE-XL model on the CCA cluster, and did not see the issue with a gcc/11.x compiler.

One small note so that I don't forget: the implementation of the DM velocity dispersion won't be robust over MPI. That's not a problem for now, but if you intend to scale things up, additional communications will be required in various places.

Hi Doug,

We may have fixed something related to the original problem last week. I don't know whether it is actually the same situation you are facing here, but there have been code-logic changes around that section.
Note that we are still confused about the GCC vs. ICC difference.

May or may not help.

Are you still encountering this issue?

We have not seen this issue since last time we spoke.

Let's re-open or create a new issue if that problem reappears then.