StanfordLegion/legion

Realm::TimeLimit floating point exception

Closed this issue · 16 comments

Our FleCSI application runs fine on 2 GPUs (1 per rank), but on 3 and 4 GPUs Realm throws a floating point exception. Here is the backtrace:

(gdb) bt
#0  0x00001471c73c5cc1 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00001471c73cb9c3 in nanosleep () from /lib64/libc.so.6
#2  0x00001471c73cb8da in sleep () from /lib64/libc.so.6
#3  0x00001471cdcb8bea in Realm::realm_freeze(int) ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#4  <signal handler called>
#5  0x00001471cdd063a2 in Realm::Cuda::GPUIndirectXferDes::progress_xd(Realm::Cuda::GPUIndirectChannel*, Realm::TimeLimit) ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#6  0x00001471cdd0e05d in Realm::XDQueue<Realm::Cuda::GPUIndirectChannel, Realm::Cuda::GPUIndirectXferDes>::do_work(Realm::TimeLimit) ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#7  0x00001471cdc17b01 in Realm::BackgroundWorkManager::Worker::do_work(long long, Realm::atomic<bool>*) ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#8  0x00001471cdc174b8 in Realm::BackgroundWorkThread::main_loop() ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#9  0x00001471cdcdff7e in Realm::KernelThread::pthread_entry(void*) ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#10 0x00001471ccf986ea in start_thread () from /lib64/libpthread.so.0
#11 0x00001471c7401a6f in clone () from /lib64/libc.so.6

This run used the following commit:

commit 21500c7e3eb7f123b8e6c3ec2cbf8356febe3989 (HEAD)
Author: Mike <mebauer@cs.stanford.edu>
Date:   Fri Feb 23 01:02:04 2024 -0800
 
    legion: fix a bug in the application of remote overwrite physical analyses

Our application works fine with this older commit:

commit 45afa8e658ae06cb19d8f0374de699b7fe4a197c (HEAD)
Merge: 0db333c9d 4dd12470a
Author: Mike <mebauer@cs.stanford.edu>
Date:   Mon Jul 31 00:57:19 2023 -0700
 
    legion: merge master into control replication and resolve conflicts
and

This is crashing in the cuda-dma gather/scatter channel. @jpietarilagraham Can you describe what you are running, and whether this is something that has never been tested before or something that came up with the new commits?

If you can give me a reproducer on sapling2, I should be able to triage it rather quickly.

I think the code that trips it is our variable time stepping, where I pass dt around as a future.

The dt value won't have any bearing on how the gather copies are being executed.

Yes, dt won't really matter here. I am afraid we are running into a bug inside the gather/scatter transfer descriptor, some case that isn't covered by our test suite.

We are currently having an offline discussion on Zulip with @jpietarilagraham about how best to root-cause and fix the crash. Just an FYI for whoever reads this thread. I will keep posting updates here as well.

I will be sending out a patch shortly.

Do we want this in the March release?

Practically speaking, it's unlikely that we will run into this issue, but ideally we should fix it since it is a bug.

We have verified that the patch fixes the problem. I will be submitting that shortly. @jpietarilagraham Could you please confirm?

@apryakhin Is there an MR attached to this?

Please ping me directly before merging any changes this week.

@elliottslaughter I will CC you on the merge request.

!1150 fixes the problem.

The patch has been pushed.

The fix is now in the release candidate for the upcoming release: https://gitlab.com/StanfordLegion/legion/-/commits/rc

@jpietarilagraham Please close once you confirm the issue is resolved.

I believe this is fixed, but feel free to reopen if there is something else.