Realm::TimeLimit floating point exception

Question

Closed this issue 5 months ago · 16 comments

Our FleCSI application runs fine on 2 GPUs (1 per rank), but on 3 or 4 GPUs Realm throws a floating point exception. Here is the backtrace:

(gdb) bt
#0  0x00001471c73c5cc1 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00001471c73cb9c3 in nanosleep () from /lib64/libc.so.6
#2  0x00001471c73cb8da in sleep () from /lib64/libc.so.6
#3  0x00001471cdcb8bea in Realm::realm_freeze(int) ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#4  <signal handler called>
#5  0x00001471cdd063a2 in Realm::Cuda::GPUIndirectXferDes::progress_xd(Realm::Cuda::GPUIndirectChannel*, Realm::TimeLimit) ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#6  0x00001471cdd0e05d in Realm::XDQueue<Realm::Cuda::GPUIndirectChannel, Realm::Cuda::GPUIndirectXferDes>::do_work(Realm::TimeLimit) ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#7  0x00001471cdc17b01 in Realm::BackgroundWorkManager::Worker::do_work(long long, Realm::atomic<bool>*) ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#8  0x00001471cdc174b8 in Realm::BackgroundWorkThread::main_loop() ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#9  0x00001471cdcdff7e in Realm::KernelThread::pthread_entry(void*) ()
   from /users/jgraham/RISTRA/.spack-develop-ede36512e/var/spack/environments/cuda-23-12-08-cr25/.spack-env/view/lib64/librealm.so.1
#10 0x00001471ccf986ea in start_thread () from /lib64/libpthread.so.0
#11 0x00001471c7401a6f in clone () from /lib64/libc.so.6
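
As background on the symptom: on Linux, "floating point exception" is the usual description of SIGFPE, which is also delivered for integer divide errors (for example, integer division by zero on x86-64), so the signal does not by itself imply floating-point arithmetic. Frames #3 and #4 show the signal being caught by a handler that parks the thread in Realm::realm_freeze. The thread does not state what triggered the signal inside progress_xd, so the sketch below only illustrates the general defensive pattern. The helper elems_in_budget is hypothetical and is not a Realm function.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of Realm): compute how many whole elements of
// `elem_size` bytes fit into `budget_bytes`. Integer division by zero is
// undefined behavior in C++ and typically traps as SIGFPE on x86-64 Linux,
// so the zero-size case is rejected before dividing.
bool elems_in_budget(std::size_t budget_bytes, std::size_t elem_size,
                     std::size_t& out) {
  if (elem_size == 0) {
    return false;  // caller must handle the degenerate case explicitly
  }
  out = budget_bytes / elem_size;
  assert(out * elem_size <= budget_bytes);  // sanity check on the result
  return true;
}
```

A caller would check the return value and skip or report the degenerate transfer, rather than letting the division trap.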

This run used the following commit:

commit 21500c7e3eb7f123b8e6c3ec2cbf8356febe3989 (HEAD)
Author: Mike <mebauer@cs.stanford.edu>
Date:   Fri Feb 23 01:02:04 2024 -0800
 
    legion: fix a bug in the application of remote overwrite physical analyses

Our application works fine with this older commit:

commit 45afa8e658ae06cb19d8f0374de699b7fe4a197c (HEAD)
Merge: 0db333c9d 4dd12470a
Author: Mike <mebauer@cs.stanford.edu>
Date:   Mon Jul 31 00:57:19 2023 -0700
 
    legion: merge master into control replication and resolve conflicts
and

Answer 1 · 2024-03-07T20:12:04.000Z

This is crashing in the CUDA DMA gather/scatter channel. @jpietarilagraham Can you describe what you are running, and whether this is something that has never been tested before or something that came up with the new commits?
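
For readers unfamiliar with the term: a gather/scatter (indirect) copy takes its source or destination positions from an index buffer instead of copying one contiguous range, which matches the GPUIndirectXferDes frame in the backtrace. The following is a minimal CPU-side sketch of the access pattern only. The function names gather and scatter are illustrative assumptions, and this is not Realm's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gather: dst[i] = src[idx[i]]. Read positions come from the index buffer.
std::vector<double> gather(const std::vector<double>& src,
                           const std::vector<std::size_t>& idx) {
  std::vector<double> dst(idx.size());
  for (std::size_t i = 0; i < idx.size(); ++i) {
    assert(idx[i] < src.size());  // every index must point inside src
    dst[i] = src[idx[i]];
  }
  return dst;
}

// Scatter: dst[idx[i]] = src[i]. Write positions come from the index buffer.
void scatter(const std::vector<double>& src,
             const std::vector<std::size_t>& idx,
             std::vector<double>& dst) {
  assert(src.size() == idx.size());  // one index per source element
  for (std::size_t i = 0; i < idx.size(); ++i) {
    assert(idx[i] < dst.size());  // every index must point inside dst
    dst[idx[i]] = src[i];
  }
}
```

On a GPU the same pattern is executed by a copy kernel or DMA engine, with the index buffer itself living in device-accessible memory.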

Answer 2 · 2024-03-07T20:13:16.000Z

If you can give me a reproducer on sapling2, I should be able to triage it rather quickly.

Answer 3 · 2024-03-07T20:29:03.000Z

I think what is tripping it is that we're running variable time steps and I'm passing dt around as a future.

Answer 4 · 2024-03-07T20:45:07.000Z

The dt value won't have any bearing on how the gather copies are being executed.

Answer 5 · 2024-03-07T21:06:56.000Z

Yes, dt won't really matter here. I am afraid we are running into a bug inside the gather/scatter transfer descriptor, some case that isn't covered by our test suite.

Answer 6 · 2024-03-07T21:16:21.000Z

We are currently having an offline discussion on Zulip with @jpietarilagraham about how best to root-cause and fix the crash. Just an FYI for whoever reads this thread. I will keep posting updates here as well.

Answer 7 · 2024-03-14T17:46:18.000Z

I will be sending out a patch shortly.

Answer 8 · 2024-03-14T17:51:28.000Z

Do we want this in the March release?

Answer 9 · 2024-03-14T17:57:21.000Z

Practically speaking, it's unlikely that we will run into this issue, but ideally we should fix it since it's a bug.

Answer 10 · 2024-03-19T14:33:28.000Z

We have verified that the patch fixes the problem. I will be submitting that shortly. @jpietarilagraham Could you please confirm?

Answer 11 · 2024-03-19T15:53:54.000Z

@apryakhin Is there an MR attached to this?

Please ping me directly before merging any changes this week.

Answer 12 · 2024-03-19T19:06:46.000Z

@elliottslaughter I will CC you on the merge request.

Answer 13 · 2024-03-19T21:11:57.000Z

!1150 fixes the problem.

Answer 14 · 2024-03-20T20:42:08.000Z

The patch has been pushed.

Answer 15 · 2024-03-20T21:24:57.000Z

The fix is now in the release candidate for the upcoming release: https://gitlab.com/StanfordLegion/legion/-/commits/rc

@jpietarilagraham Please close once you confirm the issue is resolved.

Answer 16 · 2024-04-01T16:17:38.000Z

I believe this is fixed, but feel free to reopen if there is something else.