possible memory corruption with many LPs on one process
carns opened this issue · 3 comments
If I build ROSS and my model using gcc's address sanitizer and then run a model with 1920 LPs on a single MPI process, I hit the following crash:
==17983==ERROR: AddressSanitizer: unknown-crash on address 0x7ffff36b67d0 at pc 0x488b5e bp 0x7fffffffd760 sp 0x7fffffffd750
READ of size 56 at 0x7ffff36b67d0 thread T0
#0 0x488b5d in late_sanity_check /home/pcarns/working/src/ROSS/core/tw-setup.c:277
#1 0x488cf7 in tw_run /home/pcarns/working/src/ROSS/core/tw-setup.c:304
#2 0x404ebd in main ../src/models/storage/triton/triton-fault-sim.c:209
#3 0x7ffff6425a3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a3f)
#4 0x404558 in _start (/home/pcarns/working/src/codes-triton/build/src/models/storage/triton/triton-fault-sim+0x404558)
0x7ffff36b6800 is located 0 bytes to the right of 524288-byte region [0x7ffff3636800,0x7ffff36b6800)
allocated by thread T0 here:
#0 0x7ffff6f54827 in __interceptor_malloc (/usr/lib/x86_64-linux-gnu/libasan.so.1+0x57827)
#1 0x48ab0c in my_malloc /home/pcarns/working/src/ROSS/core/tw-util.c:229
#2 0x48a7aa in pool_alloc /home/pcarns/working/src/ROSS/core/tw-util.c:161
#3 0x48a9fe in tw_calloc /home/pcarns/working/src/ROSS/core/tw-util.c:208
#4 0x479800 in init_q /home/pcarns/working/src/ROSS/core/network-mpi.c:82
#5 0x479cdf in tw_net_start /home/pcarns/working/src/ROSS/core/network-mpi.c:143
#6 0x4879f2 in tw_init /home/pcarns/working/src/ROSS/core/tw-setup.c:72
#7 0x404b95 in main ../src/models/storage/triton/triton-fault-sim.c:148
#8 0x7ffff6425a3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a3f)
If I look at it in gdb I see this:
#7 0x0000000000488b5e in late_sanity_check ()
at /home/pcarns/working/src/ROSS/core/tw-setup.c:277
277 if (!memcmp(&lp->type, &null_type, sizeof(null_type))) {
(gdb) print i
$1 = 1911
(gdb) print &lp->type
$2 = (tw_lptype **) 0x7ffff36b67d0
I apologize that I don't have a concrete test case, but maybe someone happens to know why it looks like I'm running off the end of a memory allocation at lp->type on the 1911th iteration?
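The numbers in the report above can be checked by hand. Here is a minimal sketch of that arithmetic; the helper name bytes_past_end is made up for illustration and is not part of ROSS or AddressSanitizer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of ROSS or ASan): how many bytes a read of
 * `size` bytes starting at `addr` extends past `region_end`, the exclusive
 * end of the allocation. Returns 0 if the read stays in bounds. */
uint64_t bytes_past_end(uint64_t addr, uint64_t size, uint64_t region_end)
{
    assert(addr <= UINT64_MAX - size);   /* guard against wrap-around */
    uint64_t read_end = addr + size;     /* exclusive end of the read */
    return read_end > region_end ? read_end - region_end : 0;
}
```

With the values from the first report, bytes_past_end(0x7ffff36b67d0, 56, 0x7ffff36b6800) is 8: the 56-byte read starts 48 bytes before the end of the 524288-byte region and runs 8 bytes past it.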
Confirmed with the airport model. A quick hack to trigger this is to change airport.c:192 from this:
nlp_per_pe /= (tw_nnodes() * g_tw_npe);
to this:
nlp_per_pe = 1920;
(I understand that this hack probably breaks the model badly in the big picture; I just wanted to see whether it triggers the same issue in late_sanity_check().) It appears to:
==19418==ERROR: AddressSanitizer: unknown-crash on address 0x7fb28268e7d0 at pc 0x4190d0 bp 0x7ffffb812170 sp 0x7ffffb812160
READ of size 56 at 0x7fb28268e7d0 thread T0
#0 0x4190cf in late_sanity_check /home/pcarns/working/src/ROSS/core/tw-setup.c:277
#1 0x419269 in tw_run /home/pcarns/working/src/ROSS/core/tw-setup.c:304
#2 0x4050c5 in main /home/pcarns/working/src/ROSS/models/ROSS-Models/airport/airport.c:203
#3 0x7fb2853fda3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a3f)
#4 0x403738 in _start (/home/pcarns/working/src/ROSS/models/ROSS-Models/airport/airport+0x403738)
0x7fb28268e800 is located 0 bytes to the right of 524288-byte region [0x7fb28260e800,0x7fb28268e800)
allocated by thread T0 here:
#0 0x7fb285d24827 in __interceptor_malloc (/usr/lib/x86_64-linux-gnu/libasan.so.1+0x57827)
#1 0x41b07e in my_malloc /home/pcarns/working/src/ROSS/core/tw-util.c:229
#2 0x41ad1c in pool_alloc /home/pcarns/working/src/ROSS/core/tw-util.c:161
#3 0x41af70 in tw_calloc /home/pcarns/working/src/ROSS/core/tw-util.c:208
#4 0x409d72 in init_q /home/pcarns/working/src/ROSS/core/network-mpi.c:82
#5 0x40a251 in tw_net_start /home/pcarns/working/src/ROSS/core/network-mpi.c:143
#6 0x417f64 in tw_init /home/pcarns/working/src/ROSS/core/tw-setup.c:72
#7 0x404f5f in main /home/pcarns/working/src/ROSS/models/ROSS-Models/airport/airport.c:190
#8 0x7fb2853fda3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a3f)
Are you sure this isn't an issue with AddressSanitizer or your local setup?
I can run the airport model with over 9000 LPs on a single core (sequential mode) on my machine. When I finally allocate too many LPs, my program fails somewhere else.
It turns out that the number of LPs was a red herring; it was really just luck of the draw for that specific case to cause memcmp() to look at memory that AddressSanitizer would complain about. AddressSanitizer was right but my ability to draw conclusions from its output was not.
The fix is a single-character change, but I'm doing a pull request to get more practice doing things the GitHub way.
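For anyone who lands here later, a minimal sketch of this class of bug. It assumes the single character in question is the `&` in front of lp->type; the gdb output shows &lp->type has type tw_lptype **, which is consistent with that, but the pull request is the authority. The demo_lptype and demo_lp structs and the lp_type_is_unset function are hypothetical stand-ins, not the real ROSS definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for tw_lptype and tw_lp; the real ROSS structs
 * have different members. */
typedef struct {
    void (*init)(void *state);
    void (*event)(void *state);
    void (*final)(void *state);
    size_t state_sz;
} demo_lptype;

typedef struct {
    int id;
    demo_lptype *type;   /* a pointer to the type, not the type itself */
} demo_lp;

static const demo_lptype null_type = {0};

/* Returns 1 if the LP's type struct is still all zeros, else 0. */
int lp_type_is_unset(const demo_lp *lp)
{
    assert(lp != NULL && lp->type != NULL);
    /* Suspect form: memcmp(&lp->type, &null_type, sizeof(null_type))
     * starts at the pointer field inside the LP and reads
     * sizeof(null_type) bytes from there, well past the pointer itself. */
    return memcmp(lp->type, &null_type, sizeof(null_type)) == 0;
}
```

In the suspect form the comparison covers whatever happens to follow the pointer field in memory, so it only trips AddressSanitizer when that field sits close enough to the end of an allocation. That would explain why the LP count looked like the trigger.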