ut-parla/Parla.py

Segfaults in get_nprocs

insertinterestingnamehere opened this issue · 8 comments

We've been seeing mysterious segfaults in get_nprocs when threads are used together with VECs. The exact conditions that trigger this aren't known since lots of things still appear to work fine.

With the ARPACK demo this does show up, but only when many copies are used (e.g. one ARPACK copy per core, so increase the limit and then run 24 copies or so). I most recently saw it there while massively oversubscribed, though, since I wasn't setting OMP_NUM_THREADS yet. I wasn't able to get an informative backtrace beyond seeing get_nprocs at the bottom of it.

@hfingler saw segfaults like this several times when debugging the Galois/VECs demo. Several crashing threads printed their backtraces at once, so the raw output was interleaved; deduplicated, the frames from two of the traces were:

(killpg+0x40)
(get_nprocs+0x11f)
(arena_get2.part.4+0x19b)
(tcache_init.part.6+0xb9)
(__libc_malloc+0xde)

Another one:

----- Galois setting # threads to 24
Galois: load_file:304 0x7ff880002680
Reading from file: inputs/r4-2e26.gr

(handler+0x28)
(killpg+0x40)
(get_nprocs+0x11f)
(arena_get2.part.4+0x19b)
(tcache_init.part.6+0xb9)
(__libc_malloc+0xde)
(tls_get_addr_tail+0x165)
(__tls_get_addr+0x38)
(_ZTHN6galois9substrate10ThreadPool6my_boxE+0x14)
(_ZTWN6galois9substrate10ThreadPool6my_boxE+0x9)

@sestephens73 at one point saw this one as well when working on the matmul demo (I'm not sure what the workaround to avoid this there was):

0x7fa9598e2188: (handler+0x28)
0x7fa95c966400: (killpg+0x40)
0x7fa95bf7837f: (get_nprocs+0x11f)
0x7fa95bf02aab: (arena_get2.part.4+0x19b)

Here's my current best theory for what might be causing this: our overrides in libparla_context may not be getting correctly preloaded. The resulting shared object lists libc as a dependency in its ELF header. Since the way we "preload" it into each VEC is just to dlmopen libparla_context, libc's definitions probably end up ahead of our overrides. Consistent with that, the only overrides we've actually observed being called successfully from within a VEC are the ones wrapping pthreads routines. I think the fix is to build libparla_context with undefined symbols so that it doesn't explicitly list libc as a dependency. That would let us do the equivalent of LD_PRELOAD, but within a linker namespace.
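
For reference, here's a minimal sketch of what that dlmopen-based "preload" looks like, with the proposed no-libc-dependency build shown in the comments. The file name overrides.c and the exact flags are just illustrative; the real loading code in Parla presumably looks different.

/* Sketch only: load the override library into a fresh linker namespace.
 * The proposed fix is to build it WITHOUT an explicit libc dependency,
 * e.g. something like
 *
 *   gcc -shared -fPIC -nodefaultlibs -o libparla_context.so overrides.c
 *
 * and then confirm libc is no longer listed:
 *
 *   readelf -d libparla_context.so | grep NEEDED
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* LM_ID_NEWLM asks the dynamic linker for a brand-new namespace. */
    void *ctx = dlmopen(LM_ID_NEWLM, "./libparla_context.so", RTLD_NOW);
    if (!ctx) {
        fprintf(stderr, "dlmopen failed: %s\n", dlerror());
        return EXIT_FAILURE;
    }

    /* Remember the namespace id so the VEC's other libraries can be
     * loaded into the same namespace later with dlmopen(ns, ...). */
    Lmid_t ns;
    if (dlinfo(ctx, RTLD_DI_LMID, &ns) != 0) {
        fprintf(stderr, "dlinfo failed: %s\n", dlerror());
        return EXIT_FAILURE;
    }
    printf("override library loaded into namespace %ld\n", (long)ns);
    return EXIT_SUCCESS;
}

(Link the loader with -ldl on older glibc. Whether the overrides actually win the symbol lookup inside the namespace is exactly the open question above.)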

All that said, that theory doesn't rule out there also being something wrong with our thread affinity wrappers.

The error happens on most runs; eventually a run works. I think it is a per-thread issue, since with fewer cores the error happens less frequently, while with more cores I might see it one or more times. It also shows up via __tls_get_addr and tls_get_addr_tail, which end up calling into __libc_malloc.
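
That would fit the frames above: in glibc, a thread's first allocation initializes its per-thread tcache and picks (or creates) a malloc arena, which is the tcache_init -> arena_get2 -> get_nprocs chain in the traces, so every extra thread is another trip through that path. A trivial sketch of that shape (nothing Parla-specific, just the generic pattern):

/* Every new thread's first malloc() runs glibc's per-thread setup
 * (tcache_init -> arena_get2 -> get_nprocs, the same frames as in the
 * backtraces above).  More threads means more passes through that path,
 * which would explain seeing the crash more often with more cores. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 24

static void *first_alloc(void *arg) {
    (void)arg;
    void *p = malloc(64);   /* first allocation on this thread */
    free(p);
    return NULL;
}

int main(void) {
    pthread_t tids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, first_alloc, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    puts("done");
    return 0;
}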

This seems really close to what we're seeing: https://sourceware.org/legacy-ml/libc-help/2019-06/msg00026.html

Probably related: #12

@sestephens73 mentioned on Slack that this showed up in the matmul demo as well, with the same backtrace as the one quoted above.

I don't remember the exact conditions needed to reproduce it in that app. @sestephens73, feel free to add more details if you have them.

Here's an alternate theory as to what could cause this: the current VEC is tracked in a thread-local. Spawned threads don't inherit the thread-local values of the thread that spawned them, so a newly created thread's thread-local data is zero-initialized and it may end up resolving thread-affinity-related state against VEC 0, regardless of which VEC spawned it. That could result in some kind of weird failure when shuttling affinity information back and forth.
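
To illustrate the premise (with a stand-in variable, not Parla's actual thread-local):

/* Thread-locals are per-thread and start out zero-initialized; a value
 * set by the spawning thread is NOT visible in the child.  If the
 * "current VEC" is tracked this way, a freshly spawned thread would
 * start out pointing at VEC 0 no matter which VEC spawned it. */
#include <pthread.h>
#include <stdio.h>

static __thread int current_vec = 0;    /* stand-in for the real thread-local */

static void *child(void *arg) {
    (void)arg;
    printf("child sees current_vec = %d\n", current_vec);    /* prints 0 */
    return NULL;
}

int main(void) {
    current_vec = 3;                     /* set in the spawning thread */
    pthread_t t;
    pthread_create(&t, NULL, child, NULL);
    pthread_join(t, NULL);
    printf("parent sees current_vec = %d\n", current_vec);   /* still 3 */
    return 0;
}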

I tried to handle this by hooking into thread creation. But I might have done it wrong, or not hooked in deeply enough.
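
For reference, one common shape for that kind of hook is to interpose pthread_create and wrap the start routine so the child inherits the spawner's value before any user code runs. This is only a sketch of the pattern, not what libparla_context actually does, and parla_current_vec is a made-up name standing in for whatever the real thread-local is:

/* Preload-style pthread_create wrapper: capture the spawning thread's
 * "current VEC" and install it in the child before running the user's
 * start routine.  Build as a shared object (gcc -shared -fPIC ... -ldl)
 * and get it resolved ahead of libc/libpthread, e.g. via LD_PRELOAD. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Defined here only to keep the sketch self-contained; in reality the
 * thread-local would live in libparla_context. */
__thread void *parla_current_vec;

typedef int (*pthread_create_fn)(pthread_t *, const pthread_attr_t *,
                                 void *(*)(void *), void *);

struct trampoline_arg {
    void *(*start_routine)(void *);
    void *arg;
    void *parent_vec;                    /* captured in the spawning thread */
};

static void *trampoline(void *p) {
    struct trampoline_arg a = *(struct trampoline_arg *)p;
    free(p);
    parla_current_vec = a.parent_vec;    /* child inherits the parent's VEC */
    return a.start_routine(a.arg);
}

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg) {
    static pthread_create_fn real_pthread_create;
    if (!real_pthread_create)
        real_pthread_create =
            (pthread_create_fn)dlsym(RTLD_NEXT, "pthread_create");

    struct trampoline_arg *a = malloc(sizeof *a);
    if (!a)
        return EAGAIN;
    a->start_routine = start_routine;
    a->arg = arg;
    a->parent_vec = parla_current_vec;
    return real_pthread_create(thread, attr, trampoline, a);
}

The catch this comment hints at is that a wrapper like this only covers threads created through pthread_create; anything that spawns threads some other way (or a pool that already exists) wouldn't be covered, which could be the "not hooked in deeply enough" part.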