BerkeleyLab/matcha

Single-image crash in ifx libc

Closed this issue · 5 comments

Steps to reproduce the runtime error:

git clone -b use_ifx https://github.com/berkeleylab/matcha
cd matcha
cp templates/fpm.toml-template ./fpm.toml
export FOR_COARRAY_NUM_IMAGES=1
$ fpm run --compiler ifx --flag "-coarray=shared"
 + mkdir -p build/dependencies
Initialized empty Git repository in /storage/users/rouson/tmp/matcha/build/dependencies/assert/.git/
remote: Enumerating objects: 29, done.
remote: Counting objects: 100% (29/29), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 29 (delta 0), reused 16 (delta 0), pack-reused 0
Unpacking objects: 100% (29/29), 13.94 KiB | 528.00 KiB/s, done.
From https://github.com/sourceryinstitute/assert
 * branch            a3065a9dffaedf085fbd262c6bf31b309aa43a4a -> FETCH_HEAD
distribution_m.f90                     done.
input_m.f90                            done.
assert_m.F90                           done.
characterizable_m.f90                  done.
data_partition_m.f90                   done.
input_s.f90                            done.
t_cell_collection_m.f90                done.
assert_s.F90                           done.
intrinsic_array_m.F90                  done.
data_partition_s.F90                   done.
do_concurrent_m.f90                    done.
output_m.f90                           done.
t_cell_collection_s.F90                done.
intrinsic_array_s.F90                  done.
matcha_m.f90                           done.
distribution_s.F90                     done.
do_concurrent_s.f90                    done.
output_s.f90                           done.
matcha_s.F90                           done.
main.F90                               done.
libmatcha.a                            done.
matcha                                 done.
[100%] Project compiled successfully.
[jupiter:2521117:0:2521117] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f6ade4e1498)
==== backtrace (tid:2521117) ====
 0  /lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f6b4c42be74]
 1  /lib/libucs.so.0(+0x3008f) [0x7f6b4c42c08f]
 2  /lib/libucs.so.0(+0x303c4) [0x7f6b4c42c3c4]
 3  /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420) [0x7f6b5104a420]
 4  build/ifx_C4BBCE17D21A365D/app/matcha() [0x40b43f]
 5  build/ifx_C4BBCE17D21A365D/app/matcha() [0x40a308]
 6  build/ifx_C4BBCE17D21A365D/app/matcha() [0x407469]
 7  build/ifx_C4BBCE17D21A365D/app/matcha() [0x40552e]
 8  build/ifx_C4BBCE17D21A365D/app/matcha() [0x40527d]
 9  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f6b50e68083]
10  build/ifx_C4BBCE17D21A365D/app/matcha() [0x40519e]
=================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 2521117 RUNNING AT jupiter
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
<ERROR> Execution failed for object " matcha "
<ERROR>*cmd_run*:stopping due to failed executions
STOP 1

After adding the options "-g -O0 -traceback -coarray" and setting
export I_MPI_FABRICS=shm
the traceback looks like this:

[100%] Project compiled successfully.

  • build/ifx_97488F9229162D1C/app/matcha
forrtl: severe (174): SIGSEGV, segmentation fault occurred
In coarray image 1
Image              PC                Routine            Line        Source
libpthread-2.28.s  00007F5B4D8E8C20  Unknown               Unknown  Unknown
matcha             000000000040E470  do_concurrent_sam          14  do_concurrent_s.f90
matcha             000000000040C0B3  velocities                 57  distribution_s.F90
matcha             0000000000407D5F  matcha                     51  matcha_s.F90
matcha             0000000000405213  main                       18  main.F90
matcha             0000000000404DBD  Unknown               Unknown  Unknown
libc-2.28.so       00007F5B4D330493  __libc_start_main     Unknown  Unknown
matcha             0000000000404CDE  Unknown               Unknown  Unknown

I don't have a good idea of the cause of the fault so far. Investigation is underway.

Thanks for the update, @greenrongreen !

@greenrongreen does the above traceback mean that the failure is at line 14 in do_concurrent_s.f90? If so, the relevant line is an association with an expression that includes an intrinsic function that is an important workhorse for this application:

13      do concurrent(cell = 1:ncells, step = 1:nsteps)
14        associate(k => findloc(speeds(cell,step) >= cumulative_distribution, value=.false., dim=1)-1)
15          sampled_speeds(cell,step) = vel(k)
16        end associate
17      end do

If line 14 above is the issue, there are some obvious workarounds with varying degrees of cost and complexity. The simplest workaround might be to eliminate associate altogether and instead substitute the corresponding expression directly into the array index, which yields:

         sampled_speeds(cell,step) = vel(findloc(speeds(cell,step) >= cumulative_distribution, value=.false., dim=1)-1)
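
For anyone who wants to try the workaround in isolation, here is a minimal standalone sketch. The array names follow the snippet above, but the program name, the array sizes, and all numerical values are made up for illustration and are not taken from matcha. Whether this small case reproduces the ifx crash when written with associate has not been checked.

```fortran
program findloc_without_associate
  ! Hypothetical standalone sketch: names mirror the issue, sizes and values are invented.
  implicit none
  integer, parameter :: ncells = 2, nsteps = 3, nintervals = 4
  double precision :: cumulative_distribution(nintervals+1), vel(nintervals)
  double precision :: speeds(ncells,nsteps), sampled_speeds(ncells,nsteps)
  integer :: cell, step

  ! Assumed inputs: a monotonically increasing cumulative distribution ending at 1
  ! and samples strictly below 1, so findloc always finds a .false. element.
  cumulative_distribution = [0.d0, 0.25d0, 0.5d0, 0.75d0, 1.d0]
  vel = [10.d0, 20.d0, 30.d0, 40.d0]
  speeds = reshape([0.1d0, 0.3d0, 0.6d0, 0.9d0, 0.2d0, 0.7d0], [ncells, nsteps])

  do concurrent(cell = 1:ncells, step = 1:nsteps)
    ! findloc gives the first index where the sample is below the cumulative
    ! distribution; subtracting 1 selects the interval containing the sample.
    ! The expression is used directly as the array index, with no associate block.
    sampled_speeds(cell,step) = &
      vel(findloc(speeds(cell,step) >= cumulative_distribution, value=.false., dim=1) - 1)
  end do

  do cell = 1, ncells
    print '(3f6.1)', sampled_speeds(cell,:)
  end do
end program findloc_without_associate
```

The only change relative to the associate version is that the findloc expression is evaluated inline, so the two forms should be equivalent as far as the Fortran standard is concerned.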

@Dominick99 tried this and it gets us past this error and on to the next error. I'll ask him to post the new error message in a comment here, but I do hope Intel will fix the issues with using associate inside do concurrent! As we've discussed, it's a pretty important feature to me.

On a side note, it would also be great if Intel could remove or increase the hard limit on nesting associate blocks. I have occasionally hit the limit, which, if I recall correctly, is somewhere around 7 nesting levels.

@greenrongreen after replacing the associate statements on line 14 in do_concurrent_s.f90, line 65 in distribution_s.F90, lines 51 and 57 in matcha_s.F90, and line 86 in do_concurrent_s.f90, it appears that I got matcha to run using

./build/run-fpm.sh run --compiler ifx --flag  "-g -O0 -traceback -coarray"

@rouson should I push these changes to the branch 'use_ifx'?

@greenrongreen by removing associate statements, @Dominick99 got our application to compile with ifx and run with and without the GPU-offloading flags inside a virtual machine on his local machine. However, because he's running inside a virtual machine, I suspect that no actual offloading is happening. By contrast, the code crashes when I compile with ifx Version 2023.0.0 Build 20221201 and run on a system at the University of Oregon with quad-socket 24-core Intel Cooper Lake CPUs (96 cores total) and Intel GPUs. Below are the steps to reproduce the problem:

git clone -b declare-ncells-as-c_int https://github.com/berkeleylab/matcha
cd matcha
fpm run --compiler ifx --flag  "-g -O0 -traceback -coarray -fopenmp-target-do-concurrent -fiopenmp -fopenmp-targets=spir64"

which yields

Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with
Libomptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
Libomptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
Libomptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
do_concurrent_s.f90:14:14: Libomptarget fatal error 1: failure of target construct while offloading is mandatory
forrtl: error (76): Abort trap signal
In coarray image 1
Image              PC                Routine            Line        Source             
libpthread-2.31.s  00007F52E9F1B420  Unknown               Unknown  Unknown
libc-2.31.so       00007F52E9D5200B  gsignal               Unknown  Unknown
libc-2.31.so       00007F52E9D31859  abort                 Unknown  Unknown
libomptarget.so    00007F52EA1B79A4  Unknown               Unknown  Unknown
libomptarget.so    00007F52EA1B8E02  Unknown               Unknown  Unknown
libomptarget.so    00007F52EA1B3ADF  __tgt_target_kern     Unknown  Unknown
libomptarget.so    00007F52EA1D0538  __tgt_target_team     Unknown  Unknown
matcha             000000000041404B  do_concurrent_sam          14  do_concurrent_s.f90
matcha             0000000000410F2F  velocities                 57  distribution_s.F90
matcha             000000000040874E  matcha                     52  matcha_s.F90
matcha             00000000004058AD  main                       18  main.F90
matcha             00000000004053FD  Unknown               Unknown  Unknown
libc-2.31.so       00007F52E9D33083  __libc_start_main     Unknown  Unknown
matcha             000000000040531E  Unknown               Unknown  Unknown

On the same machine, dropping the offload flags works fine: fpm run --compiler ifx --flag "-g -O0 -traceback -coarray".