intel/pti-gpu

oneprof crashes when using mpirun + workload that calls make

guoyejun opened this issue · 0 comments

This happens on a single node (the hostfile contains only localhost), and the command line looks like:
oneprof -i -p ~/oneprof_log/ -o ~/oneprof_log/oneprof.log mpirun -n 2 -ppn 2 -hostfile hostfile_mpich python -u pretrain_gpt.py ...

In the Python script pretrain_gpt.py, 'make' is called at https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/data/dataset_utils.py#L82. The function is copied here for convenience:

def compile_helper():
    """Compile helper function at runtime. Make sure this
    is invoked on a single process."""
    import os
    import subprocess
    path = os.path.abspath(os.path.dirname(__file__))
    ret = subprocess.run(['make', '-C', path])
    if ret.returncode != 0:
        print("Making C++ dataset helpers module failed, exiting.")
        import sys
        sys.exit(1)

The command crashes even when 'make' does not invoke the compiler, i.e. when the target (.so file) is already newer than the files it depends on.

It runs successfully if I disable that line so that make is not called.
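
To narrow this down, a smaller reproducer along the following lines might help. It is only a sketch and has not been run under oneprof. It assumes the crash is related to the Python process spawning a child process rather than to anything else in pretrain_gpt.py. The script name repro_child.py and the helper names spawn_child and main are hypothetical. It uses the same subprocess.run call as compile_helper, and with no arguments it falls back to a trivial child command so that 'make' can be ruled in or out.

```python
import subprocess
import sys


def spawn_child(cmd):
    """Run cmd as a child process, the same way compile_helper runs make,
    and return the child's exit code."""
    ret = subprocess.run(cmd)
    return ret.returncode


def main(argv):
    # Hypothetical usage: the child command is taken from the arguments,
    # e.g. ['make', '-C', 'some_dir']. With no arguments, fall back to a
    # trivial child (a Python interpreter that starts and exits at once).
    cmd = argv or [sys.executable, "-c", "pass"]
    code = spawn_child(cmd)
    print(f"child {cmd!r} exited with code {code}")
    return code


if __name__ == "__main__":
    main(sys.argv[1:])
```

Running this in place of pretrain_gpt.py in the same oneprof + mpirun command line, once with no arguments and once with 'make -C' and the directory of the helpers, would show whether any child process triggers the crash or only 'make'.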