openucx/ucx

Fails to build 1.16.0 on ppc64

amckinstry opened this issue · 6 comments

Describe the bug

Hi. Debian UCX maintainer here.

UCX 1.16.0 fails to build on ppc64 / ppc64el:
https://buildd.debian.org/status/fetch.php?pkg=ucx&arch=ppc64el&ver=1.16.0%2Bds-3&stamp=1709458389&raw=0

Steps to Reproduce

Configuration here:

Apologies, submitted too soon.

The problem appears to be that PPC64 lacks the new ucm_bistro_lock_t in 1.16.0:
e.g. from x86_64.h:

/* Patching lock for other flows exclusion */
typedef struct ucm_bistro_lock {
    uint8_t jmp[2]; /* jmp self or immediate next instruction */
} UCS_S_PACKED ucm_bistro_lock_t;

/**
 * Helper functions to improve atomicity of function patching
 */
void ucm_bistro_patch_lock(void *dst);

There is no equivalent for PPC64.
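To make the gap concrete, here is a minimal sketch of what a "jump to self" patching lock encodes on each architecture. It is an illustration only and not the contents of #9726: every `sketch_*` name is hypothetical, and the single-word ppc64 layout is an assumption based on Power instructions being a fixed 4 bytes wide. The two encodings themselves are standard: `EB FE` is the x86 short jump with displacement -2, and `0x48000000` is the Power `b .` instruction (primary opcode 18, zero offset, AA and LK clear).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout: these are NOT the UCX definitions. */

/* x86_64: a 2-byte short jump, mirroring the jmp[2] field quoted above. */
typedef struct sketch_x86_lock {
    uint8_t jmp[2];
} sketch_x86_lock_t;

/* ppc64 (assumed layout): one fixed-width 4-byte instruction word. */
typedef struct sketch_ppc64_lock {
    uint32_t b;
} sketch_ppc64_lock_t;

static_assert(sizeof(sketch_x86_lock_t) == 2, "x86 short jmp is 2 bytes");
static_assert(sizeof(sketch_ppc64_lock_t) == 4, "Power instruction is 4 bytes");

/* EB FE: "jmp rel8" with displacement -2, which lands back on itself. */
static sketch_x86_lock_t sketch_x86_jmp_self(void)
{
    sketch_x86_lock_t lock = {{0xeb, 0xfe}};
    return lock;
}

/* Power I-form branch: opcode 18 in the top 6 bits, LI = 0, AA = 0, LK = 0.
 * A relative branch with zero offset targets its own address ("b ."). */
static sketch_ppc64_lock_t sketch_ppc64_branch_self(void)
{
    sketch_ppc64_lock_t lock = {UINT32_C(18) << 26};
    return lock;
}
```

A real ppc64 implementation would also need the matching ucm_bistro_patch_lock logic and instruction-cache synchronization, which this sketch deliberately leaves out.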

Could you please confirm if #9726 is enough to fix it?

Looks good. Testing with an overnight build.

It could also be interesting to further confirm by checking that ucx_info -d runs properly, if possible.

I don't have login access to our CI/CD machines, but I will add a ucx_info -d test to the pipeline.
It all builds fine. Some existing MPI tests are running at the moment.

OK, the MPI tests run fine, so there is no need to check ucx_info -d then.