pnnl/lamellar-runtime

Failed to build Lamellar with rofi

bingao opened this issue · 21 comments

Hi,

I tried to build Lamellar on a Linux cluster. I first downloaded Lamellar with git clone https://github.com/pnnl/lamellar-runtime, and then built it using cd lamellar-runtime && cargo build --features enable-rofi. I got the following error message:

  --- stderr
  autoreconf: Entering directory `.'
  autoreconf: configure.ac: not using Gettext
  autoreconf: running: aclocal --force -I config
  autoreconf: configure.ac: tracing
  autoreconf: running: libtoolize --copy --force
  autoreconf: running: /usr/bin/autoconf --force
  autoreconf: running: /usr/bin/autoheader --force
  autoreconf: running: automake --add-missing --copy --force-missing
  configure.ac:66: installing 'config/compile'
  configure.ac:17: installing 'config/missing'
  Makefile.am: installing 'config/depcomp'
  autoreconf: Leaving directory `.'
  ln: failed to create hard link 'src/.libs/libfabric.lax/lt1-libfabric_la-osd.o' => 'src/linux/libfabric_la-osd.o': Operation not permitted
  autoreconf: Entering directory `.'
  autoreconf: configure.ac: not using Gettext
  autoreconf: running: aclocal --force --warnings=none -I m4
  autoreconf: configure.ac: tracing
  autoreconf: running: libtoolize --copy --force
  autoreconf: running: /usr/bin/autoconf --force --warnings=none
  autoreconf: running: /usr/bin/autoheader --force --warnings=none
  autoreconf: running: automake --add-missing --copy --force-missing --warnings=none
  configure.ac:5: installing './compile'
  configure.ac:2: installing './missing'
  src/Makefile.am: installing './depcomp'
  autoreconf: Leaving directory `.'
  configure: WARNING: rdma/fabric.h: accepted by the compiler, rejected by the preprocessor!
  configure: WARNING: rdma/fabric.h: proceeding with the compiler's result
  configure: error: 
  thread 'main' panicked at /cluster/home/gaobin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/autotools-0.2.6/src/lib.rs:781:5:

  command did not execute successfully, got: exit status: 1

  build script failed, must exit now
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The failure seems to come from creating a hard link. How can I solve it?

Moreover, I also tried to build rofi-sys manually as:

git clone https://github.com/pnnl/rofi-sys.git
cd rofi-sys
cargo build

It still failed with the error message:

error: failed to run custom build command for `rofisys v0.3.0 (/cluster/home/gaobin/rofi-sys)`

Caused by:
  process didn't exit successfully: `/cluster/home/gaobin/rofi-sys/target/debug/build/rofisys-54d7b9fa7736e779/build-script-build` (exit status: 101)
  --- stdout
  running: cd "/cluster/home/gaobin/rofi-sys/target/debug/build/rofisys-0104480a2efda797/out/ofi_src" && "sh" "-c" "exec \"$0\" \"$@\"" "autoreconf" "-ivf"

  --- stderr
  autoreconf: 'configure.ac' or 'configure.in' is required
  thread 'main' panicked at /cluster/home/gaobin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/autotools-0.2.6/src/lib.rs:781:5:

  command did not execute successfully, got: exit status: 1

  build script failed, must exit now
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I am not sure how this error could happen. Is it related to the error from building Lamellar?

Thanks.

Thanks for the issue!

Having lamellar build the libfabric and rofi C dependencies automatically is still a bit of a work in progress, so I appreciate you submitting this.

Let's first see if you can build rofi directly from the C project, github.com/pnnl/rofi. This will require you to also have a working install of libfabric.

In the meantime, I will try to look a bit more into why it's failing when lamellar is building it.

Thank you @rdfriese I followed your suggestion. I successfully built and installed libfabric in ${HOME}/local. Then I downloaded rofi and tried to configure it using

./configure --prefix=${HOME}/local CPPFLAGS="-I/cluster/home/gaobin/rofi/uthash/include -I/cluster/home/gaobin/local/include" CFLAGS=-O3 LDFLAGS=-L/cluster/home/gaobin/local/lib

The error message I got is:

checking for data segment pointer... not found
configure: error: Could not locate data segment

I read configure.ac, and the error seems to be related to the compiler(s), if I understood correctly. I used gcc (GCC) 13.2.0. Do you know how to solve this problem? Thank you.

Ahh I think for this one, you need to set LD_LIBRARY_PATH to include the directory containing libfabric.so.

I'll make a point to either try to eliminate this need or at least point it out in the README.

(Unfortunately this may actually be unrelated to the lamellar issue)
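
A minimal sketch of that step, assuming libfabric was installed with --prefix=${HOME}/local as described above and that the shared library ended up in the lib subdirectory (some systems use lib64 instead, so adjust as needed):

```shell
# Assumed install prefix, taken from the --prefix used earlier in this thread.
PREFIX="${HOME}/local"

# Prepend the assumed libfabric library directory so that any test programs
# run by ./configure can load the shared library at run time.
export LD_LIBRARY_PATH="${PREFIX}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"
# Then re-run ./configure with the same arguments as before.
```

The `${LD_LIBRARY_PATH:+:...}` form only adds a separating colon when the variable already has a value, which avoids a stray empty entry in the search path.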

Thanks @rdfriese The problem is solved. I am trying to build rofi and lamellar now.

Great!

You may need to set the following environment variables before calling cargo build:

OFI_DIR=<libfabric.so install dir>
ROFI_DIR=<librofi.so install dir>

With respect to your earlier issues, could you share which OS you are running on, along with the version of autotools installed?

Hi @rdfriese The OS I use is Rocky Linux 9.1 (Blue Onyx). I tried to create a hard link manually. The OS does not allow a hard link to a file in a different directory. I do not know why, but that is the reason for my earlier issues when compiling lamellar.
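
For anyone who wants to reproduce that manual check, here is a small sketch. The function name check_cross_dir_hardlink is made up for illustration; it creates a scratch directory under the location you pass in, tries to hard-link a file from one subdirectory into another, and reports the result. It only shows whether the link succeeds there, not why it might fail.

```shell
# Try to hard-link a file from one directory into a sibling directory under
# the given base directory, then clean up the scratch directory.
# Prints "cross-directory hard link: ok" or "cross-directory hard link: failed".
check_cross_dir_hardlink() {
    base="${1:-.}"
    work="$(mktemp -d "${base}/hardlink-test.XXXXXX")" || return 2
    mkdir "${work}/a" "${work}/b"
    : > "${work}/a/file"
    if ln "${work}/a/file" "${work}/b/link" 2>/dev/null; then
        echo "cross-directory hard link: ok"
    else
        echo "cross-directory hard link: failed"
    fi
    rm -rf "${work}"
}

# Example (run it on the filesystem that holds the build directory):
#   check_cross_dir_hardlink "${HOME}"
```

Running it once against the home directory and once against a local directory such as /tmp can indicate whether the restriction is specific to one filesystem.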

Hi @rdfriese After manually building libfabric and rofi, I built Lamellar using cargo build --features enable-rofi -v. I saw that Lamellar itself was being built:

     Running `rustc --crate-name lamellar --edition=2021 src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --diagnostic-width=118 --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 --cfg 'feature="default"' --cfg 'feature="enable-rofi"' --cfg 'feature="libc"' --cfg 'feature="rofisys"' -C metadata=4cc710ac2664dd1b -C extra-filename=-4cc710ac2664dd1b --out-dir /cluster/home/gaobin/lamellar-runtime/target/debug/deps -C incremental=/cluster/home/gaobin/lamellar-runtime/target/debug/incremental -L dependency=/cluster/home/gaobin/lamellar-runtime/target/debug/deps --extern anyhow=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libanyhow-37ad61e3b6f0599b.rmeta --extern async_lock=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libasync_lock-230caa3cc819fc39.rmeta --extern async_recursion=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libasync_recursion-a49f3a61f77d668b.so --extern async_std=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libasync_std-c29d122837105bde.rmeta --extern async_task=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libasync_task-e72f92038ccfc166.rmeta --extern async_trait=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libasync_trait-cddacaf0e828bdc5.so --extern bincode=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libbincode-1305275cf88639ee.rmeta --extern core_affinity=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libcore_affinity-fd8a68f4f890982a.rmeta --extern crossbeam=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libcrossbeam-a4b4254a6bf89a60.rmeta --extern custom_derive=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libcustom_derive-5ff07f976469020f.rmeta --extern enum_dispatch=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libenum_dispatch-e78758f0f20c2eba.so --extern futures=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libfutures-d3c8eba82bb3dcc9.rmeta --extern 
futures_lite=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libfutures_lite-ea9ea145f1b72c4a.rmeta --extern glob=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libglob-3cff88ec4a7e6167.rmeta --extern indexmap=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libindexmap-3af474aa4d3bf132.rmeta --extern inventory=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libinventory-e1da8755b86521ff.rmeta --extern itertools=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libitertools-b9e7245d7e1da720.rmeta --extern lamellar_impl=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/liblamellar_impl-1f17aa0d71e56263.so --extern lazy_static=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/liblazy_static-d0c2dc85fa90a848.rmeta --extern libc=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/liblibc-3ce6d1526e8f1618.rmeta --extern memoffset=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libmemoffset-d07caee0a6e0d778.rmeta --extern newtype_derive=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libnewtype_derive-61bcd3628007834a.rmeta --extern parking_lot=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libparking_lot-139c185740be61ed.rmeta --extern paste=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libpaste-938fb0da6277752f.so --extern pin_project=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libpin_project-a3eb2da3d3fa7248.rmeta --extern pin_weak=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libpin_weak-049ea2e57c00ebd5.rmeta --extern rand=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/librand-2be74aa515316ef5.rmeta --extern rofisys=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/librofisys-b9b3059992534932.rmeta --extern serde=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libserde-f1d325b46856a435.rmeta --extern serde_bytes=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libserde_bytes-024bb01115920269.rmeta --extern 
serde_with=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libserde_with-12dae7ba93a00ae5.rmeta --extern shared_memory=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libshared_memory-be5936f8d28a36ca.rmeta --extern thread_local=/cluster/home/gaobin/lamellar-runtime/target/debug/deps/libthread_local-76f877a889fdcc1f.rmeta -L native=/cluster/home/gaobin/local/lib/lib -L native=/cluster/home/gaobin/local/lib/lib -L native=/usr/lib64 -L native=/usr/lib64/libibverbs`

But it took around 25 minutes to finish. Anyway, I will try to build with --release option and build some examples.


Lamellar does take a while to compile unfortunately, but 30 minutes certainly seems excessive. I suppose it does depend on the underlying hardware: the slowest system I've tested takes around 15 min, and the fastest maybe 7 min.

I don't think the gcc versions should be an issue.

Just to recap, in addition to manually building libfabric and rofi, you have also set OFI_DIR and ROFI_DIR to the relevant paths?

Have you tried compiling without Rofi enabled, so we can narrow down whether rofi/libfabric is the issue?

Yes, I did set OFI_DIR and ROFI_DIR to the relevant paths. Because I installed libfabric and rofi in my home directory, I also needed to add the directory of header files to GCC's include path:

export C_INCLUDE_PATH=${HOME}/local/include/

I have not compiled without ROFI. But I can try and will post the results here.
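
Putting the settings from this exchange together, a sketch of the environment might look like the following. It assumes both libraries were installed under ${HOME}/local. Whether OFI_DIR and ROFI_DIR should name the install prefix or its lib subdirectory is not stated in this thread; the `-L native=/cluster/home/gaobin/local/lib/lib` entries in the rustc command above suggest that the build script appends `lib` itself, but that is only an inference, so the values below are an assumption.

```shell
# Assumed install prefix for both libfabric and rofi (from --prefix above).
PREFIX="${HOME}/local"

# ASSUMPTION: the variables take the install prefix, not the lib/ directory.
export OFI_DIR="${PREFIX}"
export ROFI_DIR="${PREFIX}"

# Let the C compiler find the libfabric and rofi headers.
export C_INCLUDE_PATH="${PREFIX}/include${C_INCLUDE_PATH:+:${C_INCLUDE_PATH}}"

# Then, from the lamellar-runtime checkout:
#   cargo build --features enable-rofi
```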

Hi @rdfriese The Lamellar build did finish. It reported that the build took around 25 minutes. I will try to build with the --release option and build some examples.

Another question: can Lamellar be built on a MacBook? I tried to build on my MacBook without ROFI using cargo build --examples, and the process was killed due to lack of memory. Do you know what the problem might be? Thank you.

I have tested on an M1 MacBook Pro with 16 GB and it compiles, but I haven't tested with anything less than that amount of memory.

I think unfortunately it may have something to do with the linker, as I have seen memory usage spike when everything gets linked into the final executable. I've seen some other folks online mention using a different linker than the system default, but I don't have experience doing that myself.

When building examples I would suggest limiting how many are built in parallel using the cargo build "-j" (--jobs) option, otherwise you will almost assuredly run out of memory, again due to the linking stage.

Thanks @rdfriese My MacBook also has 16 GB, and the CPU is a 2.6 GHz 6-core Intel Core i7. Maybe other programs were using a lot of memory when I built Lamellar. I will try again later.

Hi @rdfriese I did some more tests. First, I built Lamellar on the Linux cluster without ROFI by using cargo build --release. It compiled successfully, so my earlier hard-link problem occurs only when ROFI is enabled. I am not sure whether the problem lies with our cluster, with ROFI, or both.

Secondly, I built Lamellar on my MacBook with cargo build --release --examples --jobs 1. It succeeded as well. So I guess my previous memory problem was due to the missing --jobs 1.

Thirdly, I also tried to build Lamellar on my MacBook with ROFI enabled (just for debugging code) by using cargo build --release --examples --jobs 1 --features enable-rofi. But since there is no verbs provider on my MacBook, the rofi-sys build failed. Does this mean one cannot build Lamellar on a Mac with ROFI enabled?

Thank you.

Thanks for all the tests!

For Rofi on a MacBook, this is currently the expected behavior. The next release of Rofi should relax this constraint and work on systems where verbs are not present by falling back to the libfabric tcp provider.

For the Linux cluster, I'm not sure why it is failing with Rofi enabled. Maybe running cargo with the "-vv" flag would tell us more about where it is getting stuck?
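
One way to capture that output so it can be attached to the issue is sketched below. The helper name show_build_errors and the file name build.log are made up for illustration, and the search pattern is only a heuristic for lines that look like failures.

```shell
# Print the lines of a saved build log that look like failures, prefixed with
# their line numbers. The pattern is a heuristic, not an exhaustive list.
show_build_errors() {
    grep -n -i -E 'error|failed|not permitted|no such file' "$1"
}

# Usage, from the lamellar-runtime checkout:
#   cargo build --release --features enable-rofi -vv > build.log 2>&1
#   show_build_errors build.log
```

Redirecting both stdout and stderr matters here, because the output of the build scripts and the autotools messages are reported on different streams.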

Hi @rdfriese I built Lamellar on the Linux cluster by using cargo build --release --jobs 1 --features enable-rofi -vv. It stopped with the error message in the attached build.log.

There seem to be two errors:

  rpm: symbol lookup error: /lib64/librpmio.so.9: undefined symbol: lzma_stream_encoder_mt, version XZ_5.2
  /usr/bin/pkg-config: line 8: /usr/bin/-pkg-config: No such file or directory

and

  ln: failed to create hard link 'src/.libs/libfabric.lax/lt1-libfabric_la-osd.o' => 'src/linux/libfabric_la-osd.o': Operation not permitted

Both seem to be related to the Linux system. Do you have any clue about them?

I also attached config.log from the directory lamellar-runtime/target/release/build/rofisys-66e7ae24725fc5a2/out/build. I hope it will be helpful.

Thanks for the logs. As best I can tell, libfabric is being built correctly, but when it gets to building rofi, the rofi configure script can't find the libfabric we just built.

Just to verify, could you run 'ls /cluster/home/gaobin/rust/lamellar-runtime/target/release/build/rofisys-66e7ae24725fc5a2/out/lib' so we can see if the libfabric libraries actually got built?

Thanks @rdfriese What I got is:

[gaobin@login-3.SAGA ~]$ ls /cluster/home/gaobin/rust/lamellar-runtime/target/release/build/rofisys-66e7ae24725fc5a2/out/lib
libfabric.a  libfabric.la  pkgconfig
[gaobin@login-3.SAGA ~]$ 

That's a good sign at least, but I wonder if this has to do with trying to use a static vs dynamic library. I'll try to debug a bit more on my end (probably not until tomorrow unfortunately) and get back to you!

Hi @bingao, after being able to look at your logs in a bit more detail, I think the issue is related to libfabric trying to include libxpmem even though Rofi does not utilize it, which was leading to undefined symbols. This was an issue in the libfabric configure script, so I have updated the libfabric dependency and explicitly disabled xpmem (for now at least) in the rofi-sys build script.

To test whether this solves the issue or not, could you please try replacing the rofisys = { version ="0.3", optional = true } line in the lamellar runtime Cargo.toml file with rofisys = {git = "https://github.com/pnnl/rofi-sys.git", branch = "master", optional = true}?
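
If you would rather script the change than edit the file by hand, the sketch below rewrites that line. The function name use_git_rofisys is hypothetical, and it assumes the dependency is declared on a single line that starts with "rofisys" in the manifest you pass in.

```shell
# Rewrite the rofisys dependency line in the given Cargo.toml so that it
# points at the master branch of the rofi-sys repository.
# ASSUMPTION: the dependency is declared on one line starting with "rofisys".
use_git_rofisys() {
    manifest="${1:-Cargo.toml}"
    new='rofisys = { git = "https://github.com/pnnl/rofi-sys.git", branch = "master", optional = true }'
    sed "s|^rofisys[[:space:]]*=.*|${new}|" "${manifest}" > "${manifest}.new" &&
        mv "${manifest}.new" "${manifest}"
}

# Usage, from the lamellar-runtime checkout:
#   use_git_rofisys Cargo.toml
#   cargo build --release --features enable-rofi
```

Writing to a temporary file and then moving it into place avoids relying on sed's in-place flag, whose syntax differs between GNU and BSD sed.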

Thanks, @rdfriese I can confirm it works by using rofisys = {git = "https://github.com/pnnl/rofi-sys.git", branch = "master", optional = true}.