dash-project/dash

hwloc and LIKWID Versions


It would be very cool to have minimum/maximum version numbers for these in the README. For both libraries there can be build problems if the wrong version is used: e.g., the build fails with hwloc@2.0.2 but works with hwloc@1.11.11. If I then add likwid@4.3.2, there are build errors again. I have no other versions of likwid installed, and I would like to know which one will work (newer or older than 4.3.2) before I try to build likwid.
It would also be nice to know how exactly DASH benefits from these (and other) libraries, to judge whether they are worth the effort for a given project.

To be more precise: with likwid@4.3.2 and hwloc@1.11.11 I get

[  2%] Building C object dart-impl/base/CMakeFiles/dart-base.dir/src/internal/domain_locality.c.o
/home/spielix/dash/dart-impl/base/src/hwinfo.c: In function ‘dart_hwinfo’:
/home/spielix/dash/dart-impl/base/src/hwinfo.c:148:27: error: ‘dart_hwinfo_t’ {aka ‘struct <anonymous>’} has no member named ‘num_sockets’; did you mean ‘num_cores’?
       hw.num_numa    = hw.num_sockets;
                           ^~~~~~~~~~~
                           num_cores
/home/spielix/dash/dart-impl/base/src/hwinfo.c:151:53: error: ‘dart_hwinfo_t’ {aka ‘struct <anonymous>’} has no member named ‘num_sockets’; did you mean ‘num_cores’?
       hw.num_cores   = topo->numCoresPerSocket * hw.num_sockets;
                                                     ^~~~~~~~~~~
                                                     num_cores
make[2]: *** [dart-impl/base/CMakeFiles/dart-base.dir/src/hwinfo.c.o] Error 1
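
(For context: the two errors above come from the likwid code path in hwinfo.c still referencing a num_sockets member that no longer exists in dart_hwinfo_t. Below is a minimal sketch of how such a path could derive the same numbers from the LIKWID 4.x topology API alone, keeping the socket count in a local variable. The function name and the one-NUMA-domain-per-socket assumption are illustrative; only topology_init/get_cpuTopology/topology_finalize and the struct fields come from LIKWID.)

    #include <likwid.h>

    /* Sketch only: compute the counts assigned above without a
     * num_sockets member in dart_hwinfo_t. */
    static void counts_from_likwid(int * num_numa, int * num_cores)
    {
      if (topology_init() != 0) {
        return;                          /* leave outputs untouched */
      }
      CpuTopology_t topo = get_cpuTopology();
      int num_sockets    = (int) topo->numSockets;
      *num_numa          = num_sockets;  /* assumption: one NUMA domain per socket */
      *num_cores         = (int) topo->numCoresPerSocket * num_sockets;
      topology_finalize();
    }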

Without likwid but with hwloc@2.0.2 I get

[  3%] Building C object dart-impl/base/CMakeFiles/dart-base.dir/src/internal/domain_locality.c.o
/home/spielix/dash/dart-impl/base/src/hwinfo.c: In function ‘dart_hwinfo’:
/home/spielix/dash/dart-impl/base/src/hwinfo.c:313:33: error: ‘struct hwloc_obj’ has no member named ‘memory’
     hw.system_memory_bytes = obj->memory.total_memory / BYTES_PER_MB;
                                 ^~
/home/spielix/dash/dart-impl/base/src/hwinfo.c:319:33: error: ‘struct hwloc_obj’ has no member named ‘memory’
       hw.numa_memory_bytes = obj->memory.total_memory / BYTES_PER_MB;
                                 ^~
make[2]: *** [dart-impl/base/CMakeFiles/dart-base.dir/src/hwinfo.c.o] Error 1
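
(For context: hwloc 2.0 removed the obj->memory struct that the two lines above rely on; the total is now available as obj->total_memory on the object itself. A minimal sketch of a version-independent accessor, assuming an already initialized and loaded topology; the function name is made up for illustration.)

    #include <hwloc.h>
    #include <stdint.h>

    #define BYTES_PER_MB (1024 * 1024)

    /* Sketch: total system memory in MB for both hwloc 1.x and 2.x. */
    static uint64_t system_memory_mb(hwloc_topology_t topology)
    {
      hwloc_obj_t obj = hwloc_get_root_obj(topology);
    #if HWLOC_API_VERSION >= 0x00020000
      /* hwloc >= 2.0: memory totals moved onto the object */
      return obj->total_memory / BYTES_PER_MB;
    #else
      return obj->memory.total_memory / BYTES_PER_MB;
    #endif
    }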

@Spielix You're right, we should document the minimum requirements. We should also support hwloc v2 eventually. I'll see what I can do.

In the meantime, do you need support for either likwid or hwloc-2?

Like I said, first of all I would like to know what difference it makes to have these. When one knows the library, one may be able to imagine what it could be used for in DASH, but not every conceivable use case may be implemented (yet), etc.
I would like to know the benefits and, if there are any, the drawbacks of building DASH with these third-party libraries enabled.

@Spielix Ahh I see, sorry, I didn't fully grasp your question. You should be able to safely build DASH without these two libraries. They are mainly used to query information about the machine you're running on, e.g., the number of cores and the size of memory. In most cases none of that is crucial for using DASH, and DASH will fall back to Linux APIs to query some of this information if neither Likwid nor hwloc is available. You're safe disabling both Likwid and hwloc entirely and building without them...
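
(A minimal sketch of the kind of Linux fallback meant here, using sysconf(3); whether DASH uses exactly these calls is not shown in this thread, so treat the function names as illustrative.)

    #include <unistd.h>

    /* Core count and physical memory as visible to the OS,
     * no hwloc or Likwid required. */
    static long online_cores(void)
    {
      return sysconf(_SC_NPROCESSORS_ONLN);
    }

    static long long physical_memory_bytes(void)
    {
      return (long long) sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGESIZE);
    }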

When I build using hwloc@1.11.11 and run dash-test-mpi with e.g. -host mynode:4, the tests take a very long time (producing 11,123 lines of output) and ultimately seem to fail:

[    0 ERROR ] [ 1212202943.391 ] locality.c               :617  !!! DART: dart__base__locality__domain_group ! group subdomain .0.0.1.0.1.0.0.0 with invalid parent domain .0.0.1.0.1.0.0.0                 
 [  ERROR   ] [UNIT 0] in [=  0 LOG =]               TestBase.h : 287 | -==- Test case finished at unit 0

When I leave out the :4, everything works fine, just as without hwloc. As my application(s) are quite communication-heavy, I was thinking about having only one process per node instead of one per NUMA node anyway. So it's not the end of the world, but I would still like to know whether this behavior is to be expected.

I would love to get some comment on this. Is this

  • b/c one shouldn't use :number_of_slots,
  • b/c of the hwloc Version,
  • b/c of some configuration issues on my side,
  • or is there a bug in dash-test-mpi or even DASH itself?

It is hard to say what is going on just from the error you posted. Could you give some more information about your platform, your MPI, and which test exactly fails?

Well, you not knowing where it comes from is pretty much enough information for me to just drop hwloc.

In case you want to dig into it further:

  • The single node I ran the test on has 2x Intel Xeon Platinum 8168 @2.70GHz
  • I use openMPI@3.1.5, gcc@8.2.0 and openBLAS@0.3.7.
  • I also tried to run the test in a multi-node setting with more than one slot assigned to each node, but it took so long that I canceled it, b/c I thought it was hitting the same problem and just producing a ton of output.
  • Here you have all 11,123 lines of output:
    slurm-696394.out.txt
  • I run the test using sbatch -w mp-skl2s24c --wrap "`which mpirun` -host mp-skl2s24c:4 -x LD_LIBRARY_PATH ./dash/dash-test-mpi"

It seems that the locality part of the runtime trips over something in your setup. Unfortunately, I do not know enough about that part to quickly figure things out. Here is the relevant code (https://github.com/dash-project/dash/blob/development/dart-impl/base/src/locality.c#L611):

      if (group_subdomain_tag_len <= group_parent_domain_tag_len) {
        /* Indicates invalid parameters, usually caused by multiple units
         * mapped to the same domain to be grouped.
         */
        DART_LOG_ERROR("dart__base__locality__domain_group ! "
                       "group subdomain %s with invalid parent domain %s",
                       group_subdomain_tags[sd], group_parent_domain_tag);

AFAICS, the hwloc part is only really relevant if you plan to split teams based on hardware information (grouping all units on one node into a team, for example). If not, it's safe to ignore hwloc...
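
(To make the "group all units on one node into a team" idea concrete without guessing at DASH's locality API, here is the plain-MPI analogue of such a node-level split; per the comment above, DASH's hwloc-backed grouping provides the same idea at finer scopes such as sockets or NUMA domains. Illustration only, not DASH code.)

    #include <mpi.h>

    /* Plain MPI, not DASH's API: put all ranks that share a node
     * into one communicator. */
    static int node_local_comm(MPI_Comm * node_comm)
    {
      return MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                                 0 /* key */, MPI_INFO_NULL, node_comm);
    }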

Maybe @fuchsto can shed some light on what is going wrong here?

@Spielix You mentioned that the test run takes significantly longer if you place four units on the same node. That is surprising because most of the tests are single-threaded. Can you make sure that the processes are not bound to the same core? Can you try running with --bind-to none passed to MPI?

The amount of output is expected, that's the normal test output.

The --bind-to none doesn't seem to change anything. I tested it again (on a different node): with the error the output is again over 10k lines, while when I don't specify the number of slots I only get 3k lines of output (with all tests passing).

EDIT: There is roughly a factor of two in runtime.

This node has 2x Intel Xeon Silver 4110 @2.10GHz. With the error it takes about 3 minutes, without it about 1.5 minutes.

@Spielix If you don't specify :4 the test runs with a single unit only. Many of the tests require at least 2 units, some more, so naturally the output is significantly smaller. That might actually also explain the longer runtime...

I guess the test that is failing is one of those that are not run with only one slot?

It wouldn't be too surprising if this was a setup issue, as the admins are mostly working on single nodes. We had problems with the MPI setup before. Although I would have thought that this would only show when one uses more than one node...

Can you try to launch one unit per node to see if the issue persists there? (if multi-node runs are part of your use-case)

I can, but the only type I have 4 nodes of is knl, so the single cores are very slow (the network is slow, too). When I tried it, I got a seemingly different error. The problem is that when I try the test with the non-hwloc build, it runs for over 8 hours on the 4 nodes, and that is the maximum amount of time I can use. As it didn't take long for the hwloc version to error out on the 4 nodes, I would still guess that the error does not happen without hwloc. You can confirm this in the output if you want:
slurm-696581_hwlocerror_4nodes.out.txt
slurm-696625_4nodes_timeout.out.txt

As you can see in the new issue, I have found out what stopped the non-hwloc run from working (I thought that it couldn't be that slow / that much to test): I actually didn't use the right branch. With the development branch the code works fine on the four nodes without hwloc, meaning that the error appearing in "slurm-696581_hwlocerror_4nodes.out.txt" is also due to hwloc. I don't know if or how it is related to the case with several units on one node, though.