open-mpi/hwloc

I/O device support

Closed this issue · 12 comments

Updated TODO-list:

  • Add iterators to find GPUs, NICs, ...
  • Update documentation
  • Find a PCI lib for MacOSX (neither pciutils nor pciaccess seems available, and pciaccess doesn't expose the hierarchy of bridges anyway)
  • Add some hwloc_insert_object_by_pcisomething, e.g. for a CUDA plugin which provides extended information to the object (e.g. number of streaming processors, etc.), which the core merges with the objects created by the libpci module.
    • provide functions like:

hwloc_obj_t hwloc_get_path_obj(hwloc_topology_t topo, const char *path);
hwloc_obj_t hwloc_get_fd_obj(hwloc_topology_t topo, int fd);

(the latter may return a network device or a disk device, depending on whether it's a socket or a file. Mmm and how about nfs-mounted files!)

Imported from trac issue 5. Created by bgoglin on 2009-09-24T01:01:40, last modified: 2011-04-05T17:17:38

Trac comment by bgoglin on 2009-09-25 16:50:04:

Using libpci to scan all devices is fairly easy actually. We can build the busid string from there, and then read /sys/bus/pci/devices/%s/{local_cpus,numa_node}. I have some code to fill the following kind of structure:
{{{
struct hwloc_iodev {
  char busid; /* ::. */
  char *name; /* obtain from pci.ids, or NULL */
  unsigned short vendor_id, device_id, device_class;
  hwloc_cpuset_t local_cpus; /* mask of procs nearby */
  unsigned numanode_osindex; /* numa node nearby */
};
}}}
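
For illustration, the sysfs lookup described in this comment could be sketched roughly as follows. This is only a sketch under assumptions: the helper names (`format_busid`, `format_sysfs_attr_path`, `read_int_file`) are hypothetical and not part of hwloc, and the hexadecimal `domain:bus:device.function` layout is assumed to match the directory names under /sys/bus/pci/devices/.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a PCI bus ID as "dddd:bb:dd.f" (hex digits).
 * Returns 0 on success, -1 if the buffer is too small. */
int format_busid(char *buf, size_t len, unsigned domain, unsigned bus,
                 unsigned dev, unsigned func)
{
    int n;
    assert(buf != NULL);
    n = snprintf(buf, len, "%04x:%02x:%02x.%x", domain, bus, dev, func);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}

/* Hypothetical helper: build /sys/bus/pci/devices/<busid>/<attr>,
 * where attr would be "local_cpus" or "numa_node".
 * Returns 0 on success, -1 on an empty bus ID or a too-small buffer. */
int format_sysfs_attr_path(char *buf, size_t len, const char *busid,
                           const char *attr)
{
    int n;
    assert(buf != NULL && busid != NULL && attr != NULL);
    if (strlen(busid) == 0)
        return -1;
    n = snprintf(buf, len, "/sys/bus/pci/devices/%s/%s", busid, attr);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}

/* Read a single decimal integer (e.g. the content of numa_node) from a file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
int read_int_file(const char *path, int *value)
{
    FILE *f;
    int ok;
    assert(path != NULL && value != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%d", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

A caller would format the bus ID from the values libpci reports, build the path for `numa_node`, and treat a failed read as "no locality information available". Parsing `local_cpus` into a cpuset is left out here because it depends on the mask format the kernel emits.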

numanode_osindex is probably useless, but I need to check that the kernel never sets numa_node without setting local_cpus properly. We could group all objects that have the same local_cpus/numanode_osindex, create doubly-linked lists of them, and attach the head of each list to the lowest hwloc_obj that covers local_cpus. So each hwloc_obj_t will have two new fields:
{{{
hwloc_iodev_t *first_iodev, *last_iodev;
}}}
(ABI change: I wonder if we should reserve 16 bytes now to prepare for the future.) Each hwloc_iodev would also have a "rank" within this list and a pointer to its "parent" hwloc_obj.
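
The list handling proposed here could look something like the sketch below. The `sketch_obj` and `sketch_iodev` types are hypothetical stand-ins, not hwloc's real structures; only the `first_iodev`/`last_iodev` fields, the rank, and the parent pointer come from the proposal, and the `prev`/`next` links are an assumption about how the doubly-linked list would be stored.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_obj;

/* Hypothetical stand-in for the proposed hwloc_iodev. */
struct sketch_iodev {
    const char *name;
    struct sketch_iodev *prev, *next; /* doubly-linked list links (assumed) */
    struct sketch_obj *parent;        /* object the device is attached to */
    unsigned rank;                    /* position within the parent's list */
};

/* Hypothetical stand-in for hwloc_obj with the two proposed fields. */
struct sketch_obj {
    struct sketch_iodev *first_iodev, *last_iodev;
};

/* Append dev to the tail of obj's device list, setting its rank and parent. */
void sketch_obj_append_iodev(struct sketch_obj *obj, struct sketch_iodev *dev)
{
    assert(obj != NULL && dev != NULL);
    dev->parent = obj;
    dev->next = NULL;
    dev->prev = obj->last_iodev;
    dev->rank = obj->last_iodev ? obj->last_iodev->rank + 1 : 0;
    if (obj->last_iodev != NULL)
        obj->last_iodev->next = dev;
    else
        obj->first_iodev = dev;
    obj->last_iodev = dev;
}
```

Keeping both a head and a tail pointer makes appending constant-time, which is presumably why the proposal adds two pointers (16 bytes on a 64-bit ABI) rather than one.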

We only gather the linear list of objects; we don't gather the actual PCI hierarchy (I don't think we care about it). We could filter on the device class to only keep GPUs, NICs, and other HPC-related stuff so that lstopo remains interesting.

Once the hwloc core has this, we can add some specific helpers such as "open an IB NIC near this cpuset" or "tell me the cpuset near this ibv_device that I just opened" (not hard to implement). We'll need the same info for CUDA, but we still haven't had feedback from their developers about this.

Trac comment by bgoglin on 2009-09-28 01:48:07:

What's a properly set cpuset for a device? Do you want to add fake OS numbers to each device when discovering them ? (note for the implementers: modify cpuset of all objects covering this device when adding this fake OS number).

Also, on which level would these objects be stored? Are we breaking the rule that currently puts only objects with same type on the same level? Or do we add a new depth just for pci devices (and one for pci devices) ? (note to implementers: we'd have to make sure we don't put those below PROC)
And I guess we'll need a way to return "not comparable" from hwloc_compare_types().

By the way, GPUs will be inside socket in the near future, it's not only about machines and NUMA nodes :)

Trac comment by sthibaul on 2009-09-28 13:05:13:

> What's a properly set cpuset for a device? Do you want to add fake OS
> numbers to each device when discovering them ?

No, I meant cpu_set being the set of CPUs near the device.

> Also, on which level would these objects be stored? Are we breaking the
> rule that currently puts only objects with same type on the same level?

Well, I've never assumed this in my code actually :)

> And I guess we'll need a way to return "not comparable" from
> hwloc_compare_types().

Yes.

> By the way, GPUs will be inside socket in the near future, it's not only
> about machines and NUMA nodes :)

Right, and it's all the more interesting to be able to express that,
i.e. yes, have not only cache/cores objects in sockets.

Trac comment by bgoglin on 2009-09-28 13:22:14:

> > Also, on which level would these objects be stored? Are we breaking the
> > rule that currently puts only objects with same type on the same level?
>
> Well, I've never assumed this in my code actually :)

I think we should have such a rule, otherwise things may become a huge mess if we ever break it.

And I think we should also document all such rules somewhere, for instance together with the ones about PROC being at the bottom, SYSTEM being at the top, cpusets not intersecting between children, cpusets possibly being empty IIRC (for empty NUMA nodes and devices?), ... Maybe put all this near hwloc_topology_check() and complete what this function actually checks.

> > And I guess we'll need a way to return "not comparable" from
> > hwloc_compare_types().
>
> Yes

We need to change hwloc_compare_types() as soon as possible then, it is supposed to return <0, 0 or >0 only for now. Otherwise we'll break the ABI when adding devices in post-1.0.

Trac comment by sthibaul on 2009-09-28 13:33:05:

Oops, reading again:

> Are we breaking the rule that currently puts only objects with same type on the same level?

No, I don't mean that. I mean another level for PCI buses, another
level for GPUs, another level for network boards, etc., but without any
strict inclusion order with respect to the levels enclosing CPUs, i.e. a PCI bus
could sit alongside sockets in a machine, or alongside NUMA nodes in a machine.
A GPU could sit alongside caches+cores in a socket.

> And I think we should also document all such rules

Yes.

> We need to change hwloc_compare_types() as soon as possible then, it is supposed to return <0, 0 or >0 only for now. Otherwise we'll break the ABI when adding devices in post-1.0.

We can use INT_MAX, INT_MAX-1, etc. as special values (#defined to some HWLOC macro of course).
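
A minimal sketch of that idea, assuming a sentinel taken from `<limits.h>`: the `SKETCH_*` names, the toy type list, and the sign convention (negative when the first type usually contains the second) are all hypothetical choices for this example and are not hwloc's actual definitions.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical sentinel meaning "these two types cannot be ordered". */
#define SKETCH_TYPE_UNORDERED INT_MAX

/* Toy type list: CPU-side types are ordered from outermost to innermost,
 * followed by I/O types that have no fixed place in that order. */
enum sketch_type {
    SKETCH_MACHINE,
    SKETCH_NODE,
    SKETCH_SOCKET,
    SKETCH_CORE,
    SKETCH_PROC,
    SKETCH_PCI_DEVICE,
    SKETCH_GPU
};

static int sketch_is_io(enum sketch_type t)
{
    return (int)t >= (int)SKETCH_PCI_DEVICE;
}

/* Returns 0 for identical types, a negative value if a usually contains b,
 * a positive value if b usually contains a, and SKETCH_TYPE_UNORDERED when
 * at least one of two different types is an I/O type. */
int sketch_compare_types(enum sketch_type a, enum sketch_type b)
{
    assert((int)a >= 0 && (int)a <= (int)SKETCH_GPU);
    assert((int)b >= 0 && (int)b <= (int)SKETCH_GPU);
    if (a == b)
        return 0;
    if (sketch_is_io(a) || sketch_is_io(b))
        return SKETCH_TYPE_UNORDERED;
    return (int)a - (int)b;
}
```

One consequence worth noting: since the sentinel is itself positive, callers would have to test for it before testing the sign of the result.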

Trac comment by bgoglin on 2009-09-29 01:41:12:

Actually, maybe hwloc_compare_types() could also return 0 for non-comparable types. People would be advised to check whether the type values are indeed equal when hwloc_compare_types() returns 0.

Trac comment by sthibaul on 2009-10-22 11:40:47:

OS devices (e.g. eth0, ide0, hda, sda, etc.) should probably be yet other kinds of
objects: a RAID PCI device may have several disks, an Ethernet board
may have several net devices, etc. This can be seen e.g. in

/sys/bus/pci/devices//ide0//block/*
/sys/bus/pci/devices//net/
/sys/bus/pci/devices//host/target_//block/*
/sys/bus/pci/devices//drm/_

(could use glob() to get these)
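
As a rough sketch of the glob() approach: the helpers below are hypothetical, and the pattern layout (a PCI bus ID directory followed by a subdirectory such as `net`) is an assumption about where the wildcards sit in the paths above, not something taken from hwloc.

```c
#include <assert.h>
#include <glob.h>
#include <stdio.h>

/* Hypothetical helper: build a pattern matching every entry of one
 * subdirectory (e.g. "net") of a PCI device's sysfs directory.
 * Returns 0 on success, -1 if the buffer is too small. */
int format_osdev_pattern(char *buf, size_t len, const char *busid,
                         const char *subdir)
{
    int n;
    assert(buf != NULL && busid != NULL && subdir != NULL);
    n = snprintf(buf, len, "/sys/bus/pci/devices/%s/%s/*", busid, subdir);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}

/* Print every path matching pattern.
 * Returns the number of matches, 0 if there are none, -1 on a glob error. */
int list_glob_matches(const char *pattern)
{
    glob_t g;
    size_t i;
    int rc, count;
    assert(pattern != NULL);
    rc = glob(pattern, 0, NULL, &g);
    if (rc == 0) {
        for (i = 0; i < g.gl_pathc; i++)
            puts(g.gl_pathv[i]);
        count = (int)g.gl_pathc;
    } else {
        count = (rc == GLOB_NOMATCH) ? 0 : -1;
    }
    globfree(&g);
    return count;
}
```

Each printed path's last component would then be the OS device name (a network interface, a block device, and so on) to attach below the PCI device object.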

Trac comment by sthibaul on 2009-10-29 14:43:10:

We could also provide functions like:

hwloc_obj_t hwloc_get_path_obj(hwloc_topology_t topo, const char *path);
hwloc_obj_t hwloc_get_fd_obj(hwloc_topology_t topo, int fd);

(the latter may return a network device or a disk device, depending on
whether it's a socket or a file. Mmm and how about nfs-mounted files!)
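
The socket-versus-file distinction could be made with fstat(), as in the sketch below. Everything here is hypothetical (`classify_fd`, the `fd_kind` enum); it only shows the first step that something like hwloc_get_fd_obj() would need, not the mapping from the result to a topology object.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical classification of a file descriptor. */
enum fd_kind {
    FD_KIND_ERROR = -1,
    FD_KIND_SOCKET, /* candidate for a network device lookup */
    FD_KIND_FILE,   /* candidate for a disk device lookup */
    FD_KIND_OTHER   /* pipe, tty, character device, ... */
};

/* Classify fd with fstat(). For regular files, *dev (if non-NULL) receives
 * the device number of the filesystem holding the file. */
enum fd_kind classify_fd(int fd, dev_t *dev)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return FD_KIND_ERROR;
    if (S_ISSOCK(st.st_mode))
        return FD_KIND_SOCKET;
    if (S_ISREG(st.st_mode)) {
        if (dev != NULL)
            *dev = st.st_dev;
        return FD_KIND_FILE;
    }
    return FD_KIND_OTHER;
}

/* Human-readable name for a classification result. */
const char *fd_kind_name(enum fd_kind kind)
{
    switch (kind) {
    case FD_KIND_SOCKET: return "socket";
    case FD_KIND_FILE:   return "regular file";
    case FD_KIND_OTHER:  return "other";
    case FD_KIND_ERROR:  return "error";
    }
    assert(!"unknown fd_kind");
    return "unknown";
}

/* Print what kind of descriptor fd is. */
void print_fd_kind(int fd)
{
    printf("fd %d: %s\n", fd, fd_kind_name(classify_fd(fd, NULL)));
}
```

For the NFS case raised above, the device number of a regular file would presumably not correspond to any local disk, so a real implementation would need an extra check (and possibly a fallback to the network device) before returning an object.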

Trac comment by bgoglin on 2011-03-29 13:33:19:

big TODO update

Trac comment by bgoglin on 2011-04-05 17:17:08:

Fixed in r3381.

Trac comment by bgoglin on 2011-04-05 17:17:38:

move to v1.3