Interface to query CPU setup, numa information and set affinity
ctk21 opened this issue · 5 comments
It might be useful (at least experimentally) to add an interface that allows the user to easily:
- query the number of CPUs in a system
- query the NUMA information in a system
- set the affinity of a domain to a given CPU
Open questions for this are:
- what exactly should the API be?
- what cross platform Linux vs Windows vs Other considerations should be taken on board?
I'd like to take this up. I wrote the CPU topology code used in OpenBSD ages ago, so hopefully I can contribute something.
@dra27 and @kayceesrk expressed interest in this, so I'm CCing them.
What follows is just a collection of thoughts, so take it all with a grain of salt, please :).
Topology
At the most basic level, we could provide an SMT (thread) count, a core count and a package (socket) count, along with the relationships between them.
A more elaborate approach would have to consider core asymmetry (as in Intel 12th gen and modern arm64), as well as cache groups like AMD's CCX. Maybe num_performance_cores and num_efficient_cores can become a thing. Online vs offline CPUs need some investigation; I'm not sure the distinction is actually meaningful anywhere.
Domainslib could provide this, but I don't think it belongs in the standard library.
The standard library could provide num_threads at most, as anything else would impose too many requirements on the underlying system. num_threads can usually be retrieved with a sysconf and/or sysctl, and it's widely available.
To build the topology on x86 we need to be able to at least retrieve the apicid of each SMT thread. If that's possible, we can at least build the smt<>core<>package relationship without having to resort to more operating system support like sysctl/sysconf and whatnot. That would be a more democratic approach: we can loop over all threads, set the affinity to each one in turn, call cpuid to retrieve the current apicid, and then build the relationship tree.
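A minimal sketch of the last step, assuming the (cpu index, apicid) pairs have already been collected by pinning to each CPU in turn. The names decode and siblings are made up for illustration, and smt_bits / core_bits are assumed inputs: on real hardware the field widths would have to be read from CPUID rather than hard-coded.

```ocaml
(* Position of one logical CPU inside the package/core/SMT hierarchy. *)
type location = { package : int; core : int; smt : int }

(* Split an APIC ID into its fields. [smt_bits] and [core_bits] are
   assumptions supplied by the caller; real code would query CPUID. *)
let decode ~smt_bits ~core_bits apicid =
  let mask bits = (1 lsl bits) - 1 in
  { smt = apicid land mask smt_bits;
    core = (apicid lsr smt_bits) land mask core_bits;
    package = apicid lsr (smt_bits + core_bits) }

(* Group logical CPUs by (package, core). Each entry lists the OS CPU
   indices that are SMT siblings on that core. *)
let siblings ~smt_bits ~core_bits (apicids : (int * int) list) =
  let tbl = Hashtbl.create 16 in
  List.iter
    (fun (cpu, apicid) ->
      let l = decode ~smt_bits ~core_bits apicid in
      let key = (l.package, l.core) in
      let prev = Option.value (Hashtbl.find_opt tbl key) ~default:[] in
      Hashtbl.replace tbl key (cpu :: prev))
    apicids;
  Hashtbl.fold (fun key cpus acc -> (key, List.sort compare cpus) :: acc) tbl []
  |> List.sort compare
```

The result is a flat list keyed by (package, core), which is enough to rebuild the tree; asymmetric cores or cache groups would need extra fields.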
We can't assume x86 or POSIX, and we can't cover every operating system and architecture, so we might consider making Domainslib able to fail on any of the queries. I think this is better than silently returning single core.
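To make "able to fail" concrete, here is a rough sketch of what a fallible query could look like. Everything here is hypothetical (the error, topology and backend types, and the function names); it only shows the shape of returning a result instead of a made-up single-core answer.

```ocaml
type error =
  | Unsupported of string  (* the platform gives no way to answer the query *)
  | Os_error of string     (* the query exists but failed at runtime *)

type topology = { packages : int; cores : int; cpus : int }

(* Hypothetical backend record: a real library would pick one per platform. *)
type backend = { query_topology : unit -> (topology, error) result }

(* Backend for a platform where topology cannot be retrieved. *)
let no_backend name =
  { query_topology = (fun () -> Error (Unsupported name)) }

(* The caller is forced to handle the failure case explicitly. *)
let describe backend =
  match backend.query_topology () with
  | Ok t ->
    Printf.sprintf "%d packages, %d cores, %d cpus" t.packages t.cores t.cpus
  | Error (Unsupported os) -> "topology not available on " ^ os
  | Error (Os_error msg) -> "topology query failed: " ^ msg
```

With this shape, a caller that really wants a single-core fallback can still choose it, but it becomes a visible decision rather than a silent default.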
I haven't dived into Windows, but I'm sure it provides this information somewhere. Here is what I know off the top of my head:
Topology retrieval, off the top of my head
- Linux: Yes
- OpenBSD: dmesg parsing
- FreeBSD: Full, xml tree via sysctl (I kid you not)
- NetBSD: No
Except on OpenBSD, we can also just build our own by pinning to each cpu in turn and fetching its apicid (on x86).
Nomenclature
Naming this is kind of tricky: people use threads, cores and cpus with different meanings, and a lot of the jargon is inherited from x86, like SMT (maybe sparc!?) and packages. At any rate, this must be considered.
I personally like the idea of calling cpu the logical thing, as in the actual thread, but then thread becomes a synonym for cpu, which is bad. Also, cpu is more often than not used to mean processor, which in turn is way less ambiguous.
NUMA
I don't think any operating system gives a userland process any active toggles other than CPU affinity, so memory zones (as in the zones the ACPI table gives you) can't be pinned directly. Usually what they do (at least Linux/FreeBSD/NetBSD) is try to map pages belonging to a memory zone of the affinity cpu, so if you say set_affinity(cpu1) it will try to map the pages "closer" to the die containing cpu1.
edit: I'm completely wrong here, linux has set_mempolicy(2), mbind(2) and more.
Affinity
Domainslib could provide something like Cpuset.t -> ('a, 'e) result. I think we need to be able to fail since not every operating system provides affinity control, though some POSIX systems just fail silently on pthread_setaffinity_np(3). At any rate, we should tell the caller that "we can't set the affinity" if possible.
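A rough sketch of that signature, assuming Cpuset is just a set of CPU indices. The pin argument is a stand-in for the platform call (hypothetical, not a real binding), and ncpus is assumed to be known from a separate query; the error constructors are made up for illustration.

```ocaml
module Cpuset = Set.Make (Int)

type affinity_error =
  | Empty_set
  | Out_of_range of int  (* a CPU index the system does not have *)
  | Not_supported        (* the platform cannot pin at all *)

(* [pin] stands in for the platform call (hypothetical). It returns false
   when the platform has no working affinity support. *)
let set_affinity ~ncpus ~pin set =
  if Cpuset.is_empty set then Error Empty_set
  else if Cpuset.min_elt set < 0 then Error (Out_of_range (Cpuset.min_elt set))
  else if Cpuset.max_elt set >= ncpus then Error (Out_of_range (Cpuset.max_elt set))
  else if pin (Cpuset.elements set) then Ok ()
  else Error Not_supported
```

On a platform that fails silently, pin itself cannot report the problem, so a backend there would probably have to return false unconditionally rather than trust the system call.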
Set affinity support, off the top of my head
- Linux: Full
- OpenBSD: Fails silently, nothing is done at kernel level
- FreeBSD: Full
- NetBSD: Full
So I'm building this: https://github.com/haesbaert/ocaml-cpu
So far it only supports getting the number of threads and setting and getting cpu affinity. It works without multicore/Domains and only on Linux, but I'll work on the rest.
num_threads: unit -> int
set_affinity: int list -> unit
get_affinity: unit -> int list
Thanks @haesbaert. In general, we're avoiding the term "threads" in OCaml 5 since we have multiple notions of threading -- domains, fibers and systhreads. I'm leaning towards num_cores, where I'm using "cores" as a proxy for available units of parallelism. If hyperthreading is enabled, does num_cores return the number of hardware threads or physical cores?
Before we expose set_affinity and get_affinity, do we know that it is useful in practice for programs that use domainslib today? We have sandmark nightly benchmarking runs on the 64 core, 128 thread navajo machine: https://sandmark.tarides.com. We can experiment with affinity there. Also, the API of *et_affinity may need to operate on the pool abstraction.
If we avoid the term "threads" I think "cpus" should be used to refer to "logical cpus/threads". I believe most people associate "num_cores" with an actual core (as in the parent of threads).
As discussed on Slack, most OSes usually just give us a "get_cpu_count" which returns the total number of logical cpus (aka threads), so if hyperthreading is enabled it will return twice the number of cores; if disabled, threads == cores.
From now on I'll refer to "cpus" as in: total logical cpus available (total number of threads).
Retrieving anything more than "total number of cpus" is very OS- and machine-dependent (MD). On Linux it would involve parsing /proc or making a trip to each cpu and implementing the CPUID dance ourselves. Parsing /proc is ugly and won't work in chroot environments. Doing the CPUID dance ourselves is a bit tricky, first because it's completely MD, second because the CPUID leaves tend to change, sometimes even between Intel and AMD.
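For a sense of what the /proc route involves, here is a small sketch that counts logical cpus and physical cores from the text of /proc/cpuinfo. It assumes the x86 layout, where each block has "processor", "physical id" and "core id" lines in that order; other architectures omit some of these fields. The function names are made up, and the text is passed in as a string so the caller decides how (and whether) /proc can be read.

```ocaml
(* Split "key\t: value" into a trimmed (key, value) pair. *)
let field line =
  match String.index_opt line ':' with
  | None -> None
  | Some i ->
    let key = String.trim (String.sub line 0 i) in
    let value =
      String.trim (String.sub line (i + 1) (String.length line - i - 1))
    in
    Some (key, value)

(* Returns (logical cpus, distinct (physical id, core id) pairs).
   Assumes "physical id" appears before "core id" within each block. *)
let count_cpus_and_cores (text : string) =
  let lines = String.split_on_char '\n' text in
  let cpus = ref 0 and pkg = ref "" and cores = Hashtbl.create 16 in
  List.iter
    (fun line ->
      match field line with
      | Some ("processor", _) -> incr cpus
      | Some ("physical id", v) -> pkg := v
      | Some ("core id", v) -> Hashtbl.replace cores (!pkg, v) ()
      | _ -> ())
    lines;
  (!cpus, Hashtbl.length cores)
```

When the core fields are missing, the second component comes back as 0, which a caller should treat as "unknown" rather than as a real count.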
> Before we expose set_affinity and get_affinity, do we know that it is useful in practice for programs that use domainslib today? We have sandmark nightly benchmarking runs on the 64 core, 128 thread navajo machine: https://sandmark.tarides.com. We can experiment with affinity there. Also, the API of *et_affinity may need to operate on the pool abstraction.
I'm not sure; it probably depends a lot on the OS and socket configuration. I assume Linux does a decent job of keeping the pthreads on cpus of the same socket.
There is a social aspect to this discussion: I think no project out there is exposing much more than "number of cpus". I guess the affinity/pinning, when relevant, is done outside via taskset and similar. If you're reaching the point where you're using affinity, you probably know your architecture well enough to reason about it (as in: it's your job to read /proc). That would be the case for https://sandmark.tarides.com, where we know beforehand where each cpu/core/socket is.
I have released https://github.com/haesbaert/ocaml-processor which hopefully addresses this.