NordicHPC/sonar

Include NUMA information


From a discussion with Marcin about Sonar on Betzy:

Include individual core utilization (high/low) as mask in ps data: NAICNO/Jobanalyzer#508
Include NUMA architecture info in sysinfo data: NAICNO/Jobanalyzer#509

The utility of the data is to be able to look for wildly unbalanced jobs automatically.

/sys/bus/node/devices/nodeN/cpuM/topology has a lot of useful information about the NUMA topology of the system: core_siblings / core_siblings_list list the CPUs in the same physical package, and core_cpus_list the hyperthread siblings within the same core - info of the type we get with numactl -H. /sys/bus/node/devices/nodeN/numastat shows NUMA statistics, and .../distance has the distance map. .../nodeN/cpumap and .../nodeN/cpulist have compact information about some of that.
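The compact files are trivial to consume. A sketch (not Sonar code; `parse_cpulist` is a hypothetical helper) of handling the cpulist range format:

```rust
// Hypothetical helper, not part of Sonar: parse the compact "cpulist"
// format used by .../nodeN/cpulist, e.g. "0-3,8-11" -> [0,1,2,3,8,9,10,11].
fn parse_cpulist(s: &str) -> Vec<u32> {
    let mut cpus = Vec::new();
    for part in s.trim().split(',') {
        if part.is_empty() {
            continue;
        }
        match part.split_once('-') {
            Some((lo, hi)) => {
                let (lo, hi): (u32, u32) = (lo.parse().unwrap(), hi.parse().unwrap());
                cpus.extend(lo..=hi);
            }
            None => cpus.push(part.parse().unwrap()),
        }
    }
    cpus
}

fn main() {
    // On a live system we would read the file first, something like:
    //   std::fs::read_to_string("/sys/bus/node/devices/node0/cpulist")
    println!("{:?}", parse_cpulist("0-3,8-11"));
}
```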

So that's all fine (or we could run numactl -H).

For the individual core utilization, htop does it and mpstat supposedly does it (not installed on fox). mpstat source code here: https://github.com/sysstat/sysstat/blob/master/mpstat.c.

/proc/stat has all the info we need for per-core utilization. The question is, what information do we send? I am mostly interested in "how busy has the cpu been the last step", which could be a two-bit log-ish value: 0-12.5%, 12.5%-25%, 25%-50%, 50%-100%. The problem is how to extract that from /proc/stat. The values in that file are ticks since boot for each core but there is no history. We can send all the data to the server which can then reconstruct the history, but oy! what a lot of data. Even if we scale it to seconds it'll be six digits per core, 128 cores yields 900 bytes incl separators - just for a single value per core. Not the end of the world but a large amount more than we have now. We may be able to delta-code off some base, that might help. At four bits per digit we should also not use ASCII but maybe base64 or some custom compression table. It seems maybe plausible to get it down to 300 bytes per record. Still a lot. Maybe only do it under a flag.
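To make the two-bit idea concrete, here is a sketch (function names are made up; field layout assumed to be as documented in proc(5)) of turning two /proc/stat samples into one bucket per core:

```rust
// Sketch, assuming the proc(5) field order:
//   cpuN user nice system idle iowait irq softirq steal guest guest_nice
// Returns (busy, total) tick counts for one "cpuN" line.
fn parse_cpu_line(line: &str) -> (u64, u64) {
    let ticks: Vec<u64> = line
        .split_whitespace()
        .skip(1) // skip the "cpuN" tag
        .map(|f| f.parse().unwrap_or(0))
        .collect();
    let total: u64 = ticks.iter().sum();
    // idle + iowait counts as not busy
    let idle = ticks.get(3).copied().unwrap_or(0) + ticks.get(4).copied().unwrap_or(0);
    (total - idle, total)
}

// Two-bit buckets: 0 = 0-12.5%, 1 = 12.5-25%, 2 = 25-50%, 3 = 50-100%,
// computed over the deltas between two consecutive samples.
fn bucket(prev: (u64, u64), cur: (u64, u64)) -> u8 {
    let (busy, total) = (cur.0 - prev.0, cur.1 - prev.1);
    if total == 0 {
        return 0;
    }
    let frac = busy as f64 / total as f64;
    if frac < 0.125 { 0 } else if frac < 0.25 { 1 } else if frac < 0.5 { 2 } else { 3 }
}

fn main() {
    let prev = parse_cpu_line("cpu0 100 0 50 800 50 0 0 0 0 0");
    let cur = parse_cpu_line("cpu0 400 0 150 1350 100 0 0 0 0 0");
    println!("bucket = {}", bucket(prev, cur));
}
```

Note that this requires two samples (or the previous sample cached somewhere), which is exactly the "no history" problem: a single sonar run only sees the since-boot counters.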

Maybe there is a lesson here. At the moment, the sonar output is one line per process, with a fair amount of redundant information: version, timestamp, host - amounting to 53 bytes per record. The cpu usage info is in that same category: it is system-wide, not per process. (core count and memory total had the same issue, before we removed them, although they were as bad as the hostname: constant across time.)

It's always risky to add structure but it may be that we should change the data format. We could go to JSON. But we could also do a more lightweight csv (it's less work). Say the first record in a data package is:

v=0.11.1,timestamp=...,host=...,load=...

(where load is the vector of core utilization data) and then subsequent records are a little different:

*,user=...,cpu%=...,cmd=...

that is, per-process data as now. To make this even more resilient it would be possible for the first record to include a randomly generated ID that is repeated in the payload lines, but really it shouldn't matter except when multiple sonar runs mix their output, which has been safe until now but not a great idea.

On the ingestor side this would amount to creating transactions when appending to files, but this is easy now since there is one writer for a file, not a bunch of random processes.

An encoding experiment: on ML1 (56 cores), up 182 days (fairly typical), delta-encoding (relative to the minimum value) the user+nice+system values converted to seconds, then converted to an uleb128 stream without separators, then base64-encoded, yields 169 bytes of binary data and 226 bytes of text data, four bytes per core - more than I feared. The reason is that one of the cores is a real outlier so the delta-encoding does not yield much - every value except one is a six-digit number (well, two are seven-digit). One can be smarter here but the real fix is likely to not print this array once per process, but once per sonar run.
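For reference, the delta + uleb128 part of that experiment might look like this (my reconstruction, not the actual experiment code; base64 would then inflate the byte count by 4/3):

```rust
// Append v to out as unsigned LEB128: 7 bits per byte, high bit = "more".
fn uleb128(mut v: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
}

// Delta-encode per-core seconds against the minimum: the base value first,
// then one delta per core.
fn encode_deltas(secs: &[u64]) -> Vec<u8> {
    let min = secs.iter().copied().min().unwrap_or(0);
    let mut out = Vec::new();
    uleb128(min, &mut out);
    for &s in secs {
        uleb128(s - min, &mut out);
    }
    out
}

fn main() {
    // A single outlier keeps the other deltas large, as observed on ML1.
    let secs = [1_000_000, 1_000_050, 1_000_020, 4_000_000];
    println!("{} bytes", encode_deltas(&secs).len());
}
```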

(Instead of uleb delta to the minimum one could possibly do sleb delta to the median, may be worth trying for hack value, but the point about not printing redundancies stands.)

Here's an encoding trick that is simpler than the "structured csv" proposed above: For per-node information such as the per-cpu load, the number of cores, and the amount of memory, it is enough to emit the field once per sonar invocation, i.e. it can be emitted within any of the records for a (timestamp, hostname) pair, and sonalyze can pick it up and represent it as per-node data, not per-process data, if that is easier (it will be). This changes Sonar minimally, does not change the data format other than superficially (one more optional field), and makes everything conceptually cleaner.

Some constraints and factoids for the actual encoding:

  • 7-bit ASCII without using comma, doublequote, backslash, DEL or space (b/c they are special in the various output formats)
  • There are slightly less than 2^25 seconds in a year. Many nodes are up for a year, but few are probably up for more than 4 years (< 2^27 seconds).
  • 8 decimal digits are needed to represent the number of seconds in 1-3 years
  • 5 base-64 digits are needed to represent the number of seconds in 1-3 years
  • 5 base-45 digits are needed to represent the number of seconds in 1-4 years
  • 5 base-32 digits can represent the number of seconds in 1 year
  • Core counts per node seem to be no more than 256 at the moment. The A2 hpc nodes have 256 cores (2x128), it could look like the A2 gpu nodes have 288 cores (4x72) but it's not completely clear. Probably core counts will remain "moderate" for now: Linux apparently has a 256-core limit per node
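The digit counts above are easy to sanity-check (assuming a 365-day year):

```rust
// How many base-`base` digits are needed to represent n?
fn digits(mut n: u64, base: u64) -> u32 {
    let mut d = 1;
    while n >= base {
        n /= base;
        d += 1;
    }
    d
}

fn main() {
    const YEAR: u64 = 365 * 24 * 3600; // 31_536_000 seconds
    println!("3y decimal: {}", digits(3 * YEAR, 10)); // 8
    println!("3y base64:  {}", digits(3 * YEAR, 64)); // 5
    println!("4y base45:  {}", digits(4 * YEAR, 45)); // 5
    println!("1y base32:  {}", digits(YEAR, 32));     // 5
}
```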

Option 1: literal values are printed in decimal with some separator. We should assume this will commonly be 7-8 digits per value + the separator, say an average of 8 total. This isn't too bad. It would be worth it to reduce it by half, but not by a little.

Option 2: literal values offset from the minimum value in the set, with some separator. There's a risk of outliers so this will probably yield about 5 digits per value + the separator, an average of 6, and we'll need to print the base too (one extra value).

Option 3: as option 2, but encode the separator in the digits by using a larger character set (e.g., for 0 in the initial position print a, for 1 print b, and so on); now we're down to an average of about 5.

Option 4: as option 3 but use a larger base. We have enough characters for base-45 with the option 3 separator trick (2x45 = 90 different characters, leaving only = unused). This brings us down to about 4 characters per value, 1KB per sonar invocation for a 256-core node.
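A sketch of option 4 (the concrete alphabets here are an illustrative split of the 90 allowed characters; the actual tables would be a free choice):

```rust
// Two disjoint 45-character alphabets. A value's first (most significant)
// digit comes from INITIAL, subsequent digits from REST, so the stream is
// self-delimiting with no separator character.
const INITIAL: &[u8] = b"!#$%&'()*+-./0123456789:;<>?@ABCDEFGHIJKLMNOP";
const REST: &[u8] = b"QRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~";

fn encode45(mut v: u64) -> String {
    let mut digits = vec![(v % 45) as usize];
    v /= 45;
    while v > 0 {
        digits.push((v % 45) as usize);
        v /= 45;
    }
    let mut s = String::new();
    s.push(INITIAL[digits.pop().unwrap()] as char); // most significant digit
    while let Some(d) = digits.pop() {
        s.push(REST[d] as char);
    }
    s
}

fn main() {
    // 3 years of uptime in seconds fits in 5 base-45 digits.
    println!("{}", encode45(3 * 31_536_000));
}
```

Decoding is symmetric: a character from INITIAL starts a new value, a character from REST continues the current one.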

Option 5: as option 2, but uleb-encode into a byte array and then base64-encode this. This is "more standard" but experiments suggest it will still be about 4 characters per value b/c the base64 encoding inflates it.

Option 6: as option 5, but pick an optimal base value in the set rather than the minimum value, and use signed encoding. This will be less sensitive to outliers but the sign will need to be represented in every value. I have not run an experiment.
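The signed encoding in option 6 would just be standard SLEB128; a sketch (base selection against the median left out):

```rust
// Append v as signed LEB128: 7 bits per byte, high bit = "more"; stop when
// the remaining bits are all sign-extension of the 7-bit group just emitted.
fn sleb128(mut v: i64, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7; // arithmetic shift preserves the sign
        let done = (v == 0 && byte & 0x40 == 0) || (v == -1 && byte & 0x40 != 0);
        if done {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
}

fn main() {
    // Deltas in -64..=63 around a well-chosen base encode in a single byte.
    let mut out = Vec::new();
    for d in [-3i64, 0, 17, -60] {
        sleb128(d, &mut out);
    }
    println!("{} values in {} bytes", 4, out.len());
}
```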

There are other options but we're trying to keep this sane and simple. Option 4 is the simplest by far and yields a 2x improvement over the naive encoding, and to do much better we probably have to do actual compression to take advantage of the distribution of values in the data themselves. A Huffman code with a static distribution is appealing but would be brittle if the data do not fit the distribution. Anything with a dynamic compression scheme would need to represent the compression tables.

The encoding efficiency of the draft patch appears to be pretty good. On ML9, with 192 cores up for 32 days, the encoding uses 560 bytes, 2.92 bytes per cpu. The values in /proc/stat are mostly 5 decimal digits, so on average a naive encoding would use 6 bytes per cpu, including a separator. A binary uleb128 delta encoding would require 2 bytes for virtually all the items and would inflate to about 3 bytes when base64-encoded.