Fox config info is not right
Looks like this needs to be regenerated: gpu-9 is 4xA100 but we've recorded 2xH100, so the plots don't look right. There should probably be some kind of alert for this if we're not going to make the config info time-sensitive.
- Merge slurminfo and make-config-file; the current situation is an unmaintainable mess
- Be sure to update documentation
- Regenerate fox info and deploy it
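As a rough illustration of the alert idea above, here is a minimal sketch that compares the GPU setup recorded in the config with what a node most recently reported in its sysinfo. The `GpuInfo` type, the `find_config_drift` function, and the dict-of-hosts layout are all hypothetical and do not reflect the actual config or sysinfo formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuInfo:
    """Hypothetical summary of one node's GPUs (not the real schema)."""
    count: int
    model: str


def find_config_drift(config, sysinfo):
    """Return one message per host whose recorded GPUs differ from the reported ones.

    Both arguments are assumed to map host name -> GpuInfo.
    """
    alerts = []
    for host, reported in sorted(sysinfo.items()):
        recorded = config.get(host)
        if recorded is None:
            # The node reports data but the config does not know about it.
            alerts.append(
                f"{host}: reports {reported.count}x{reported.model} "
                "but has no config entry"
            )
        elif recorded != reported:
            # The config is stale relative to what the node reports now.
            alerts.append(
                f"{host}: config says {recorded.count}x{recorded.model}, "
                f"sysinfo reports {reported.count}x{reported.model}"
            )
    return alerts
```

With the gpu-9 case from this issue, `find_config_drift({"gpu-9": GpuInfo(2, "H100")}, {"gpu-9": GpuInfo(4, "A100")})` would yield a single message flagging the mismatch, which could then be logged or mailed.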
Actually, this affects the Fox GPU usage numbers, which are becoming a thing (#522), so we should look into this with some more haste.
I may be making this too complicated. In reality, we'll bring up a cluster by having it report some initial data, including sysinfo. In principle it should be able to do that without referring to the config file. (The add operation does not require the config to know about every host, though it is good to have at least an empty config present.) Then, after a day or so, we'll run make-config-info to generate an initial config file from the sysinfo. This file will be missing all nodes that were down at the time, but they can be added by hand rather than from the background file; that process is fairly tricky anyway. Until the config file is available we can't use the dashboard or remote sonalyze, but this is not all that important. Once the config file is present the server has to be restarted. We probably want it to have some sort of restart functionality anyway.
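The bootstrap flow described above could be sketched roughly as follows: build an initial config from whatever sysinfo has been collected, then fill in hand-written entries for nodes that were down. The record layout, the `build_initial_config` name, and the rule that sysinfo data takes precedence over manual entries are assumptions for illustration only, not how the actual tool works.

```python
def build_initial_config(sysinfo_records, manual_entries):
    """Build a host -> description mapping from sysinfo plus hand-added nodes.

    sysinfo_records: iterable of dicts, each with a "host" key (assumed layout),
                     ordered oldest to newest.
    manual_entries:  dict of host -> description for nodes that were down
                     while sysinfo was being collected.
    """
    config = {}
    for rec in sysinfo_records:
        # A later record for the same host replaces an earlier one.
        config[rec["host"]] = {k: v for k, v in rec.items() if k != "host"}
    for host, entry in manual_entries.items():
        # Manual entries only fill gaps; they never override reported data.
        config.setdefault(host, entry)
    # Sort by host name so the generated file is stable across runs.
    return dict(sorted(config.items()))
```

For example, calling it with one sysinfo record for gpu-9 and a manual entry for a node that was down produces a config containing both hosts, with gpu-9 described by what it actually reported.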