NUMAyei

[EXPERIMENTAL] Custom NUMA node scheduler: distributes Windows programs that are not NUMA-aware evenly across several NUMA nodes

Primary language: C++. License: MIT.

NUMAyei scheduler (non-NUMA app -> NUMA app)

Attention:

  • This is a very crude solution. Never use it with multiplayer games: it relies on the same function-hooking principle that cheats use.

  • Use only a Release build of the DLL for injection! The performance of the Debug configuration will spoil the NUMAyei scheduler.

Where did idea come from?

I watched the video "How Bad is This $10,000 PC from 10 Years Ago??" on the LinusTechTips YouTube channel (@LinusTechTips), and it amused me that they ran Crysis 2 on a server with 4x GPU SLI and saw no difference in the FPS metrics. What they did not realize is that the load fell on a single CPU: the game created its threads on one processor rather than on both, so the FPS stayed the same whether they used one new GPU or the old 4x GPU SLI setup.

Also, the game was most likely made for 2x SLI at most. I have plans to fix this as well: in theory, I can make a hack that unlocks SLI scaling across more GPUs.

I hope that @nharris-lmg will be able to notify the LTT team about this so they can re-test Crysis 2 on the old NUMA server motherboard using this hack. Don't forget to enable the NUMA feature in the BIOS!

Moment from video:

Both CPUs fully utilized

TODO:

  • Hook VirtualAlloc and redirect it to VirtualAllocExNuma for each NUMA node
  • Hook every OpenProcess and CreateProcess call to migrate processes to other NUMA nodes
  • Hook every method of detecting cores and threads
  • Hook process launch via double-click or the right-click context menu, so that non-pro users do not have to run cmd.exe or PowerShell (in the future: numayei.exe ./binary_non_numa.exe)
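
The first TODO item could look roughly like the sketch below. It is only an illustration, not code from the NUMAyei sources: `alloc_on_node` and `free_on_node` are hypothetical names, and the hooking itself (replacing the target's `VirtualAlloc` with such a wrapper) is left out. On Windows the wrapper calls `VirtualAllocExNuma` with a preferred node; the non-Windows branch exists only so the sketch compiles elsewhere and does no NUMA placement at all.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper (not from the NUMAyei sources): allocate `size` bytes
// of read/write memory, asking the OS to prefer physical pages from `node`.
void* alloc_on_node(std::size_t size, unsigned node) {
    assert(size > 0 && "allocation size must be non-zero");
#ifdef _WIN32
    // Same arguments a plain VirtualAlloc would take, plus the preferred node.
    return VirtualAllocExNuma(GetCurrentProcess(), nullptr, size,
                              MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                              static_cast<DWORD>(node));
#else
    // Fallback so the sketch builds on other platforms; no NUMA placement here.
    (void)node;
    return std::malloc(size);
#endif
}

// Hypothetical counterpart that releases memory obtained from alloc_on_node.
void free_on_node(void* p) {
#ifdef _WIN32
    if (p != nullptr) {
        VirtualFree(p, 0, MEM_RELEASE);
    }
#else
    std::free(p);
#endif
}
```

The node passed in is only a preference: the pages are physically placed when they are first touched, and the system may fall back to another node if the preferred one is out of memory.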

Summary:

This will not make sense if the running program does not itself split its work into WinAPI threads using CreateThread() and similar functions.

The load-balancing mechanism is simple. Most programs never call the functions that specify a NUMA node or processor group, so by default the very first one is selected (only NUMA 0 or only NUMA 1).

If we inject the NUMAyei scheduler into an app, we can redefine all of these functions so that they assign work to every NUMA node in the system and use the NUMA allocator.
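
A minimal sketch of the balancing idea described above, assuming a plain round-robin policy (the actual policy in NUMAyei may differ). `NodeRoundRobin` is a hypothetical name; a hooked `CreateThread` could ask it which node each new thread should be placed on.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the NUMAyei sources): hands out NUMA node
// indices in round-robin order, so each new thread goes to the next node
// instead of always landing on the first one.
class NodeRoundRobin {
public:
    explicit NodeRoundRobin(std::uint32_t node_count) : node_count_(node_count) {
        assert(node_count > 0 && "the system must have at least one NUMA node");
    }

    // Thread-safe: several hooked thread-creation calls may run concurrently.
    std::uint32_t next() {
        const std::uint64_t ticket =
            counter_.fetch_add(1, std::memory_order_relaxed);
        return static_cast<std::uint32_t>(ticket % node_count_);
    }

private:
    std::uint32_t node_count_;
    std::atomic<std::uint64_t> counter_{0};
};
```

With two nodes, successive calls to `next()` yield 0, 1, 0, 1 and so on, so new threads alternate between the two sockets.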

I strongly remind you that programs already optimized for NUMA will not gain any significant performance. Examples that I tested: XMRig miner and some benchmarks such as Corona, Blender, etc.

It would be good to have some kind of text database on the GitHub Wiki with compatibility info and the performance increase as a percentage.

Why should you use the NUMAyei scheduler?

No one forces you to do this, but you will notice that the CPUs' power consumption decreases while their performance increases. With my thread scheduler, the whole system is loaded to a full 100%!

Without it, only one NUMA node does the work: the Windows scheduler has to run that node at the highest frequency to finish the task faster, while the second node sits idle, performing only background tasks unrelated to the process you actually care about.

Requirements:

  • Windows 11 or later (I will try to support older versions)
  • Any DLL injector
  • A built NUMADLL.dll (once I set up GitHub CI, you can download the latest binaries from here)

Tested on:

  • Windows 11 Pro 23H2 [22631.3296]
  • 2 NUMA nodes (dual socket)
  • Xeon E5 v3-v4 family CPU

How to Use?

  1. Download any DLL injector; I advise choosing an open-source one. (I tested with the Xenos64 injector.)
  2. Select the compiled NUMADLL.dll and inject it into any process (any injection method works).
  3. In the future, NUMA.exe will act as the injector, and its parameters will let you specify either the path to an exe to launch or an already running process by PID.

The screenshot below is a good example of using NUMAyei with a running binary that is not NUMA-aware.

NUMA run binary optimization

NUMA nodes fully utilized

Full NUMA implementation on Windows, verified with the CPU-Z benchmark

Updated with new NUMA allocation (beginning with commit 6c6fb5)

In new CPU-Z version:

Two old server CPUs equal a modern CPU

Xeon E5-2699 v3 (dual socket) vs modern CPUs