Solve performance issues
sebastianelsner opened this issue · 16 comments
Hello we are running into performance issues where the solve thread on the server is running at 100% and at some point the server is crashing.
I attached perf
to the server and found that it is using a lot of its time in "hasTicket", "hasPoolTicket" and "hasHostTickets" calls, as well as a bit less time in regex matching. In the next images you can see the cumulated times used in functions. First read from bottom up, which is the start of the afserver. We are also only looking at the solve thread. The dark blue underlined bars are the time spend in regexes, the light blue bars are the time spend in hasXXXTickets. Especially the ::find(string) calls of the underlying ticket map seem to hit hard.
The next images are inverted, so the long calls are at the bottom. First image hightligts the "hasXXXTicket", the second image hightlights the hostmask regexes.
The broader a bar is at the bottom the longer the function took cumulatively. You can see map::find() takes quite some time cumulatively
How can we speed up the hasTicket queries?
Can we do something about how afserver is compiled?
Can we switch out the implementation to something faster (I think unordered_map is faster in access).
Do you have any other ideas?
Parallelize some things?
Regards
Sebastian
Since I see a mention of memcmp_avx2 would it help if we would run afserver on a CPU with avx512 support or does this make no difference?
Hello!
You can try to compile afsever
with any flags.
We can try to use unordered_map
.
Server already has several threads, for each type of operations.
Read/write network, write store, main thread for run cycle to refresh and solve.
It will very hard to parallelize jobs refreshing and solving. As each should increment and decrement various counts, to there should be lots of synchronization, and this can level out parallelization.
Hey Timur,
can you tell me which check comes first: check if hostmask of a job/task matches or if the host has the proper ticket?
hostmask comes first
Would adding pools instead of hostmasks to a job increase its performance? For example, Berlin: 100, rest: 0?
Currently, we have a pool per site, but we don't assign them to jobs. Since our hostnames vary per site, we ensure that a job runs on the intended site by only setting a hostmask.
I mean, if we need to check a Job against ~10 pools, vs a jobs hostsmask against ~1000 nodes it should increase our performance, I would expect.
Thanks,
Jan
Hey Timur,
we have also noticed that the cpu load of afserver goes up just by having lots of "DON" jobs.
we usually have 1000 RDY jobs and sometime between 4000-10000 DON jobs and they seem to have a huge impact on the server performance.
shouldn't those jobs be skipped really fast? is there anything that the server needs to check on them?
cheers
Oli
Hello!
- I think that using pools instead of hosts mask can increase performance. You can simple test it, as you have lots of jobs and renders.
- DONE jobs are skipped in solving. Unfortunatelly DONE jobs refreshes same way as not DONE. Any node runs refresh in Afanasy every second. Jobs check depends and refreshes all task to refresh counts and state. This needed, for example, if you restart task of some done job.
But i see that we can optimize refresh, i think that we can totally skip refresh of a DONE job, and run refresh on any action (restart something).
- DONE jobs are skipped in solving. Unfortunatelly DONE jobs refreshes same way as not DONE. Any node runs refresh in Afanasy every second. Jobs check depends and refreshes all task to refresh counts and state. This needed, for example, if you restart task of some done job.
But i see that we can optimize refresh, i think that we can totally skip refresh of a DONE job, and run refresh on any action (restart something).
Hi Timur, since we are really struggling with many DON jobs right now (that we really need to keep) I wanted to ask if your proposal could be implemented in the near future? maybe the next 1-2 months? We already run automatic job cleanup-scripts and we cannot delete any more jobs because artists still need them.
this would certainly help reduce the pressure on our farm enormously.
i have created a follow up ticket regarding multi-threading improvement over here #594
Our afserver currently uses the std:regex library to match the hostmasks, but can also use the posix regex libary. The former is "new" for us under almalinux 9, the latter was previously used by us on centos 6 (and maybe partly centos 7). I now had a look at the regex library speeds of std::regex and posix regex. It turns out, what the rest of the internet also says, that std::regex is much slower. factor 2-3. i then built a comparison with my incredible c++ knowledge (and maybe a bit of ChatGPT). Result: with our hostmasks the difference is clear, posix is 3x faster.
Total POSIX Regex time: 400 microseconds.
Total std::regex time: 1225 microseconds.
Posix regex is built into af, but you can't control which one it uses. it decides itself at compile time based on the compiler version. So I would have to add the option to change via command line. feature-wise the engines are the same - at least for our usecases. we're even more likely to go back to our "old setup", because the fact that std::regex was used instead of posix is "rather coincidental" due to the no-so-transparent selection process with the upgrade to alma.
@timurhai do you see any issue arising from changing the posix lib even when under std-c++11 compile flag?
Yes, we can add some -posixregex
flag to build.sh script to force to use posix regex libary.
We can also implement some simplemask
mode (on a job, user, block), where we can just use find
function to find some part of string in other.
May be in 99.9% cases we do not need complex regular expressions.
ps
I want so say, that regex type can be not compiled-in, but can be a paremeter.
Yes! It can help, but like I wrote there it will also limit a few users, so we need to talk about it internally.
As you can see above in the perf screenshots most time is spent in "hasTickets". Maybe you have an idea how to speed that up additionally? I think the main issue comes from having to walk all the way for each task to each render to each pool the render is in. many lookups in the ticket maps, which take long.
Interesting idea, but regex matching is not our problem at RISE. in our case the hasTickets function is still the bottleneck. maybe we should rather look into this issue > #596