joyent/libuv

Change default thread pool size from a fixed 4 to the number of logical cores

Closed this issue · 12 comments

Hi.

I've been following libuv for a couple of months and I'm currently using its thread pool. I have read the other bug reports about implementing automatic detection of the number of threads, and those reports have all been closed without any actual action being taken. libuv 1.0 is coming up, and the only thing that seems to have happened to the thread pool is that the Windows version now also has 4 fixed threads by default:

Quote from the docs:
"Its default size is 4, but it can be changed at startup time by setting the UV_THREADPOOL_SIZE environment variable
to any value (the absolute maximum is 128)."

It's a no-brainer that having 4 threads in the pool on a machine that has 8 cores is not ideal. Using a more appropriate default number of threads should be a simple fix of one to five lines of code:

Windows:

SYSTEM_INFO sysinfo;
GetSystemInfo(&sysinfo);
DWORD numCPU = sysinfo.dwNumberOfProcessors;

Unix:

long numCPU = sysconf(_SC_NPROCESSORS_ONLN);

(Code taken from a quick search on StackOverflow).
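For completeness, a combined, compilable version of the two snippets above might look like this (a minimal sketch; the function name logical_core_count is mine, and the fallback to 1 on failure is an assumption):

#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <unistd.h>
#endif

/* Returns the number of logical cores, falling back to 1 if detection fails. */
static unsigned int logical_core_count(void) {
#ifdef _WIN32
    SYSTEM_INFO sysinfo;
    GetSystemInfo(&sysinfo);
    return (unsigned int) sysinfo.dwNumberOfProcessors;
#else
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (unsigned int) n : 1;
#endif
}

int main(void) {
    printf("logical cores: %u\n", logical_core_count());
    return 0;
}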

Using the number of logical cores instead of the number of physical cores is key to taking advantage of Hyper-Threading and similar features.

Of course: if you want me to fix it, I'll fix it!

txdv commented

You do realize that there is no correlation between the number of cores and IO throughput?

The thread pool is basically for IO operations which have no asynchronous counterpart in the OS. Your solution would bring no benefit.

In fact, on a single-core CPU it would always block after one operation, even though realistically there could be more going on.

There is no good way to solve this in user space. This is a problem that has to be solved at the kernel level.

I do realize that if the bottleneck is IO, then adding more threads (CPU time) to the pool would not make the IO any faster. Yes.

That's not my point. My point is improving the thread pool, not the IO making use of the thread pool.

"In fact, on a single core CPU it will always block after 1 operation, even though there could be realistically more stuff going on." -> Then make it max(4, number_of_logical_cores). sigh

"There is no benefit your solution would bring." -> Libuv is (from reading the docs) a general C library packed with features. One of the features is, the thread pool. How one uses libuv and it's thread pool is up to each person. I don't see in the docs that you have to use the thread pool for just IO.

Creating a thread pool with 4 threads on an 8-core machine is not ideal. I have personally never seen a thread pool implementation use a fixed number of threads. Look at boost::threadpool, QThreadPool, OpenMP, etc. They all use an ideal number of threads. QThreadPool even has a function named int QThread::idealThreadCount():

"Returns the ideal number of threads that can be run on the system. This is done querying the number of processor cores, both real and logical, in the system. This function returns -1 if the number of processor cores could not be detected."

C++11 also comes with a similar function: std::thread::hardware_concurrency(), which returns an unsigned int.

It's not rocket science. I don't understand why I need to explain and motivate this change.

txdv commented

You do understand that we don't run the threads in that thread pool hot? They are immediately blocked by some IO operation. The ideal pool size doesn't correlate with the number of CPUs/cores.

It's not rocket science.

txdv commented

The presence of the maximum hardware concurrency variable in the frameworks you mention has a very specific use: you have a thread pool whose size equals the number of computational units that can perform computation concurrently. That is very relevant for something like a physics engine, where you want to make all your cores sweat (as in, let them do as much computation as possible). You have M operations which are non-blocking, purely computational operations, and you have to distribute them across N cores.

However, every request we push to the thread pool in libuv is bound by IO immediately; we do it because there is no asynchronous counterpart to a specific operation, so we emulate one with the thread pool. For example, Linux has no asynchronous IO operations for files (open/read/write/close), so we push them to the thread pool and let it handle them as if they were asynchronous.
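For illustration, this is what that emulation looks like from the caller's side: the blocking open(2) runs on a thread pool thread, while the callback fires on the loop thread. A minimal, POSIX-flavoured sketch using the public uv_fs_* API (the file path and error handling are illustrative only):

#include <fcntl.h>
#include <stdio.h>
#include <uv.h>

static void on_open(uv_fs_t* req) {
    /* req->result is the file descriptor, or a negative error code. */
    if (req->result >= 0)
        printf("opened fd %d via the thread pool\n", (int) req->result);
    else
        fprintf(stderr, "open error: %s\n", uv_strerror((int) req->result));
    uv_fs_req_cleanup(req);
}

int main(void) {
    uv_fs_t open_req;
    /* Looks asynchronous to the caller; the blocking open() call is
     * actually executed on one of the thread pool threads. */
    uv_fs_open(uv_default_loop(), &open_req, "/etc/hostname", O_RDONLY, 0, on_open);
    return uv_run(uv_default_loop(), UV_RUN_DEFAULT);
}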

So if we have 4 blocking IO operations (operations on sda), and a fifth that is possibly non-blocking or would finish faster than the first 4 (an operation on sdb, or the network), and we have a thread pool size of 4 (which happens to be that machine's number of cores), then we still lose, because the thread pool will not execute the fifth operation until the first 4 are done.

Hardware concurrency does not correlate with the ideal number of threads in libuv's thread pool for our usage. Ideally we would have a thread pool for every blocking device we operate on, but it is impossible to distinguish between all possible IO devices with the API that is exposed at the userland level.

Your argument is: look, these libraries have that hardware concurrency variable, so we should have it too. But we do not expose the thread pool for user use; we use it internally to emulate asynchronicity.

From reading the prior bug report (#649), I get the impression you were actually planning a change in future versions. But never mind that. Let's just start by correcting some bullshit you just spewed out:

  1. Linux 2.6 (since 2003) comes with kernel-level async IO for files (aio_read, aio_write, signals, etc.) - use that.
  2. Yes, you actually do expose the thread pool to the user. Look up uv_queue_work in the docs - I already use it for CPU-bound tasks.

Just because IO doesn't get better with more threads doesn't mean you should cripple the thread pool by default, making CPU-bound tasks slower than they should be. Having 8 threads in the pool on an 8-core machine is always better than having 4, even if IO stays the same. Why? Because you cover two use cases: IO is happy and CPU tasks are happy. But hey - if you want to keep the thread pool a crippled by-product of your own ignorance of Linux AIO support - go ahead. I'm out.

txdv commented

The thread pool is not available to the user.

AIO is not ready. I've checked it, and the author of libtorrent has checked it and written an article about it. It supports only reads/writes (leaving open and close to the thread pool), and only if the IO is page-aligned, so it is mostly useful for databases.

bullshit you just spewed out

You are trying to prove something on the internet. If you really think you are that smart, then just write a patch with the AIO functionality for Linux. We will gladly accept it.

txdv commented

Here, Hercules: http://blog.libtorrent.org/2012/10/asynchronous-disk-io/

Some observations and directions; maybe they will help you when you write the AIO patch.

This got heated for no reason.

The thread pool is not available to the user.

Okay, but in the README.md it's listed as a feature highlight, which would imply (at least it confused me this way, as it apparently also confused @alexhultman) that there's a thread pool interface which the library user can use. If it's an implementation detail used to achieve asynchronous I/O, then perhaps it shouldn't be listed as a feature highlight?

Also, back to the original topic of this discussion: would it be bad if the event loop used more than 4 threads in an environment that would benefit from more threads? I know it wouldn't improve IO, but it's my understanding (which might be wrong) that uv_queue_work() queues work_cb to be executed inside the thread pool. So wouldn't a bigger pool increase performance on a machine with more cores (assuming no IO is being submitted to the queue)?

txdv commented

I stand corrected, uv_queue_work exposes the thread pool to the user, but...

The idea of posting long-running computational tasks to it is not a very good one, though. libuv uses the thread pool internally to emulate async behaviour for blocking-only functions (getaddrinfo, fs operations on Linux). So if you post 4 long-running computational tasks to it with uv_queue_work (while the thread pool size is 4), you effectively deny yourself the use of file system operations on Linux and of uv_getaddrinfo/getaddrinfo. You should only post work that blocks for a short amount of time.
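A minimal sketch of that kind of short-lived task (the callback names and the doubling example are mine, not from libuv):

#include <stdio.h>
#include <uv.h>

static void work_cb(uv_work_t* req) {
    /* Runs on a thread pool thread. Keep it short: it shares the pool
     * with the fs and getaddrinfo requests mentioned above. */
    int* n = req->data;
    *n *= 2;
}

static void after_work_cb(uv_work_t* req, int status) {
    /* Runs back on the loop thread once the work is done. */
    printf("result: %d (status %d)\n", *(int*) req->data, status);
}

int main(void) {
    int value = 21;
    uv_work_t req;
    req.data = &value;
    uv_queue_work(uv_default_loop(), &req, work_cb, after_work_cb);
    return uv_run(uv_default_loop(), UV_RUN_DEFAULT);
}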

I am actually confused as to why we expose it at all right now, since users can basically screw themselves over by misusing it... but there is a lot of not-so-smart stuff in this library which needs some thought.

would it be bad, if the event loop used more than 4 threads on an environment that would benefit from more threads?

Increasing it has no negative impact. The only reason it is 4 is that there is no deterministic way to pick a perfect size for the thread pool. Ideally we wouldn't rely on the thread pool at all, but asynchronous file IO on POSIX/Linux is currently a mess, and there is no other way to emulate asynchronous getaddrinfo and getnameinfo. You can change the size however you see fit.

If you want to use it for computation, then you should increase it to M + N, where M is the number of parallel computations you want to run and N is what is left for blocking IO (currently hard-coded to 4). There is currently no C API to change the size of the thread pool from within C, because everyone is too lazy to write a patch.
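Since the size is only read from the environment when the pool first spins up, it can also be set from C before any request touches the thread pool. A sketch (the value 12 stands in for an assumed M = 8 compute threads plus N = 4 IO threads):

#include <stdlib.h>
#include <uv.h>

int main(void) {
    /* Must happen before the first request touches the thread pool;
     * once the pool has started, its size is fixed for the process. */
    setenv("UV_THREADPOOL_SIZE", "12", 1);

    uv_loop_t* loop = uv_default_loop();
    /* ... queue work, open files, run the loop ... */
    return uv_run(loop, UV_RUN_DEFAULT);
}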

My personal advice is not to touch this; just use your own thread pool or something. Because if you mess up, you won't be able to use getnameinfo, getaddrinfo, or any file operation on Linux.

txdv commented

The limitations are very clearly documented here: http://docs.libuv.org/en/latest/threadpool.html

It is a global singleton, and the number of threads can only be set through an environment variable. For my personal taste, this thread pool sucks for any kind of operation except emulating blocking operations as non-blocking ones.

I don't really know why it is available to the user at all...

This got heated for no reason.

@alexhultman is a douche. I just tried to write down my perspective, and he responds with:

Let's just start by correcting some bullshit you just spewed out:
Linux 2.6 (since 2003) comes with kernel-level async IO for files (aio_read, aio_write, signals, etc.) - use that.

@bnoordhuis has looked into this and said that AIO is not ready. I have been looking into this for quite some time, even digging into the kernel to check whether everything is truly asynchronous, but then @alexhultman comes along, calls me a retard, and states that AIO is ready to use right now.

We welcome patches.

I thought we were all grown-ups around here, but it seems not. I'm closing this because this is not how we want our community to work. If you (both) want to use this kind of language and these manners, just go somewhere else; you are not welcome.

@alexhultman if you think we can do better, by all means write a patch, benchmark it, and submit it. Some things have been attempted in the past, though without much success.

@txdv please stop using "we" for your assertions, which do not represent this project whatsoever; this is not how we deal with people and issues in our bug tracker. I can read past your edits.