qbittorrent/qBittorrent

Default value for Asynchronous I/O threads should be changed

FranciscoPombal opened this issue · 29 comments

Please provide the following information

qBittorrent version and Operating System

Any version on any OS.

If on linux, libtorrent-rasterbar and Qt version

Not applicable

What is the problem

qBittorrent currently ships with the Asynchronous I/O threads advanced libtorrent config set to 4. As per my own testing based on this comment, SHA-1 hashing happens on n/4 threads, where n is the value of this setting. So, having this set to 4 leaves a lot of performance on the table, especially in rechecks, on most modern machines, which usually have at least 4 hardware threads (e.g. 2c/4t CPUs in notebooks).

What is the expected behavior

qBittorrent should either ship with this set to at least 16 or have some way of auto-setting it to 4*n_hardware_threads.

Steps to reproduce

1. Download a moderately large torrent that takes some time to recheck.
2. Force recheck the torrent.
3. Observe how only one thread does the work.
4. Change Asynchronous I/O threads to n*4, where n is the number of hardware threads in your machine.
5. Force recheck again, observe how many threads do the work and that it completes much faster.

Extra info(if any)

I edited the wiki page with these findings - https://github.com/qbittorrent/qBittorrent/wiki/Explanation-of-Options-in-qBittorrent/957bb3f50d0cff41f33b58d90b581409a648b089

Setting values greater than n*4 for this setting will most likely not help and at worst reduce performance as well due to extra thread overhead.

Increasing the value of this setting might not help (at least very much) if the storage medium is a severe bottleneck for the system already, but in that case it should not hurt much if at all either.

@arvidn
I'd like your input on this, namely regarding the correctness of my analysis and whether or not all of this is expected behaviour from qBittorrent/libtorrent.

From Libtorrent

If disk IO is a bottleneck (which I would expect it to be given a fast internet connection), you probably want to set aio_threads to something much greater than 2. probably 16 or 32.

If I understand that setting correctly, the value corresponds to engaged CPU threads. If you have a hyperthreaded 2-core CPU, then setting the value to 4 would engage all 4 threads. But I read somewhere that utilizing both threads on 1 core could mean it's fighting itself over resources, so the value should be the number of physical cores. But maybe that is not correct considering the values you put.

Those are specifically disk I/O threads, so they will mostly either be idle (waiting for work) or suspended in a blocking disk I/O system call. Either way, they won't do a lot of actual computation.
However, one in every 4 disk I/O threads is dedicated to performing the SHA-1 hashing of incoming data, so those can actually end up using CPU while downloading.

Nice, this pretty much confirms my initial analysis/prediction.

We just went through disabling qBittorrent from being able to perform multiple force-rechecks at the same time. In the majority of machine configurations, you do not want to have two torrents being checked / rechecked, otherwise it causes undue disk thrashing and random seeks. Almost all drives perform best with sequential seeks, reading one-file-at-a-time.

Sure, but this discussion is not about force-rechecks, it's about the SHA-1 calculations of pieces that are downloading.

It's 2020, and basically every x86 processor released in the last 10 years, including Atoms, Celerons, and Pentiums, both on desktops and on laptops, has at least 2c/2t. There are about 10 exceptions that I could find to this rule: lower-power Celeron SKUs released as late as 2013 on the Intel side, and about as many or fewer 2011 Semprons and the like on the AMD side.[1,2] But surely machines using those processors are unusable for basically anything else as well at this point, for more reasons than just core count.

If qBittorrent's policy is to ship default settings suited to the majority of users ("average use case"), I would argue that anyone using a single-core processor to run qBittorrent in 2020 is as much a part of the "average" as someone with a 64-core Threadripper, i.e., not at all.

In fact, I would argue that anyone running 2c/2t also falls into the same category (i.e. not part of the "average" and also pretty much unusable).
Nowadays the cutoff for anything remotely usable in general is 2c/4t (and it is this low just because of laptops/ultrabooks, which are only now totally moving away from 2c/4t for higher core counts; the average desktop surely has to be at least at 4c/4t nowadays).

As such, the default value for "Asynchronous I/O threads" should be bumped at an absolute minimum to 8, and ideally to 16, depending on whether we want the lower bound of "average" to be at 2c/2t or 2c/4t, respectively.

As a reminder, setting "Asynchronous I/O threads" to N spawns N/4 threads for SHA-1 hashing when rechecking, etc. The remaining N - N/4 are disk I/O threads which are mostly sitting idle, and will basically never be the bottleneck. This has nothing to do with simultaneous moving/rechecking of multiple torrents. By shipping with the default at 4, qBittorrent leaves a lot of performance on the table for pretty much all of its users, since it forces libtorrent to only ever use 1 SHA-1 hasher thread, thus only taking advantage of one core/thread.
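To make the N/4 arithmetic concrete, here is a minimal sketch of the split as described in this thread for libtorrent <= 1.2. `ThreadSplit` and `split_aio_threads` are hypothetical names used only for illustration, not libtorrent API, and the integer division is an assumption based on the description above.

```cpp
// Sketch of the split described in this thread (libtorrent <= 1.2):
// one in every 4 async I/O threads is a SHA-1 hasher (integer division assumed).
struct ThreadSplit {
    int hashers;  // threads doing SHA-1 hashing
    int disk_io;  // remaining threads doing blocking disk I/O
};

// Hypothetical helper, not part of libtorrent.
constexpr ThreadSplit split_aio_threads(int aio_threads)
{
    return ThreadSplit{aio_threads / 4, aio_threads - aio_threads / 4};
}
```

With the current default of 4 this gives 1 hasher and 3 disk I/O threads; with 16 it gives 4 hashers and 12 disk I/O threads.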

[1] https://en.wikipedia.org/wiki/Comparison_of_Intel_processors
[2] https://en.wikipedia.org/wiki/List_of_AMD_microprocessors

@FranciscoPombal Perhaps, with the ever increasing core/thread counts on next-gen processors, the "Max" limit needs to be increased too, as currently "Asynchronous I/O threads" can't be set any higher than 1024. Alternatively, an algorithm could be applied in the background to "auto-tune" the threads to a user's system, similar to the "auto disk cache algorithm".

@xavier2k6 to be fair, that's 256 max hash threads. Even if a prosumer chip appears in the market in the next year or two that has that many threads, at that point I expect the network/disk throughputs will have long become the bottlenecks. Even when using a PCIe 4 SSD, and assuming ridiculous network speeds, there is probably no difference between using say 64 and 32 hash threads.

The main concern is using too few threads on "normal" core-count CPUs. The big difference is between using 1 vs 4 or 4 vs 8 with fast enough storage.
The "auto-tune" could solve this problem. @arvidn, is it possible for libtorrent to set the number of async IO threads to 4x the number of hardware threads automagically? Assuming of course 1 in every 4 async IO threads is a hash thread.

It's all about future-proofing......lol

It'll be interesting if libtorrent could apply some logic in the background & that way the option doesn't really need to be exposed in qBittorrent's GUI.

Set it to Auto by default & be done.

Can we have an "Auto / Optimum" option that just counts the cores and threads at startup? This might help ease a user's comfort and confidence in knowing they have it set to a correct and sane value, because the software performs actual detection on the lay user's behalf?

is it possible for libtorrent to set the number of async IO threads to 4x the number of hardware threads automagically? Assuming of course 1 in every 4 async IO threads is a hash thread.

this would be very easy to do for a client though (the way libtorrent implements settings right now, it's not trivial to have dynamically determined default values). but:

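// needs #include <thread>; hardware_concurrency() may return 0 if the count is unknown
sett.set_int(lt::settings_pack::aio_threads, int(std::thread::hardware_concurrency()) * 4);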
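A slightly more defensive version of that idea could look like the sketch below. It only computes the value; passing it to `set_int` is left to the client. The names `aio_threads_for`, `auto_aio_threads`, `kMaxAioThreads` and `kFallbackAioThreads` are hypothetical, the cap of 1024 is the qBittorrent GUI maximum mentioned earlier in this thread, and the fallback of 8 is an assumption for the case where the hardware thread count cannot be determined.

```cpp
#include <thread>

// qBittorrent's GUI maximum for this option, as mentioned in this thread.
constexpr int kMaxAioThreads = 1024;
// Assumed fallback when the hardware thread count is unknown (reported as 0).
constexpr int kFallbackAioThreads = 8;

// 4 async I/O threads per hardware thread, clamped to the GUI maximum.
// Hypothetical helper, not part of libtorrent or qBittorrent.
constexpr int aio_threads_for(unsigned hw_threads)
{
    return hw_threads == 0
        ? kFallbackAioThreads
        : hw_threads >= static_cast<unsigned>(kMaxAioThreads / 4)
            ? kMaxAioThreads
            : static_cast<int>(hw_threads) * 4;
}

// Value for the machine this runs on.
inline int auto_aio_threads()
{
    return aio_threads_for(std::thread::hardware_concurrency());
}
```

A client would then call something like `sett.set_int(lt::settings_pack::aio_threads, auto_aio_threads());` at startup.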

So has a consensus been reached on this? Regarding whether it's 4x the physical cores, or 4x the threads? Or neither?

Trying to make a semi-educated decision on how to set mine. I use an 8700K (12 threads) but it isn't a seedbox. It's my AIO, so I want to give it a bit more performance in the torrent department, but without wreaking havoc on the big picture.

@klepp0906

There is no magic here - the more hashing threads you have, the faster you can hash until you either saturate your CPU's hashing capacity, or until your storage/memory can no longer feed (with pieces) the CPU fast enough for the extra threads to be utilized.

Just set it to 4x the number of hardware threads. Even if your storage is too slow to take advantage of all the threads, the extra threads won't hurt your performance in other applications if they are not doing much work (the overhead from spawning a few extra threads, up to 4x the number of hardware threads vs the "optimum" number, should be negligible). On the other hand, if you set the option to less than 4x the number of hardware threads (4x the number of physical cores, for example), you risk leaving free performance on the table, depending on how fast your storage/memory is.

excellent. thank you for the reply.

I didn't want to bottleneck anything else that, in the end, was more important or at least higher priority. Still, as you said, performance on the table is no good either.

We'll do 4x the physical cores then. Will set it to 24 (has been set at 8)

You should see a significant speed increase from 8 to 24, provided your memory/storage/network are not very slow. For example, if you are using pretty much any SSD, you should see much faster recheck speeds. You can comfortably set the threads to 48. Remember, in libtorrent there is only 1 hashing thread for every 4 async I/O threads, so 48 async I/O threads means 12 hashing threads, which in your case is exactly one per hardware thread. This is the number that should get you peak performance from your CPU, provided storage/memory/network can keep up.

now that drives are faster than they used to be, maybe the ratio of 1 out of 4 being a hasher thread should be reconsidered. Or maybe they should be configurable independently.

I think there's a risk of configuring too many io threads. the I/O queue could get too deep and add unreasonable queuing latency.

@arvidn

+1 for independent configuration. Currently, the name of the setting does not make it obvious at all that one thread will be a hasher thread.
As for queue depth/latency, it would probably be interesting to benchmark that, to try to figure out a reasonable default value, but I don't imagine it to be easy to do.

I concur. It makes sense to separate the hashing configuration from general AIO. It sounds like the AIO configuration is probably a reasonable default, but that coupling it to re-checks is confusing to users.

@arvidn

now that drives are faster than they used to be, maybe the ratio of 1 out of 4 being a hasher thread should be reconsidered. Or maybe they should be configurable independently.

Do you plan to address this in time for the libtorrent 2.0 release (or possibly as another patch for RC_1_2)?

I'm just experimenting with a reasonable default for 2.0. But I don't think adding another configuration option will make it in 2.0

@arvidn

Well, that's too bad, that really seemed like the best solution - though I completely understand if it is the case that it's rather involved to make it separately configurable.

About the reasonable default: I would suggest 16 or 32, unless you plan on changing the 1:4 ratio to something else as well.

yeah, I think a higher default is most likely reasonable. I'm just collecting some data to support my feeling right now :)

also worth pointing out, the aio_threads setting is the max number of threads. Threads are created dynamically on demand, so it's not necessarily a cost for scenarios where it's not necessary

@FranciscoPombal perhaps there could be something done on qBittorrent side (until libtorrent 2.x hits ) with those experimental results for current defaults as current qBittorrent defaults may benefit SSD over HDD?

NEW
SSD Defaults -> aio_threads = 64 & checking_mem_usage = 256

NEW
HDD Defaults -> aio_threads = 8 & checking_mem_usage = 256

@arvidn if you get to experiment again with this feature, could you include aio_threads up to 1024, as that's the "MAX" in qBittorrent, and checking_mem_usage values of 16/32/64/128?

I think those numbers may be too high @xavier2k6. I have seen performance loss at aio_threads being that high. Keep in mind this is A IO.

@don4of4 they may seem too high, but they are there only as a suggestion - the main thing to take from the results from my point of view as a user is that the current aio_threads & checking_mem_usage default(s) should be different for SSD & HDD as currently it seems that the default(s) are not a "one size fits all".

I could also be totally misunderstanding this concept!

@xavier2k6

From the experiments in https://github.com/arvidn/libtorrent/wiki/disk-I-O-settings-for-checking-files, it looks like checking_mem_usage > 256 is irrelevant. Values < 256 were not tested, but probably they don't really matter - 256 corresponds to 4 MiB, which is already pretty low - no significant memory savings would come from lowering further (assuming this quantity is global and not per-torrent; I believe the former is the case, IIRC).
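For reference, the "256 corresponds to 4 MiB" figure works out as below, assuming checking_mem_usage is counted in 16 KiB blocks (that unit is what makes 256 come out to 4 MiB). The constant and function names are hypothetical and only illustrate the arithmetic.

```cpp
#include <cstdint>

// Assumed unit of checking_mem_usage: one 16 KiB block.
constexpr std::int64_t kBlockSizeBytes = 16 * 1024;

// Memory budget in bytes for a given checking_mem_usage value.
constexpr std::int64_t checking_mem_bytes(int blocks)
{
    return static_cast<std::int64_t>(blocks) * kBlockSizeBytes;
}

// Same budget expressed in whole MiB (rounded down).
constexpr std::int64_t checking_mem_mib(int blocks)
{
    return checking_mem_bytes(blocks) / (1024 * 1024);
}
```

So 256 blocks is 4 MiB, and even 1024 blocks would only be 16 MiB.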

But most importantly, it looks like bumping aio_threads above 8 (which is the same as saying the number of hashing threads above 2) is not a safe change in the general case, as it would hurt the performance of HDD users too much.

Still, bumping from 4 -> 8 would be a good change that would benefit SSD users without hurting HDD users.

Other than cat /sys/block/sdX/queue/rotational on Linux I don't know of a nice way of distinguishing between SSD/HDD, and that's a band-aid solution anyway - I think the real problem is this (last sentence in the test results):

Some more investigation need to go into understanding what happens at 3 hasher threads on a hard drive.

If the AIO disk I/O threads subsystem can be reworked to prevent this kind of performance degradation regardless of storage medium type, that would be ideal.
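For completeness, the Linux-only `rotational` check mentioned above could be sketched like this. It is the band-aid described, not a general solution: `parse_rotational_flag` and `is_rotational` are hypothetical helpers, the device name (e.g. "sda") has to be supplied by the caller, and treating an unreadable file as a spinning disk is an assumption chosen because the lower thread count is the safer one for HDDs.

```cpp
#include <fstream>
#include <string>

// The sysfs file contains "1" for rotational devices and "0" otherwise.
constexpr bool parse_rotational_flag(const char* text)
{
    return text != nullptr && text[0] == '1';
}

// Linux only. Reads /sys/block/<device>/queue/rotational.
// If the file cannot be read, conservatively reports a spinning disk (assumption).
inline bool is_rotational(const std::string& device)
{
    std::ifstream file("/sys/block/" + device + "/queue/rotational");
    std::string line;
    if (!std::getline(file, line)) {
        return true;
    }
    return parse_rotational_flag(line.c_str());
}
```

A client could use the result to pick between an HDD-friendly value such as 8 and a higher one for SSDs, but as noted above this only papers over the underlying degradation.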

256 corresponds to 4 MiB, which is already pretty low - no significant memory savings would come from lowering further (assuming this quantity is global and not per-torrent; I believe the former is the case, IIRC).

It's per checking torrent. so, typically one at a time.

The consensus thus far is:

  • 2 hashing threads is the best default right now (and it is the default in libtorrent 2.0). It provides a good performance boost on systems that use HDDs as well as on systems that use SSDs. Greater values exhibit performance regressions on systems with HDDs, while systems with SSDs seem to scale "indefinitely". For libtorrent <= 1.2, 2 hashing threads means setting aio_threads so that 8 <= aio_threads < 12. Presumably, 2 hashing threads would exhibit no advantage or even a regression in performance on single CPU core systems, but this is not a common use case nowadays (no one is seriously running real single-core systems anymore, and people who set up VMs to use a single core are expected to change the default).