parallel multiple orders of magnitude slower than gnu parallel
nkh opened this issue · 13 comments
the tests below use 10,000 iterations, too few to see a gain with GNU parallel or rust parallel, but enough to show the difference between them
I also piped the output to less rather than /dev/null; at around 4,500 entries, less displays the message "waiting for input"
The test uses --pipe and -q, which behave as expected; I verified the output on a smaller input set.
without parallelization
seq 10000 | time -v piper --global hi blue '\d+' red > /dev/null
Command being timed: "piper --global hi blue \d+ red"
User time (seconds): 0.06
System time (seconds): 0.00
Percent of CPU this job got: 94%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.07
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 7884
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1123
Voluntary context switches: 1
Involuntary context switches: 2
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
gnu parallel
seq 10000 | time -v parallel -q --pipe piper --global hi blue '\d+' red > /dev/null
Command being timed: "parallel -q --pipe piper --global hi blue \d+ red"
User time (seconds): 0.15
System time (seconds): 0.01
Percent of CPU this job got: 94%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.18
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 16448
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 18527
Voluntary context switches: 137
Involuntary context switches: 31
Swaps: 0
File system inputs: 0
File system outputs: 1544
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
rust parallel; note the "Minor (reclaiming a frame) page faults" count
seq 10000 | time -v rust-parallel -q --pipe piper --global hi blue '\d+' red > /dev/null
Command being timed: "rust-parallel -q --pipe piper --global hi blue \d+ red"
User time (seconds): 116.85
System time (seconds): 9.37
Percent of CPU this job got: 600%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:21.02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 7024
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 9793578
Voluntary context switches: 71623
Involuntary context switches: 65275
Swaps: 0
File system inputs: 0
File system outputs: 176
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
hugepage info
cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
10,000,000 minor page faults is rather excessive, so that could be the cause of this performance issue. I have no idea what would cause this, but I'll look into it later when I have time. What's this piper command that you are using?
A Perl script that colorizes a stream. I can make it available if you want to test with the same executable, but I don't see how that would matter: it works fine with GNU parallel, and there's nothing specific to it (it's my code).
Just let me know how I can help with the testing.
If you're willing to test whether it's caused by jemalloc, add the following at the top of main.rs and rebuild. This will disable jemalloc and use the system allocator instead.
#![feature(alloc_system)]
extern crate alloc_system;
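As a side note, on later stable Rust the same switch can be made without a nightly feature gate. This is a minimal sketch, assuming a Rust version with `#[global_allocator]` support (1.28+), not the 2017-era toolchain this thread targets:

```rust
use std::alloc::System;

// Route all heap allocations through the system allocator (libc malloc/free)
// instead of the crate's default allocator.
#[global_allocator]
static GLOBAL: System = System;

fn main() {
    // This Vec is now allocated by the system allocator.
    let v: Vec<u32> = (0..4).collect();
    println!("{}", v.iter().sum::<u32>()); // prints 6
}
```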
To test whether it's caused by excessive locking/unlocking between threads, you can run the program with `parallel -j 1`. The output of running the command through `perf trace` would also be helpful.
It doesn't build:
cargo build --release
Compiling parallel v0.6.2 (file:///home/nadim/nadim/devel/repositories/rust-parallel)
error[E0554]: #[feature] may not be used on the stable release channel
--> src/main.rs:1:1
|
1 | #![feature(alloc_system)]
| ^^^^^^^^^^^^^^^^^^^^^^^^^
error: aborting due to previous error
error: Could not compile `parallel`.
To learn more, run the command again with --verbose.
rustup default nightly
OK, but I neither understand what that means nor what I need to do to run the tests you wanted.
Rustup, the official Rust toolchain manager and installation method, lets you manage multiple toolchains and conveniently switch between them. Running `rustup default nightly` sets the active rustc and cargo to the nightly toolchain, which allows the `#![feature]` attribute. That attribute is being used here to select the system allocator: the application is built without jemalloc and uses the system allocator instead.
For debugging, you can simply get the output of

seq 1 1000 | sudo perf trace -v target/release/parallel echo {} > /dev/null

And for more extensive debugging, you can run

seq 1 10000 | sudo perf record target/release/parallel echo {} > /dev/null

followed by

sudo perf archive

and then

tar xvf perf.data.tar.bz2 -C ~/.debug
Ultimately, on my system, most of the time is spent not in parallel but in the Linux kernel and externally running processes like bash. `seq 1 10000 | parallel echo {} > /dev/null` takes 4 s on my budget AMD laptop, whereas GNU Parallel takes 120 s.
As an additional note, try setting your default shell to dash. Shells like zsh and bash can drastically slow down command execution.
| Shell | Memory (KB) | Time (seconds) |
| ----- | ----------- | -------------- |
| Zsh   | 3412        | 0.017          |
| Bash  | 3444        | 0.014          |
| Ion   | 2100        | 0.003          |
| Dash  | 1560        | 0.001          |
Got around to pushing an update that will help performance a decent amount if you have dash installed. If dash is detected, the default shell will be designated as dash. If only one command argument is supplied, then no shell will be used. I've also updated the benchmark records, so it's not as fast as it used to be when I made the original benchmark, but it's close. The main reason for the slowness now is due to I/O from reading and writing to the disk in order to support large volumes of input arguments.
And now the shell will be disabled so long as none of your arguments contain `&` or `;`, which would imply that you're running more than one command.
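The heuristic described above could be sketched roughly as follows. Note that `needs_shell` is an illustrative name for this sketch, not the actual function in the crate:

```rust
// Only fall back to spawning a shell when an argument looks like it
// contains more than one command (a `;` separator or a `&` background job);
// otherwise the command can be exec'd directly.
fn needs_shell(args: &[&str]) -> bool {
    args.iter().any(|arg| arg.contains(';') || arg.contains('&'))
}

fn main() {
    assert!(!needs_shell(&["echo", "hello"]));    // direct exec, no shell
    assert!(needs_shell(&["echo hi; echo bye"])); // compound command
    assert!(needs_shell(&["sleep 1 & wait"]));    // background job
    println!("ok");
}
```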
I'll be pushing out another performance-related patch soon. I've figured out a method that allows me to change the command token argument signature from `Token::Argument(String)` to `Token::Argument(Cow<'a, str>)`, which means that instead of cloning each and every argument from the original `String` into a series of new `String`s, I will take them by reference when possible.
To do so, I needed to coerce the original command `String` into a `&'static str` by leaking the command's memory, ensuring that it lives through the entire program's lifetime. This allows me to share it across all threads, which requires the `'static` lifetime. It shaves memory consumption from 7400 KB down to 3400 KB for the benchmark, and slightly increases performance (from 6 s to 5 s).

It's safe to do this because I am shadowing the original `String` with the replacement `&'static str`, ensuring that the original can never be modified.
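The leak-and-borrow idea can be sketched like this. The `Token` enum and variable names here are assumptions for illustration, not the crate's actual definitions:

```rust
use std::borrow::Cow;

// Hypothetical token type mirroring the change described above.
enum Token<'a> {
    Argument(Cow<'a, str>),
}

fn main() {
    let command = String::from("echo {}");
    // Leak the String so the borrow lives for the whole program ('static),
    // which lets every worker thread share it without cloning. The original
    // binding is shadowed, so the backing memory can never be mutated.
    let command: &'static str = Box::leak(command.into_boxed_str());

    // Borrow a slice of the leaked command instead of allocating a new String.
    let token = Token::Argument(Cow::Borrowed(&command[..4]));
    match token {
        Token::Argument(Cow::Borrowed(s)) => println!("borrowed: {}", s),
        Token::Argument(Cow::Owned(s)) => println!("owned: {}", s),
    }
}
```

`Cow::Owned` is only produced when a substitution forces an allocation; unchanged arguments stay borrowed.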
With this major change, performance has been improved greatly, and now matches the original performance that I had before. The benchmark has been updated again. Therefore, I will close this issue.
A note: I found that compiling Parallel with the MUSL toolchain produces a significantly faster binary, so the benchmark has been updated again, with a note to build Parallel with MUSL. Here are the times I'm getting; note that memory consumption and CPU time are halved:
Command being timed: "./parallel echo"
User time (seconds): 0.39
System time (seconds): 2.18
Percent of CPU this job got: 91%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.82
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1768
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 828343
Voluntary context switches: 91062
Involuntary context switches: 69057
Swaps: 0
File system inputs: 0
File system outputs: 304
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Basically, all you need to do is:

rustup target add x86_64-unknown-linux-musl
cargo build --release --target x86_64-unknown-linux-musl