mmstick/parallel

parallel multiple orders of magnitude slower than gnu parallel

nkh opened this issue · 13 comments

nkh commented

the tests below use 10,000 iterations, too few to see a gain with GNU parallel or Rust parallel, but enough to show the difference between them

I also piped the output to less rather than /dev/null; at around 4,500 entries, less displays the message "waiting for input"

The test uses --pipe and -q, which behave as expected; I verified the output on a smaller input set.

without parallelization

seq 10000 | time -v piper --global hi blue '\d+' red > /dev/null
Command being timed: "piper --global hi blue \d+ red"
User time (seconds): 0.06
System time (seconds): 0.00
Percent of CPU this job got: 94%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.07
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 7884
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1123
Voluntary context switches: 1
Involuntary context switches: 2
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

gnu parallel

seq 10000 | time -v parallel -q --pipe piper --global hi blue '\d+' red > /dev/null
Command being timed: "parallel -q --pipe piper --global hi blue \d+ red"
User time (seconds): 0.15
System time (seconds): 0.01
Percent of CPU this job got: 94%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.18
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 16448
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 18527
Voluntary context switches: 137
Involuntary context switches: 31
Swaps: 0
File system inputs: 0
File system outputs: 1544
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Rust parallel; note the "Minor (reclaiming a frame) page faults" line

seq 10000 | time -v rust-parallel -q --pipe piper --global hi blue '\d+' red > /dev/null
Command being timed: "rust-parallel -q --pipe piper --global hi blue \d+ red"
User time (seconds): 116.85
System time (seconds): 9.37
Percent of CPU this job got: 600%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:21.02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 7024
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 9793578
Voluntary context switches: 71623
Involuntary context switches: 65275
Swaps: 0
File system inputs: 0
File system outputs: 176
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

hugepage info

cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

mmstick commented

10,000,000 minor page faults is rather excessive, so that could well be the cause of this performance issue. I have no idea what would cause it, but I'll look into it when I have time. What is this piper command that you're using?

nkh commented

a Perl script that colorizes a stream. I can make it available if you want to test with the same executable, but I don't see how that would matter: it works fine with GNU parallel and there is nothing unusual about it (it's my own code).

Just let me know how I can help with the testing.

mmstick commented

If you're willing to test to see if it's caused by jemalloc, add the following at the top of main.rs and rebuild it. This will disable jemalloc and use the system allocator instead.

#![feature(alloc_system)]
extern crate alloc_system;

To test to see if it's caused by excessive locking/unlocking between threads, you can run the program with parallel -j 1. The output of running the command through perf trace would be helpful.

nkh commented

it doesn't build.

cargo build --release
Compiling parallel v0.6.2 (file:///home/nadim/nadim/devel/repositories/rust-parallel)
error[E0554]: #[feature] may not be used on the stable release channel
--> src/main.rs:1:1
|
1 | #![feature(alloc_system)]
| ^^^^^^^^^^^^^^^^^^^^^^^^^

error: aborting due to previous error

error: Could not compile parallel.

To learn more, run the command again with --verbose.

mmstick commented

rustup default nightly

nkh commented

OK, but I neither understand what that means nor what I need to do to run the tests you wanted.

mmstick commented

Rustup, the official Rust toolchain manager and installation method, lets you manage multiple toolchains and conveniently switch between them. Running rustup default nightly sets the active rustc and cargo to the nightly toolchain, which allows the #![feature] attribute. That attribute is what enables the system allocator here: the application will build without jemalloc and use the system allocator instead.

For debugging, you can simply get the output of

seq 1 1000 | sudo perf trace -v target/release/parallel echo {} > /dev/null

And for more extensive debugging, you can

seq 1 10000 | sudo perf record target/release/parallel echo {} > /dev/null

Followed by

sudo perf archive

And then by

tar xvf perf.data.tar.bz2 -C ~/.debug

Ultimately, on my system, most of the time is spent not in parallel itself but in the Linux kernel and externally-running processes like bash. seq 1 10000 | parallel echo {} > /dev/null takes 4s on my budget AMD laptop, whereas GNU Parallel takes 120s.

As an additional note, try setting your default shell to dash. Shells like zsh and bash can slow down command execution dramatically.

Shell | Memory (KB) | Time (seconds)
----- | ----------- | --------------
Zsh   | 3412        | 0.017
Bash  | 3444        | 0.014
Ion   | 2100        | 0.003
Dash  | 1560        | 0.001

Got around to pushing an update that helps performance a decent amount if you have dash installed. If dash is detected, it will be used as the default shell. If only one command argument is supplied, no shell is used at all. I've also updated the benchmark records, so it's not as fast as it was when I made the original benchmark, but it's close. The main reason for the remaining slowness is I/O from reading and writing to the disk in order to support large volumes of input arguments.

And now the shell will be disabled so long as none of your arguments contain & or ;, which would imply that you're running more than one command.

I'll be pushing out another performance-related patch soon. I've figured out a way to change the command token signature from Token::Argument(String) to Token::Argument(Cow<'a, str>), which means that instead of cloning each and every argument from the original String into a series of new Strings, I will take them by reference when possible.

To do so, I needed to coerce the original command String into a &'static str by leaking the command's memory, ensuring that it lives for the entire program's lifetime. This lets me share it across all threads, which requires the 'static lifetime. It shaves memory consumption from 7400 KB down to 3400 KB in the benchmark and slightly improves performance (from 6s to 5s).

It's safe to do this because I shadow the original String with the replacement &'static str, so the original can never be modified.

With this major change, performance has been improved greatly, and now matches the original performance that I had before. The benchmark has been updated again. Therefore, I will close this issue.

This is a note that I found compiling Parallel with the MUSL toolchain produces a significantly faster binary, so the benchmark has been updated again with a note to build Parallel with MUSL. Here are the times I'm getting; note that memory consumption and CPU time are halved:

	Command being timed: "./parallel echo"
	User time (seconds): 0.39
	System time (seconds): 2.18
	Percent of CPU this job got: 91%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.82
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1768
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 828343
	Voluntary context switches: 91062
	Involuntary context switches: 69057
	Swaps: 0
	File system inputs: 0
	File system outputs: 304
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Basically, all you need to do is:

rustup target add x86_64-unknown-linux-musl
cargo build --release --target x86_64-unknown-linux-musl