mmstick/parallel

Performs slower than GNU parallel when transparent huge pages are enabled

Shnatsel opened this issue · 15 comments

The benchmark in README.md as of version 0.5.0 claims time 0:04 for Rust parallel vs 0:54 for GNU parallel. However, this benchmark is misleading because the command line used is completely useless:

`seq 1 10000 | time -v parallel echo > /dev/null` would never print anything because it lacks parameter substitution. The correct command, which actually does something, would be `seq 1 10000 | time -v parallel echo '{}' > /dev/null`.

On my machine, the actually useful command measures 1:02 with Rust parallel vs 0:33 with GNU parallel.

Parameter substitution isn't required. When you don't provide a placeholder token, it is inferred automatically, so `parallel echo '{}'` is equivalent to `parallel echo`. I'm not sure why it would be slower on your system. Either command gives me the same results with my version, with GNU Parallel taking 40x longer.

What processor do you have? Is it AMD or Intel? Cores? Hyper-threading?

Here's my /proc/cpuinfo

That kind of difference would more likely stem from a rustc difference than a CPU difference.

$ rustc --version
rustc 1.11.0 (9b21dcd6a 2016-08-15)

Parallel is built from git with cargo build --release

I've double-checked and I am indeed using version 0.5.0 of Rust parallel

Forcing the CPU into its highest frequency only makes GNU parallel a bit faster (again a 3x difference) and does not affect Rust parallel.

Explicitly passing -j 4 does not change anything either.

I do believe that it is a rustc or LLVM bug. Sadly, I've also noticed that my AMD FX 8120 performs abysmally for some reason, even slower than my mobile Intel CPU, and others have noted the same behavior. Intel processors aren't exhibiting this issue; my benchmarks were taken on an i5-2410M CPU @ 2.30GHz using the performance governor.

I've reported the bug here: rust-lang/rust#36705. Maybe someone who works closer to the low-level side of Rust can give some insight into why AMD hardware is executing much slower than Intel hardware.

Here's a recorded sysprof session so whoever investigates this can see where the time is spent

sysprof-log-for-weird-performance.zip

@Shnatsel / @mmstick: Just to be safe, are your AMD users compiling with optimizations turned on, i.e. `cargo build --release`? If not, perhaps enabling native CPU optimizations with `RUSTFLAGS="-C target-cpu=native" cargo build --release` might actually get the program properly optimized. That could help lead us towards what's going on if it's still performing poorly.

Nope, enabling optimizations like that didn't help.

export RUSTFLAGS="-C target-cpu=native"
cargo build --release

Still slow.

I can also confirm that my AMD systems see no improvement from enabling native optimizations. I have perf data from both my Intel laptop and AMD desktop with both debug and release builds on the associated Rust issue: rust-lang/rust#36705

The solution to the problem is for Linux distributions to change the default transparent huge pages setting from `always` to `madvise`, as Solus did some time ago after experiencing issues with `always`: https://git.solus-project.com/packages/kernel/commit/?id=40b3b940348ce91ca7c03278f7f238a66883ad8f

Therefore, this should be reported as a bug against your Linux distribution.
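For anyone wanting to check whether their machine is affected, something like the following shows the current policy and how to switch it until the next reboot (the sysfs path assumes a Linux kernel built with `CONFIG_TRANSPARENT_HUGEPAGE`):

```shell
# Show the current transparent huge pages policy; the active
# value is printed in brackets, e.g. "always [madvise] never".
THP=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$THP" ]; then
    cat "$THP"
else
    echo "transparent huge pages not available on this kernel"
fi

# To switch to madvise until the next reboot (requires root):
#   echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```

Note that writing to the sysfs knob is not persistent; making `madvise` the default across reboots is exactly the kernel-packaging change the distributions linked above made.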

For another datapoint, if you need to convince your distro, Fedora switched to madvise in 2014:
http://pkgs.fedoraproject.org/cgit/rpms/kernel.git/commit/?id=9a031d5070d9f8f5916c48637bd0c237cd52eaf9

I should clarify: I didn't mention the year for one-upmanship, but to indicate experience with the change. If Solus only made this change a few weeks ago, other distros may wonder whether they've fully experienced the fallout of that change. Fedora's longer period with madvise should inspire more confidence.

Considering Debian has been using it since 2012, I wonder why Ubuntu didn't decide to do the same.