JuliaLang/julia

Single threaded performance regression in FFT in Julia 0.5 RC3

mgr327 opened this issue · 11 comments

There seems to be a performance regression in fft as demonstrated below:

slowdown2.jl

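function ch(nsteps, u)
    w = complex(u)      # complex copy of the real input

    # In-place forward and inverse FFT plans for w
    p = plan_fft!(w)
    q = plan_ifft!(w)

    # Repeated forward/inverse transforms; this loop dominates the run time
    for n in 1:nsteps
        w = p*w
        w = q*w
    end
    w
end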

srand(1)
u = rand(2^16)
ch(5, u)
@time ch(100, u)

Julia 0.4.6:

_% /usr/bin/julia slowdown2.jl
  0.299310 seconds (246 allocations: 1.015 MB)

Julia 0.5 RC3:

_% /usr/local/julia/bin/julia slowdown2.jl 
  3.104827 seconds (271 allocations: 1.015 MB)
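
For scale, the two timings above differ by roughly a factor of ten. A quick sketch of the arithmetic, with both values copied from the output above (the variable names are only for illustration):

```julia
t_046 = 0.299310  # seconds, Julia 0.4.6
t_rc3 = 3.104827  # seconds, Julia 0.5 RC3 (official binary)

slowdown = t_rc3 / t_046
println(slowdown)  # a bit over 10
```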

Here is the versioninfo:

Julia 0.4.6:

_% /usr/bin/julia -e 'versioninfo()'
Julia Version 0.4.6
Commit 2e358ce* (2016-06-19 17:16 UTC)
Platform Info:
  System: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblasp.so.0
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Julia 0.5 RC3:

_% /usr/local/julia/bin/julia -e 'versioninfo()'
Julia Version 0.5.0-rc3+0
Commit e6f843b (2016-08-22 23:43 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, westmere)

Duplicate of #17000

In the example above, both Julia 0.4.6 and Julia 0.5 RC3 run single-threaded (according to 'top -H' while running 'ch(10000, u)'), so the problem seems to be different from the one discussed in #17000.

#17000 discusses two different issues, but only the first one is mentioned in the title of #17000:
a) enabling multithreading by default; this needs a decision
b) lack of optimization of FFTW due to buildbot misconfiguration; just needs to be fixed, and perhaps backported to 0.4 and 0.5
Perhaps the title of this issue could be changed to:
"Single threaded performance regression in FFT in Julia 0.5 RC3"?
Furthermore, could you check whether you see the same performance regression if you compile Julia and its dependencies from source instead of using the binary?

See the comment I linked (and the few below it). Check unsafe_string(cglobal((:fftw_cc, FFTW.libfftw), UInt8)) first before recompiling stuff.

If we're talking about single-threaded performance, then this is separate from #17000. And if it's entirely due to the buildbot misconfiguration that dropped the optimization flags, then that should already be fixed (for all branches, so no need to backport anything), but no new binaries have been built yet to verify it.

With regard to the questions asked earlier:

(a) Self-compiled Julia 0.5 RC3 is as fast as Julia 0.4.6:

git clone git://github.com/JuliaLang/julia.git
cd julia/
git checkout release-0.5
make
_% /scratch/julia/julia -e 'versioninfo()'
Julia Version 0.5.0-rc3+0
Commit e6f843b (2016-08-22 23:43 UTC)
Platform Info:
  System: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, westmere)
/scratch/julia/julia slowdown2.jl 
  0.273984 seconds (271 allocations: 1.015 MB)

Just for reference, the downloaded "official" RC3 binaries show the regression:

_% /usr/local/julia/bin/julia slowdown2.jl
  3.083323 seconds (271 allocations: 1.015 MB)

(b) Compilation flags:

Julia 0.4.6:
_% /usr/bin/julia -e '@printf "%s\n" bytestring(cglobal((:fftw_cc, FFTW.libfftw), UInt8))'
gcc -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic
Self-compiled Julia 0.5 RC3:
_% /scratch/julia/julia -e '@printf "%s\n" unsafe_string(cglobal((:fftw_cc, FFTW.libfftw), UInt8))'
gcc -m64  -O3 -fomit-frame-pointer -mtune=native -malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math
"Official" RC3 binaries:
_% /usr/local/julia/bin/julia -e '@printf "%s\n" unsafe_string(cglobal((:fftw_cc, FFTW.libfftw), UInt8))'
gcc -march=x86-64 -m64  -I/home/centos/local/include
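
To make the comparison above explicit, here is a minimal sketch that checks whether a recorded FFTW compiler command line carries an optimization-level flag. Note that has_opt_flag is a hypothetical helper written for this comment, not part of Julia or FFTW, and the list of flags it looks for is an assumption about what counts as "optimized". The two strings are copied from the output above.

```julia
# Hypothetical helper (not part of Julia or FFTW): true if the compiler
# command line contains one of the common GCC optimization-level flags.
has_opt_flag(cc) = any(f -> f in ("-O1", "-O2", "-O3", "-Ofast"), split(cc))

# Strings copied from the fftw_cc output in this thread.
self_built   = "gcc -m64  -O3 -fomit-frame-pointer -mtune=native -malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math"
official_rc3 = "gcc -march=x86-64 -m64  -I/home/centos/local/include"

println(has_opt_flag(self_built))    # true
println(has_opt_flag(official_rc3))  # false
```

The official RC3 binary is the only one of the three builds whose FFTW was compiled without any -O flag, which matches the timings.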

I've pushed a new configuration change that should clear out CFLAGS and CPPFLAGS on the builders, explicitly for the make step via the buildbot. Looking at the environment variables printed at the top of every step's logfile, the currently running build is looking good.

EDIT: Nope, did it wrong, building anew, with a nuke to ensure FFTW is rebuilt.

BAM. It's working:

$ ./julia-e6f843b073/bin/julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.0-rc3+0 (2016-08-22 23:43 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-unknown-linux-gnu

julia> @printf "%s\n" unsafe_string(cglobal((:fftw_cc, FFTW.libfftw), UInt8))
gcc -march=x86-64 -m64  -O3 -fomit-frame-pointer -mtune=native -malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math

This particular build is available here. My personal feeling is that this isn't worth doing a new RC3 binary for, and we'll just let this roll out with RC4.

@mgr327 Thank you for your attention to detail here!

Is there anything left to do here, now that the buildbots are fixed?