Enable FFTW threading by default (to match up to performance of octave and others)

Question

loganwilliams opened this issue 9 years ago · 64 comments

I've noticed that Julia is an order of magnitude slower to compute FFTs than GNU Octave. This discrepancy in speed confuses me, given that both Octave and Julia ought to be calling the same FFTW library. Is this expected?

Times for Julia:

julia> R = rand(512,512);
julia> @time fft(R);
  0.042149 seconds (76 allocations: 8.003 MB)

julia> R = rand(5000,5000);
julia> @time fft(R);
  6.212666 seconds (76 allocations: 762.943 MB, 1.17% gc time)

Times for Octave:

>> R = rand(512,512);
>> tic; fft2(R); toc;
Elapsed time is 0.00377011 seconds.

>> R = rand(5000,5000);
>> tic; fft2(R); toc;
Elapsed time is 0.556037 seconds.

After calling FFTW.set_num_threads(2) and using rfft instead of fft, I saw a small improvement in Julia's performance, but a large discrepancy still remains.

julia> R = rand(512,512);
julia> @time rfft(R);
  0.018692 seconds (93 allocations: 2.012 MB)

julia> R = rand(5000,5000);
julia> @time rfft(R);
  1.736385 seconds (99 allocations: 190.816 MB, 0.95% gc time)

I have reproduced this issue on my personal computer (OS X 10.11.5), and on a Google Compute Engine VM running Ubuntu 16.04. Here is my Julia versioninfo():

Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Answer 1 · 2016-06-18T03:32:42.000Z

Please read: http://docs.julialang.org/en/release-0.4/manual/performance-tips/
In particular: 1) put the code you want to benchmark into functions 2) run this function twice, because when you call it the first time it is getting compiled.
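
A minimal sketch of those two tips, assuming a current Julia where fft is provided by the FFTW.jl package (in 0.4 it was part of Base). The helper name time_fft is hypothetical and not part of any library.

```julia
using FFTW  # assumption: current Julia; in 0.4 `fft` lived in Base

# Hypothetical helper: time `f(x)` after one warm-up call, so that JIT
# compilation and FFTW planning are not counted in the reported time.
function time_fft(f, x; repeats = 3)
    f(x)                                        # warm-up call, result discarded
    times = [@elapsed(f(x)) for _ in 1:repeats] # timed calls
    return minimum(times)                       # best of `repeats` runs
end

R = rand(512, 512)
t = time_fft(fft, R)
println("fft of a 512x512 matrix: ", t, " s")
```

Taking the minimum over a few runs reduces noise from garbage collection and other processes; it does not change what is being measured.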

Answer 2 · 2016-06-18T03:38:18.000Z

I had already read both tips. Wrapping fft(R) (which is a single function call) in a second function did nothing to improve performance. I also already ran each timing statement twice -- I just excluded the first, unhelpful, timing result for brevity's sake.

Answer 3 · 2016-06-18T04:01:20.000Z

Cc @stevengj

Answer 4 · 2016-06-18T12:40:03.000Z

Hello,
I benchmarked your example on my computer. My hardware: i7-2600K CPU @ 3.40GHz × 4
OS: Ubuntu Linux 14.04, 64 bits
CPU governor: performance (!)
Results:
R=rand(512,512)
Julia, 1 thread rfft(R): 1.6 - 1.7 ms
Julia, 2 threads rfft(R): 1.3 ms
Julia, 4 threads rfft(R): 1.1 ms
Octave fft2(R): 1.1 ms

R=rand(5000,5000)
Julia, 1 thread rfft(R): 0.31 .. 0.32 s
Julia, 2 threads rfft(R): 0.17 s
Julia, 4 threads rfft(R): 0.10 .. 0.11 s
Octave fft2(R): 0.17 s

Summary: Julia with two threads is about as fast as Octave. Julia with four threads is
much faster than Octave, but only for large problems.

Versioninfo:
Julia Version 0.4.2
Commit bb73f34 (2015-12-06 21:47 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
WORD_SIZE: 64
BLAS: libopenblas (NO_AFFINITY SANDYBRIDGE)
LAPACK: liblapack.so.3
LIBM: libopenlibm
LLVM: libLLVM-3.3

GNU Octave, version 3.8.1
Copyright (C) 2014 John W. Eaton and others.
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. For details, type 'warranty'.

Octave was configured for "x86_64-pc-linux-gnu".

Answer 5 · 2016-06-18T12:49:32.000Z

I upgraded to Julia 0.4.5 from the Ubuntu ppa, but the timing results do not change.
julia> versioninfo()
Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
WORD_SIZE: 64
BLAS: libopenblas (NO_AFFINITY SANDYBRIDGE)
LAPACK: liblapack.so.3
LIBM: libopenlibm
LLVM: libLLVM-3.3

Answer 6 · 2016-06-18T17:45:55.000Z

Hmmm. Yesterday, I had thought I was able to reproduce this on a VM, but I must have been mistaken, because attempting it again now on an Ubuntu Google Cloud instance I get the same results that you have shown.

Answer 7 · 2016-06-18T18:18:00.000Z

You are right, I had a typo. It is 0.17 s for the 5000x5000 matrix with Octave. Does this mean that the issue can be closed, or is there still a problem on OS X?

Answer 8 · 2016-06-18T18:20:58.000Z

I'm still experiencing the issue on my computer. Is there any additional information I can provide to help diagnose?

Answer 9 · 2016-06-18T19:24:38.000Z

Since this sort of issue has come up a few times, perhaps we should add documentation to the various FFT functions in Julia about what they are comparable to in other languages (Matlab, R, etc.).

Answer 10 · 2016-06-18T19:27:14.000Z

I agree it should be documented better but I don't think that was ever the issue here. (The very first post of this issue uses fft in julia and fft2 in octave).
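
For readers comparing the two languages, here is a small sketch of how the Julia calls relate to each other and to the Octave functions. It assumes a current Julia with the FFTW.jl package; the array size is arbitrary.

```julia
using FFTW  # assumption: current Julia with the FFTW.jl package

R = rand(8, 6)

F     = fft(R)     # transforms along every dimension, like Octave's fft2(R)
Fcols = fft(R, 1)  # transforms along dimension 1 only, like Octave's fft(R)
H     = rfft(R)    # real-input transform: stores only the non-redundant half

# rfft keeps the first (8 ÷ 2 + 1) = 5 rows of the full 2-D transform
println(size(H) == (5, 6))
println(H ≈ F[1:5, :])
```

So fft in Julia versus fft2 in Octave compares the same transform, while rfft does roughly half the work for real input.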

Answer 11 · 2016-06-19T04:20:12.000Z

Alright, I found another Mac OS X computer to test this on, but it was quite old, running OS X 10.8.5.

for 512x512 matrix
.0089 seconds on average in Julia (4 threads, using rfft)
.0070 seconds on average in Octave (default threads (not sure what that is) using fft2)

for 5000x5000 matrix
1.27 seconds on average in Julia
1.19 seconds on average in Octave

Here, Julia seems just slightly slower than Octave.

Julia version info:

Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i5-3330S CPU @ 2.70GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Octave version info: (running an older version because that was all I could get running quickly on OS 10.8.5.)

----------------------------------------------------------------------
GNU Octave Version 3.2.3
GNU Octave License: GNU General Public License
Operating System: Darwin 12.5.0 Darwin Kernel Version 12.5.0: Sun Sep 29 13:33:47 PDT 2013; root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64
----------------------------------------------------------------------
no packages installed.

My computer running OS X 10.11.5 continues to exhibit the order of magnitude performance difference. Can anyone else reproduce this?

Answer 12 · 2016-06-19T04:27:29.000Z

Here's the result of profiling the rfft of a 5000x5000 matrix on my 10.11.5 computer: http://pastebin.com/nmr5Nyvn

Answer 13 · 2016-06-19T12:00:31.000Z

Julia on my MacBook 3x slower than Octave

julia> versioninfo()
Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM)2 Duo CPU     P7350  @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
julia> FFTW.set_num_threads(1);R=rand(5000,5000);@time a=rfft(R);
  7.072357 seconds (99 allocations: 190.816 MB, 0.32% gc time)

julia> FFTW.set_num_threads(2);R=rand(5000,5000);@time a=rfft(R);
  3.810004 seconds (99 allocations: 190.816 MB, 0.65% gc time)

octave:5> R=rand(5000,5000); tic, a=fft2(R); toc
Elapsed time is 1.35603 seconds.
octave:6> ver
----------------------------------------------------------------------
GNU Octave Version 3.8.2
GNU Octave License: GNU General Public License
Operating System: Darwin 15.5.0 Darwin Kernel Version 15.5.0: Tue Apr 19 18:36:36 PDT 2016; root:xnu-3248.50.21~8/RELEASE_X86_64 x86_64
----------------------------------------------------------------------
no packages installed.

Maybe of interest, size of libraries in /usr/local/octave/3.8.2/lib

# ls -al libfftw3f.3.dylib 
-rwxr-xr-x  1 root  admin  1678704 20 Elo  2014 libfftw3f.3.dylib

and `/Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia`

# ls -al libfftw3f.3.dylib
-rwxr-xr-x@ 1 jaakko  admin  6666908 18 Maa 03:12 libfftw3f.3.dylib

Used installation binaries
https://sourceforge.net/projects/octave/files/Octave%20MacOSX%20Binary/2014-09-25-Binary-of-GNU-Octave-3.8.2-for-OSX-10.9.5/
https://s3.amazonaws.com/julialang/bin/osx/x64/0.4/julia-0.4.5-osx10.7+.dmg

Answer 14 · 2016-06-19T18:01:06.000Z

Replacing the FFTW libraries from the Julia Mac OS X package with the FFTW libraries from the Octave Mac OS X package fixes the issue. Julia is now faster than Octave.

julia> R = rand(512,512);

julia> FFTW.set_num_threads(2);

julia> @time rfft(R);
  0.347049 seconds (390.80 k allocations: 19.633 MB, 1.41% gc time)

julia> @time rfft(R);
  0.001880 seconds (93 allocations: 2.012 MB)

julia> @time rfft(R);
  0.002026 seconds (93 allocations: 2.012 MB)

julia> @time rfft(R);
  0.003032 seconds (93 allocations: 2.012 MB, 788.36% gc time)

julia> R = rand(5000,5000);

julia> @time rfft(R);
  0.339954 seconds (99 allocations: 190.816 MB, 0.27% gc time)

julia> @time rfft(R);
  0.337303 seconds (99 allocations: 190.816 MB, 2.07% gc time)

julia> @time rfft(R);
  0.330447 seconds (99 allocations: 190.816 MB, 9.41% gc time)

Answer 15 · 2016-06-19T18:34:01.000Z

It would seem interesting to know the difference in how the two libraries were compiled.

Answer 16 · 2016-06-20T06:58:47.000Z

The speed of rfft on my machine with the default Julia FFTW libraries is close to @loganwilliams 's post above

julia> FFTW.set_num_threads(2);

julia> R = rand(512,512);

julia> @time rfft(R);
  0.003687 seconds (88 allocations: 2.013 MB)

julia> @time rfft(R);
  0.003386 seconds (88 allocations: 2.013 MB)

julia> @time rfft(R);
  0.003434 seconds (88 allocations: 2.013 MB)

julia> @time rfft(R);
  0.003323 seconds (88 allocations: 2.013 MB)
julia> R = rand(5000,5000);

julia> @time rfft(R);
  0.257353 seconds (93 allocations: 190.816 MB, 0.50% gc time)

julia> @time rfft(R);
  0.254915 seconds (93 allocations: 190.816 MB, 6.54% gc time)

julia> @time rfft(R);
  0.271649 seconds (93 allocations: 190.816 MB, 36.09% gc time)

julia> @time rfft(R);
  0.270884 seconds (93 allocations: 190.816 MB, 0.61% gc time)

julia> @time rfft(R);
  0.274651 seconds (93 allocations: 190.816 MB, 0.59% gc time)

julia> @time rfft(R);
  0.268775 seconds (93 allocations: 190.816 MB, 0.62% gc time)

The above test was run on a master build of Julia that is a few days old.
I updated my Julia to the latest today but the performance is similar.

julia> versioninfo()
Julia Version 0.5.0-dev+4877
Commit 02ac2b1* (2016-06-20 22:32 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

octave:1>  R=rand(5000,5000); tic, a=fft2(R); toc
Elapsed time is 0.22766 seconds.
octave:2>  R=rand(5000,5000); tic, a=fft2(R); toc
Elapsed time is 0.201882 seconds.
octave:3>  R=rand(5000,5000); tic, a=fft2(R); toc
Elapsed time is 0.173801 seconds.
octave:4> ver
----------------------------------------------------------------------
GNU Octave Version 3.8.1
GNU Octave License: GNU General Public License
Operating System: Linux 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:43:14 UTC 2015 x86_64
----------------------------------------------------------------------
Package Name  | Version | Installation directory
--------------+---------+-----------------------
          io  |   2.2.9 | /home/guo/octave/io-2.2.9
         mpi *|   1.1.1 | /usr/share/octave/packages/mpi-1.1.1
  statistics  |   1.2.4 | /home/guo/octave/statistics-1.2.4

Answer 17 · 2016-06-20T15:17:36.000Z

@zhmz90: Could you please first mention which computer (CPU, clock speed) you use, and second, also test the speed with Octave?

Answer 18 · 2016-06-21T05:50:58.000Z

@ufechner7 I have added my test to the above post. The result shows fft2 in Octave is faster than rfft in Julia. In Julia, I set FFTW.set_num_threads(2), while in Octave I did nothing, since I am not familiar with Octave.

Answer 19 · 2016-06-22T04:47:26.000Z

@zhmz90: Is it a mac or Linux machine? If Linux, which distribution/ version?

Answer 20 · 2016-06-22T04:53:02.000Z

@ufechner7

guo@x02:~$ uname -a
Linux x02 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:43:14 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Answer 21 · 2016-06-23T18:44:51.000Z

Who maintains the OS X package distribution?

Answer 22 · 2016-06-24T01:14:38.000Z

Which one? Exactly how did you install Julia, and which package distribution are you referring to?

Answer 23 · 2016-06-24T02:03:09.000Z

@tkelman Uh, the one displayed very prominently on Julia's web page: http://julialang.org/downloads/

Exact version info is in my first post.

I have resolved my personal issue by replacing the libraries with libraries from Octave's distribution, but as @timholy noted, "It would seem interesting to know the difference in how the two libraries were compiled."

Answer 24 · 2016-06-24T11:55:07.000Z

Just wanted to check that you weren't getting it from homebrew or similar. So that build is produced from running a complete source build on our mac buildbots. Our makefile flags for fftw can be found under deps (maybe a handful of related flags in Make.inc but I think those are mostly enabling or disabling different dependencies).

Answer 25 · 2016-07-07T23:46:45.000Z

I can report similar performance issues on my machine (Mac OS X 10.11.5, 4GHz Intel Core i7). Octave is about 4 times faster than Julia to compute FFTs. As @loganwilliams suggested I copied the fftw libraries from the Octave package and this improved things. But Julia is still about 50% slower than Octave. See below for the results.

In Octave (freshly installed from: https://sourceforge.net/projects/octave/files/Octave%20MacOSX%20Binary/2016-06-06-binary-octave-4.0.2/octave_gui_402.dmg/download )

>> fftw('threads')
ans =  8
>> x = randn(5000,5000); tic; y=fft2(x); toc
Elapsed time is 0.348789 seconds.
>> x = randn(5000,5000); tic; y=fft2(x); toc
Elapsed time is 0.37518 seconds.
>> x = randn(5000,5000); tic; y=fft2(x); toc
Elapsed time is 0.369055 seconds.
>> ver
----------------------------------------------------------------------
GNU Octave Version: 4.0.2
GNU Octave License: GNU General Public License
Operating System: Darwin 15.5.0 Darwin Kernel Version 15.5.0: Tue Apr 19 18:36:36 PDT 2016; root:xnu-3248.50.21~8/RELEASE_X86_64 x86_64
----------------------------------------------------------------------
no packages installed.

Julia with FFTW shipping with Julia package: https://s3.amazonaws.com/julialang/bin/osx/x64/0.4/julia-0.4.6-osx10.7+.dmg

julia> FFTW.set_num_threads(8)

julia> x = randn(5000,5000); @time y=fft(x);
  1.178509 seconds (77 allocations: 762.943 MB, 2.92% gc time)

julia> x = randn(5000,5000); @time y=fft(x);
  1.201879 seconds (76 allocations: 762.943 MB, 4.64% gc time)

julia> x = randn(5000,5000); @time y=fft(x);
  1.327285 seconds (76 allocations: 762.943 MB, 6.36% gc time)

julia> versioninfo()
Julia Version 0.4.6
Commit 2e358ce (2016-06-19 17:16 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Julia with FFTW from octave package.

julia> FFTW.set_num_threads(8)

julia> x = randn(5000,5000); @time y=fft(x);
  0.512738 seconds (76 allocations: 762.943 MB, 20.49% gc time)

julia> x = randn(5000,5000); @time y=fft(x);
  0.506314 seconds (76 allocations: 762.943 MB, 12.40% gc time)

julia> x = randn(5000,5000); @time y=fft(x);
  0.503848 seconds (76 allocations: 762.943 MB, 16.77% gc time)

Answer 26 · 2016-07-21T19:14:41.000Z

If it helps anyone, I had the same issue and I copied the fftw3* library files from julia 0.4.5 to julia 0.4.6 and recovered similar runtime to Matlab.

Answer 27 · 2016-08-02T06:39:40.000Z

I have a related question: isn't Julia starting up only with 1 FFTW thread by default? If so, why is that done when OpenBLAS is made to start with multiple threads?

Answer 28 · 2016-08-02T07:25:41.000Z

I did a threadwise comparison of the 2 functions, rfft in Julia and fft2 in Octave. Below are the timings in seconds.

Threads | Julia 0.5 `rfft` (s) | Octave `fft2` (s)
--------+----------------------+------------------
1       | 0.707015             | 0.91564
2       | 0.346071             | 0.640057
4       | 0.319295             | 0.57852
8       | 0.323513             | 0.517388

julia> versioninfo()
Julia Version 0.5.0-pre+5607
Commit b510ad9 (2016-07-22 05:36 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.4.0)
  CPU: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)

On my machine by default Octave starts with 4 threads and Julia starts with just 1, like @ranjanan mentioned above.

To find the number of threads and to change the same in Octave, use fftw("threads") and fftw("threads", 2).
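
On the Julia side, a comparable per-thread sweep can be sketched as follows. This assumes a current Julia with the FFTW.jl package; the matrix size and the thread counts are arbitrary choices for illustration, not the ones used for the table above.

```julia
using FFTW  # assumption: current Julia with the FFTW.jl package

R = rand(2000, 2000)
timings = Dict{Int,Float64}()

for n in (1, 2, 4)
    FFTW.set_num_threads(n)        # applies to plans created after this call
    rfft(R)                        # warm-up: compile and plan for this setting
    timings[n] = @elapsed rfft(R)  # timed run
end

for n in sort(collect(keys(timings)))
    println(n, " thread(s): ", timings[n], " s")
end
```

A single timed run per setting is noisy; repeating each measurement a few times and keeping the minimum gives steadier numbers.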

Answer 29 · 2016-08-02T07:33:34.000Z

Maybe there's a case for starting FFTW up with multiple threads (like 4?) by default?

Answer 30 · 2016-08-02T08:54:50.000Z

Why not just put FFTW.set_num_threads(Sys.CPU_CORES) in the julia boot?
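
As a per-user version of this suggestion, something like the following could go in a startup file. It is only a sketch and assumes a current Julia with the FFTW.jl package, where the startup file is ~/.julia/config/startup.jl and Sys.CPU_CORES has been renamed Sys.CPU_THREADS.

```julia
using FFTW  # assumption: current Julia with the FFTW.jl package

# Sys.CPU_CORES (used in this thread) was renamed Sys.CPU_THREADS in Julia 0.7.
nthreads_fftw = Sys.CPU_THREADS

# Ask FFTW to use one thread per logical CPU for plans created from now on.
FFTW.set_num_threads(nthreads_fftw)
```

When several Julia worker processes share one machine, each of them would request this many FFTW threads, which can oversubscribe the CPU.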

Answer 31 · 2016-08-02T10:21:48.000Z

Mentioning #17429 here. FFTW threading should be enabled by default only on the master.

Answer 32 · 2016-08-02T10:31:57.000Z

Why not just put FFTW.set_num_threads(Sys.CPU_CORES) in the julia boot?

Can complicate things for parallel. Please see #17726 (comment) and comment if the proposal is fine.

Answer 33 · 2016-08-02T11:18:02.000Z

The parallel stuff is already complicated by BLAS and Julia's own threads now. I did see #17726 but did not have a well formed opinion on the topic.

Answer 34 · 2016-08-04T07:51:44.000Z

FFTW should be moved to a package during 0.6 anyway.

Answer 35 · 2016-08-23T22:19:54.000Z

Somehow this issue got out of hand. Title has been changed, but the original issue has not been addressed.

I can confirm the observation by @loganwilliams that copying libfftw* from Octave speeds up Julia performance in my case by a factor of 4. This is for both 0.5.0-rc3+0 and 0.4.5 obtained from http://julialang.org/downloads/ . Octave was from https://sourceforge.net/projects/octave/files/Octave%20MacOSX%20Binary/2014-09-25-Binary-of-GNU-Octave-3.8.2-for-OSX-10.9.5/ .

julia> Sys.CPU_CORES
2

julia> FFTW.set_num_threads(Sys.CPU_CORES)

julia> R=rand(5000,5000);@time a=rfft(R);
  4.953068 seconds (389.04 k allocations: 208.333 MB, 1.36% gc time)

julia> R=rand(5000,5000);@time a=rfft(R);
  3.971006 seconds (99 allocations: 190.816 MB, 0.64% gc time)

julia> R=rand(5000,5000);@time a=rfft(R);
  3.925199 seconds (99 allocations: 190.816 MB, 0.48% gc time)

# cp /usr/local/octave/3.8.2/lib/libfft* /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia

julia> FFTW.set_num_threads(Sys.CPU_CORES)

julia> R=rand(5000,5000);@time a=rfft(R);
  1.781723 seconds (389.98 k allocations: 208.876 MB, 3.56% gc time)

julia> R=rand(5000,5000);@time a=rfft(R);
  0.986160 seconds (99 allocations: 190.816 MB, 2.79% gc time)

julia> R=rand(5000,5000);@time a=rfft(R);
  0.986806 seconds (99 allocations: 190.816 MB, 1.94% gc time)

julia> versioninfo()
Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM)2 Duo CPU     P7350  @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Similar numbers for the 0.5.0-rc3. The numbers improve 4x if Octave libfftw* files are copied, just like for version 0.4.5.

julia> versioninfo()
Julia Version 0.5.0-rc3+0
Commit e6f843b (2016-08-22 23:43 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM)2 Duo CPU     P7350  @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, penryn)
julia> FFTW.set_num_threads(Sys.CPU_CORES)

julia> R=rand(5000,5000);@time a=rfft(R);
  5.110154 seconds (482.33 k allocations: 211.897 MB, 2.41% gc time)

julia> R=rand(5000,5000);@time a=rfft(R);
  4.041525 seconds (95 allocations: 190.816 MB, 1.02% gc time)

julia> R=rand(5000,5000);@time a=rfft(R);
  4.047888 seconds (95 allocations: 190.816 MB, 0.93% gc time)

Answer 36 · 2016-08-23T22:35:03.000Z

Well, we can look into what octave is doing differently than us when they build FFTW (keeping in mind the difference in license, I assume their makefiles are also GPL licensed). Are they building binaries less conservatively in terms of the hardware instruction sets they support?

Answer 37 · 2016-08-23T22:44:10.000Z

For the comparison I used Octave-3.8.2, that is the last version of the binaries that supports my cpu.

Answer 38 · 2016-08-23T23:15:20.000Z

FWIW, just by looking at the strings of the dylib-files, one finds for Octave versions

fftw-3.3.4-sse2
fftw-3.3.4
/usr/local/octave/3.8.2/bin/gcc-mp-4.9 -std=gnu99 -pipe -Os -fno-common -O3 -fomit-frame-pointer -fstrict-aliasing

and for Julia versions

fftw-3.3.4-fma-sse2-avx
clang -stdlib=libc++ -mmacosx-version-min=10.7 -march=core2 -integrated-as -m64  -I/usr/local/include

Answer 39 · 2016-08-23T23:19:18.000Z

Ah, is this another case where CFLAGS from the buildbot environment are resulting in no -O flags getting sent to dependency libraries? cc @staticfloat I'd really like to get rid of all your profile environment variables from the buildbots, since these can be tough to track down.

Answer 40 · 2016-08-23T23:38:29.000Z

Ouch. Yes, FFTW stores the compiler and flags, and they can be retrieved via the global variables fftw_cc and fftw_codelet_optim, both of which are const char*.

(cc is the compiler and default flags used for most of FFTW, and codelet_optim is any changes to the flags for the computational kernels.)

Answer 41 · 2016-08-23T23:43:43.000Z

On MacOS with Julia master built from source, I get

julia> unsafe_string(cglobal((:fftw_cc, FFTW.libfftw), UInt8))
"clang -stdlib=libc++ -mmacosx-version-min=10.7 -m64  -O3 -fomit-frame-pointer -mtune=native -fstrict-aliasing -fno-schedule-insns -ffast-math"

julia> unsafe_string(cglobal((:fftw_codelet_optim, FFTW.libfftw), UInt8))
""

which looks okay.

Answer 42 · 2016-08-23T23:45:18.000Z

With the official Julia 0.4.6 binary on MacOS, I get

julia> bytestring(cglobal((:fftw_cc, FFTW.libfftw), UInt8))
"clang -stdlib=libc++ -mmacosx-version-min=10.7 -march=core2 -integrated-as -m64  -I/usr/local/include"

julia> bytestring(cglobal((:fftw_codelet_optim, FFTW.libfftw), UInt8))
""

which looks like zero compiler optimizations, which will definitely hurt performance.
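
The difference between the two reported flag strings can be checked mechanically. The sketch below copies the two strings from the posts above; has_opt_flag is a hypothetical helper, and its regular expression only covers the common -O1, -O2, -O3, -Os, -Oz and -Ofast spellings.

```julia
# Hypothetical helper: true if a compiler command line contains an
# optimization flag such as -O2, -O3, -Os or -Ofast.
has_opt_flag(cc::AbstractString) =
    occursin(r"(?:^|\s)-O(?:[123sz]|fast)\b", cc)

# fftw_cc strings as reported in this thread
source_build = "clang -stdlib=libc++ -mmacosx-version-min=10.7 -m64  -O3 -fomit-frame-pointer -mtune=native -fstrict-aliasing -fno-schedule-insns -ffast-math"
official_046 = "clang -stdlib=libc++ -mmacosx-version-min=10.7 -march=core2 -integrated-as -m64  -I/usr/local/include"

println("source build optimized:   ", has_opt_flag(source_build))
println("official 0.4.6 optimized: ", has_opt_flag(official_046))
```

The same check can be applied to whatever string fftw_cc returns on another installation.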

Answer 43 · 2016-08-23T23:48:22.000Z

This is the same issue as #17751 (comment). The mac buildbot has a whole bunch of profile scripts that set CFLAGS and a number of other environment variables that are overriding things. It's potentially a fixable bug in various upstream build systems as in JuliaMath/openlibm#142, but that's way messier to do for all of our upstreams than to remove the profile scripts.

Answer 44 · 2016-08-23T23:48:32.000Z

(Here is how FFTW picks its cflags: https://github.com/FFTW/fftw3/blob/master/m4/ax_cc_maxopt.m4)

Answer 45 · 2016-08-23T23:51:32.000Z

Though the lack of optimization flags due to buildbot misconfiguration is a separate issue from the current title of

Enable FFTW threading by default

it is contributing to the difference in performance even when using the same number of threads.

Answer 46 · 2016-08-24T05:04:47.000Z

So a second issue should be created, like "Lack of optimization of FFTW due to buildbot misconfiguration"

Answer 47 · 2016-08-24T06:56:37.000Z

Or just change the title of this issue?

Answer 48 · 2016-08-24T07:15:48.000Z

Enable FFTW threading by default

Is a valid, but separate, issue to the buildbot flags problem. The latter should now be resolved, I believe.

Answer 49 · 2016-08-24T10:56:49.000Z

It is not only MacOS.
Both Win64 and generic Linux precompiled binaries (version 0.4.5) return:

julia> bytestring(cglobal((:fftw_codelet_optim, FFTW.libfftw), UInt8))
""

Answer 50 · 2016-08-24T12:03:59.000Z

@GaborOszlanyi, that just means that the codelets are compiled with the same flags as the rest of FFTW, which is not necessarily a problem. Look at bytestring(cglobal((:fftw_cc, FFTW.libfftw), UInt8)).

Answer 51 · 2016-08-24T13:30:34.000Z

fftw_cc is OK in both cases.
Sorry for the misunderstanding.

Answer 52 · 2016-08-24T17:09:45.000Z

I would still like to create a second issue, because these are two different issues:
a) enabling multi-threading by default; this needs a decision
b) lack of optimization of FFTW due to buildbot misconfiguration; this just needs to be fixed, and perhaps backported to 0.4 and 0.5
If nobody opposes my proposal, I will open a second issue.

Answer 53 · 2016-08-24T23:21:05.000Z

@ufechner7 the second issue was already fixed, and does not require any changes in this repository.

Answer 54 · 2016-08-25T07:14:24.000Z

So issue b) will be fixed in the next binary releases of 0.4 and 0.5? That would be nice.

You write that fixing the buildbots "does not require any changes in this repository". Is there a separate repository for the buildbots?

Answer 55 · 2016-08-25T07:15:36.000Z

Yep, the main one is wittily named julia-buildbot

Answer 56 · 2016-08-25T07:32:05.000Z

issue b) this will be fixed in the next binary releases of 0.4 and 0.5

Should be, if the fix was complete and correct. Once we resolve #18079 it should also be testable with 0.6-dev nightlies.

Answer 57 · 2016-08-26T13:34:31.000Z

As of yesterday the 0.4.6 Linux 64 binaries (julia-2e358ce975) shipped with the slow fftw libraries, which looked the same as the ones in 0.5 binaries.

As I noted before, the binaries from 0.4.5 (2ac304d) are good, using those in 0.4.6 buys you a factor 5 speedup.

Cheers!

Answer 58 · 2016-08-26T13:39:21.000Z

@tkelman Is this something we can fix in 0.5.x?

Answer 59 · 2016-08-26T13:51:52.000Z

Are you asking about multithreading, or are you asking about the other issue which now has its own #18245?

Enabling threading by default is a bit much of a behavior change to backport I think.

Answer 60 · 2016-08-26T14:26:21.000Z

I thought #18245 was referring here for the fix for single-threaded perf.

I don't think we should backport threading by default, but we should probably do it on master sooner rather than later for 0.6.

Answer 61 · 2016-08-26T14:32:14.000Z

Let's keep the discussions separate from now on. This issue is titled

Enable FFTW threading by default

and should stay focused on that going forward if we can.

Answer 62 · 2016-09-07T03:37:58.000Z

Just a side question to @tkelman out of curiosity: will FFTW move to a package in favor of other FFT implementations like MKL? Is it related to FFTW's GPL license? Thanks 😃

Answer 63 · 2017-07-21T17:42:29.000Z

This issue should be reopened on FFTW.jl

Answer 64 · 2019-08-04T19:36:07.000Z

FFTW integration with julia's partr threads: JuliaMath/FFTW.jl#105