Enable FFTW threading by default (to match up to performance of octave and others)
loganwilliams opened this issue · 64 comments
I've noticed that Julia is an order of magnitude slower to compute FFTs than GNU Octave. This discrepancy in speed confuses me, given that both Octave and Julia ought to be calling the same FFTW library. Is this expected?
Times for Julia:
julia> R = rand(512,512);
julia> @time fft(R);
0.042149 seconds (76 allocations: 8.003 MB)
julia> R = rand(5000,5000);
julia> @time fft(R);
6.212666 seconds (76 allocations: 762.943 MB, 1.17% gc time)
Times for Octave:
>> R = rand(512,512);
>> tic; fft2(R); toc;
Elapsed time is 0.00377011 seconds.
>> R = rand(5000,5000);
>> tic; fft2(R); toc;
Elapsed time is 0.556037 seconds.
After calling FFTW.set_num_threads(2) and using rfft instead of fft, I saw a small improvement in Julia's performance, but a large discrepancy still remains.
julia> R = rand(512,512);
julia> @time rfft(R);
0.018692 seconds (93 allocations: 2.012 MB)
julia> R = rand(5000,5000);
julia> @time rfft(R);
1.736385 seconds (99 allocations: 190.816 MB, 0.95% gc time)
I have reproduced this issue on my personal computer (OS X 10.11.5), and on a Google Compute Engine VM running Ubuntu 16.04. Here is my Julia versioninfo():
Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin13.4.0)
CPU: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.3
Please read: http://docs.julialang.org/en/release-0.4/manual/performance-tips/
In particular: 1) put the code you want to benchmark into functions 2) run this function twice, because when you call it the first time it is getting compiled.
I had already read both tips. Wrapping fft(R) (which is a single function call) in a second function did nothing to improve performance. I also already ran each timing statement twice -- I just excluded the first, unhelpful, timing result for brevity's sake.
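For readers who want to reproduce the timings, the two tips above (wrap the call in a function, discard the first run) can be sketched as a small helper. This is only an illustration: `best_time` is a hypothetical name, not part of Julia or FFTW, and `sum` stands in for the FFT call so the sketch has no package dependencies.

```julia
# Hypothetical helper (not part of Julia or FFTW): call `f(args...)` once to
# trigger compilation, then report the fastest of `runs` timed calls.
function best_time(f, args...; runs::Integer = 5)
    runs >= 1 || throw(ArgumentError("runs must be at least 1"))
    f(args...)                                   # warm-up call; result discarded
    return minimum(@elapsed(f(args...)) for _ in 1:runs)
end

R = rand(512, 512)
t = best_time(sum, R)   # any callable works; `sum` keeps the sketch dependency-free
println("fastest of 5 runs: ", t, " s")
```

With an FFT implementation loaded, the same helper would be called as `best_time(rfft, R)`. Taking the minimum rather than the mean reduces the influence of garbage-collection pauses such as those visible in some of the `@time` lines in this thread.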
Cc @stevengj
Hello,
I benchmarked your example on my computer. My hardware: i7-2600K CPU @ 3.40GHz × 4
OS: Ubuntu Linux 14.04, 64 bits
CPU governor: performance (!)
Results:
R=rand(512,512)
Julia, 1 thread rfft(R): 1.6 - 1.7 ms
Julia, 2 threads rfft(R): 1.3 ms
Julia, 4 threads rfft(R): 1.1 ms
Octave fft2(R): 1.1 ms
R=rand(5000,5000)
Julia, 1 thread rfft(R): 0.31 .. 0.32 s
Julia, 2 threads rfft(R): 0.17 s
Julia, 4 threads rfft(R): 0.10 .. 0.11 s
Octave fft2(R): 0.17 s
Summary: Julia with two threads is about as fast as Octave. Julia with four threads is much faster than Octave, but only for large problems.
Versioninfo:
Julia Version 0.4.2
Commit bb73f34 (2015-12-06 21:47 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
WORD_SIZE: 64
BLAS: libopenblas (NO_AFFINITY SANDYBRIDGE)
LAPACK: liblapack.so.3
LIBM: libopenlibm
LLVM: libLLVM-3.3
GNU Octave, version 3.8.1
Copyright (C) 2014 John W. Eaton and others.
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. For details, type 'warranty'.
Octave was configured for "x86_64-pc-linux-gnu".
I upgraded to Julia 0.4.5 from the Ubuntu ppa, but the timing results do not change.
julia> versioninfo()
Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
WORD_SIZE: 64
BLAS: libopenblas (NO_AFFINITY SANDYBRIDGE)
LAPACK: liblapack.so.3
LIBM: libopenlibm
LLVM: libLLVM-3.3
Hmmm. Yesterday, I had thought I was able to reproduce this on a VM, but I must have been mistaken, because attempting it again now on an Ubuntu Google Cloud instance I get the same results that you have shown.
You are right, I had a typo. It is 0.17s for the 5000x5000 matrix with Octave. Does this mean that the issue can be closed, or is there still a problem on OS X?
I'm still experiencing the issue on my computer. Is there any additional information I can provide to help diagnose?
Since this sort of issue has come up a few times, perhaps we should add documentation to the various FFT functions in Julia about what they are comparable to in other languages (Matlab, R, etc.).
I agree it should be documented better but I don't think that was ever the issue here. (The very first post of this issue uses fft in julia and fft2 in octave).
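As a rough illustration of which calls are comparable: Octave's fft(R) transforms each column of a matrix, while fft2(R) is the 2-D transform. Julia's fft(R) transforms along every dimension, so for a matrix it corresponds to fft2, and rfft returns only the non-redundant half of that result for real input. The sketch below assumes a current Julia, where these functions live in the FFTW.jl package rather than in Base as they did in 0.4.

```julia
using FFTW  # assumption: current Julia with the FFTW.jl package installed

R = rand(8, 6)

# fft(R) transforms along all dimensions: a 2-D FFT for a matrix, like
# Octave's fft2(R). Octave's fft(R) is column-wise, i.e. fft(R, 1) in Julia.
full2d = fft(R)
columns_then_rows = fft(fft(R, 1), 2)

# rfft exploits real input and keeps only the first div(n, 2) + 1 rows
half = rfft(R)

println(full2d ≈ columns_then_rows)
println(size(half))
println(half ≈ full2d[1:div(size(R, 1), 2) + 1, :])
```

So comparing Julia's rfft against Octave's fft2 slightly favors Julia, since rfft does roughly half the work, while fft against fft2 is the like-for-like comparison.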
Alright, I found another Mac OS X computer to test this on, but it was quite old, running OS X 10.8.5.
for 512x512 matrix
.0089 seconds on average in Julia (4 threads, using rfft)
.0070 seconds on average in Octave (default threads (not sure what that is) using fft2)
for 5000x5000 matrix
1.27 seconds on average in Julia
1.19 seconds on average in Octave
Here, Julia seems just slightly slower than Octave.
Julia version info:
Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin13.4.0)
CPU: Intel(R) Core(TM) i5-3330S CPU @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.3
Octave version info: (running an older version because that was all I could get running quickly on OS 10.8.5.)
----------------------------------------------------------------------
GNU Octave Version 3.2.3
GNU Octave License: GNU General Public License
Operating System: Darwin 12.5.0 Darwin Kernel Version 12.5.0: Sun Sep 29 13:33:47 PDT 2013; root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64
----------------------------------------------------------------------
no packages installed.
My computer running OS X 10.11.5 continues to exhibit the order of magnitude performance difference. Can anyone else reproduce this?
Here's the result of profiling the rfft of a 5000x5000 matrix on my 10.11.5 computer: http://pastebin.com/nmr5Nyvn
Julia on my MacBook 3x slower than Octave
julia> versioninfo()
Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin13.4.0)
CPU: Intel(R) Core(TM)2 Duo CPU P7350 @ 2.00GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.3
julia> FFTW.set_num_threads(1);R=rand(5000,5000);@time a=rfft(R);
7.072357 seconds (99 allocations: 190.816 MB, 0.32% gc time)
julia> FFTW.set_num_threads(2);R=rand(5000,5000);@time a=rfft(R);
3.810004 seconds (99 allocations: 190.816 MB, 0.65% gc time)
octave:5> R=rand(5000,5000); tic, a=fft2(R); toc
Elapsed time is 1.35603 seconds.
octave:6> ver
----------------------------------------------------------------------
GNU Octave Version 3.8.2
GNU Octave License: GNU General Public License
Operating System: Darwin 15.5.0 Darwin Kernel Version 15.5.0: Tue Apr 19 18:36:36 PDT 2016; root:xnu-3248.50.21~8/RELEASE_X86_64 x86_64
----------------------------------------------------------------------
no packages installed.
Maybe of interest, size of libraries in /usr/local/octave/3.8.2/lib
# ls -al libfftw3f.3.dylib
-rwxr-xr-x 1 root admin 1678704 20 Elo 2014 libfftw3f.3.dylib
and `/Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia`
# ls -al libfftw3f.3.dylib
-rwxr-xr-x@ 1 jaakko admin 6666908 18 Maa 03:12 libfftw3f.3.dylib
Used installation binaries
https://sourceforge.net/projects/octave/files/Octave%20MacOSX%20Binary/2014-09-25-Binary-of-GNU-Octave-3.8.2-for-OSX-10.9.5/
https://s3.amazonaws.com/julialang/bin/osx/x64/0.4/julia-0.4.5-osx10.7+.dmg
Replacing the FFTW libraries from the Julia Mac OS X package with the FFTW libraries from the Octave Mac OS X package fixes the issue. Julia is now faster than Octave.
julia> R = rand(512,512);
julia> FFTW.set_num_threads(2);
julia> @time rfft(R);
0.347049 seconds (390.80 k allocations: 19.633 MB, 1.41% gc time)
julia> @time rfft(R);
0.001880 seconds (93 allocations: 2.012 MB)
julia> @time rfft(R);
0.002026 seconds (93 allocations: 2.012 MB)
julia> @time rfft(R);
0.003032 seconds (93 allocations: 2.012 MB, 788.36% gc time)
julia> R = rand(5000,5000);
julia> @time rfft(R);
0.339954 seconds (99 allocations: 190.816 MB, 0.27% gc time)
julia> @time rfft(R);
0.337303 seconds (99 allocations: 190.816 MB, 2.07% gc time)
julia> @time rfft(R);
0.330447 seconds (99 allocations: 190.816 MB, 9.41% gc time)
It would seem interesting to know the difference in how the two libraries were compiled.
The speed of rfft on my machine with the default Julia FFTW libraries is close to @loganwilliams's post above.
julia> FFTW.set_num_threads(2);
julia> R = rand(512,512);
julia> @time rfft(R);
0.003687 seconds (88 allocations: 2.013 MB)
julia> @time rfft(R);
0.003386 seconds (88 allocations: 2.013 MB)
julia> @time rfft(R);
0.003434 seconds (88 allocations: 2.013 MB)
julia> @time rfft(R);
0.003323 seconds (88 allocations: 2.013 MB)
julia> R = rand(5000,5000);
julia> @time rfft(R);
0.257353 seconds (93 allocations: 190.816 MB, 0.50% gc time)
julia> @time rfft(R);
0.254915 seconds (93 allocations: 190.816 MB, 6.54% gc time)
julia> @time rfft(R);
0.271649 seconds (93 allocations: 190.816 MB, 36.09% gc time)
julia> @time rfft(R);
0.270884 seconds (93 allocations: 190.816 MB, 0.61% gc time)
julia> @time rfft(R);
0.274651 seconds (93 allocations: 190.816 MB, 0.59% gc time)
julia> @time rfft(R);
0.268775 seconds (93 allocations: 190.816 MB, 0.62% gc time)
The above test was run on a master build of Julia that is a few days old.
I updated my Julia to the latest today but the performance is similar.
julia> versioninfo()
Julia Version 0.5.0-dev+4877
Commit 02ac2b1* (2016-06-20 22:32 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
octave:1> R=rand(5000,5000); tic, a=fft2(R); toc
Elapsed time is 0.22766 seconds.
octave:2> R=rand(5000,5000); tic, a=fft2(R); toc
Elapsed time is 0.201882 seconds.
octave:3> R=rand(5000,5000); tic, a=fft2(R); toc
Elapsed time is 0.173801 seconds.
octave:4> ver
----------------------------------------------------------------------
GNU Octave Version 3.8.1
GNU Octave License: GNU General Public License
Operating System: Linux 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:43:14 UTC 2015 x86_64
----------------------------------------------------------------------
Package Name | Version | Installation directory
--------------+---------+-----------------------
io | 2.2.9 | /home/guo/octave/io-2.2.9
mpi *| 1.1.1 | /usr/share/octave/packages/mpi-1.1.1
statistics | 1.2.4 | /home/guo/octave/statistics-1.2.4
@zhmz90: Could you please first mention which computer (CPU, clock speed) you use, and second, also test the speed with Octave?
@ufechner7 I have added my test to the above post. The result shows fft2 in Octave is faster than rfft in Julia. In Julia I called FFTW.set_num_threads(2), while in Octave I did nothing since I am not familiar with Octave.
guo@x02:~$ uname -a
Linux x02 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:43:14 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Who maintains the OS X package distribution?
which one? exactly how did you install julia and which package distribution are you referring to?
@tkelman Uh, the one displayed very prominently on Julia's web page: http://julialang.org/downloads/
Exact version info is in my first post.
I have resolved my personal issue by replacing the libraries with libraries from Octave's distribution, but as @timholy noted, "It would seem interesting to know the difference in how the two libraries were compiled."
Just wanted to check that you weren't getting it from homebrew or similar. So that build is produced from running a complete source build on our mac buildbots. Our makefile flags for fftw can be found under deps (maybe a handful of related flags in Make.inc but I think those are mostly enabling or disabling different dependencies).
I can report similar performance issues on my machine (Mac OS X 10.11.5, 4GHz Intel Core i7). Octave is about 4 times faster than Julia to compute FFTs. As @loganwilliams suggested I copied the fftw libraries from the Octave package and this improved things. But Julia is still about 50% slower than Octave. See below for the results.
In Octave (freshly installed from: https://sourceforge.net/projects/octave/files/Octave%20MacOSX%20Binary/2016-06-06-binary-octave-4.0.2/octave_gui_402.dmg/download )
>> fftw('threads')
ans = 8
>> x = randn(5000,5000); tic; y=fft2(x); toc
Elapsed time is 0.348789 seconds.
>> x = randn(5000,5000); tic; y=fft2(x); toc
Elapsed time is 0.37518 seconds.
>> x = randn(5000,5000); tic; y=fft2(x); toc
Elapsed time is 0.369055 seconds.
>> ver
----------------------------------------------------------------------
GNU Octave Version: 4.0.2
GNU Octave License: GNU General Public License
Operating System: Darwin 15.5.0 Darwin Kernel Version 15.5.0: Tue Apr 19 18:36:36 PDT 2016; root:xnu-3248.50.21~8/RELEASE_X86_64 x86_64
----------------------------------------------------------------------
no packages installed.
Julia with FFTW shipping with Julia package: https://s3.amazonaws.com/julialang/bin/osx/x64/0.4/julia-0.4.6-osx10.7+.dmg
julia> FFTW.set_num_threads(8)
julia> x = randn(5000,5000); @time y=fft(x);
1.178509 seconds (77 allocations: 762.943 MB, 2.92% gc time)
julia> x = randn(5000,5000); @time y=fft(x);
1.201879 seconds (76 allocations: 762.943 MB, 4.64% gc time)
julia> x = randn(5000,5000); @time y=fft(x);
1.327285 seconds (76 allocations: 762.943 MB, 6.36% gc time)
julia> versioninfo()
Julia Version 0.4.6
Commit 2e358ce (2016-06-19 17:16 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin13.4.0)
CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.3
Julia with FFTW from octave package.
julia> FFTW.set_num_threads(8)
julia> x = randn(5000,5000); @time y=fft(x);
0.512738 seconds (76 allocations: 762.943 MB, 20.49% gc time)
julia> x = randn(5000,5000); @time y=fft(x);
0.506314 seconds (76 allocations: 762.943 MB, 12.40% gc time)
julia> x = randn(5000,5000); @time y=fft(x);
0.503848 seconds (76 allocations: 762.943 MB, 16.77% gc time)
If it helps anyone, I had the same issue and I copied the fftw3* library files from julia 0.4.5 to julia 0.4.6 and recovered similar runtime to Matlab.
I have a related question: isn't Julia starting up only with 1 FFTW thread by default? If so, why is that done when OpenBLAS is made to start with multiple threads?
I did a threadwise comparison of the 2 functions, rfft in Julia and fft2 in Octave. Below are the timings in seconds.
| Threads | Julia 0.5 - rfft | Octave - fft2 |
|---|---|---|
| 1 | 0.707015 | 0.91564 |
| 2 | 0.346071 | 0.640057 |
| 4 | 0.319295 | 0.57852 |
| 8 | 0.323513 | 0.517388 |
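To make the scaling in the table easier to read, the timings can be turned into speedups relative to the single-threaded run. This is just arithmetic on the numbers above; `speedup` is a name introduced here for illustration.

```julia
# Timings in seconds, copied from the table above
threads     = [1, 2, 4, 8]
julia_rfft  = [0.707015, 0.346071, 0.319295, 0.323513]
octave_fft2 = [0.91564, 0.640057, 0.57852, 0.517388]

# Speedup relative to the single-threaded run of the same program
speedup(times) = times[1] ./ times

for (n, sj, so) in zip(threads, speedup(julia_rfft), speedup(octave_fft2))
    println(n, " threads: Julia ", round(sj, digits = 2), "x, Octave ",
            round(so, digits = 2), "x")
end
```

On these numbers most of Julia's gain comes from the second thread, with little change from 2 to 8 threads, while Octave keeps improving slowly up to 8.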
julia> versioninfo()
Julia Version 0.5.0-pre+5607
Commit b510ad9 (2016-07-22 05:36 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin14.4.0)
CPU: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
On my machine by default Octave starts with 4 threads and Julia starts with just 1, like @ranjanan mentioned above.
To find the number of threads and to change the same in Octave, use fftw("threads") and fftw("threads", 2).
Maybe there's a case for starting FFTW up with multiple threads (like 4?) by default?
Why not just put FFTW.set_num_threads(Sys.CPU_CORES) in the julia boot?
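Until a default is agreed on, the same effect can be had per session. A minimal sketch, assuming a current Julia with the FFTW.jl package (in 0.4/0.5 these functions were in Base, and `Sys.CPU_CORES` has since been renamed `Sys.CPU_THREADS`):

```julia
using FFTW  # assumption: current Julia with the FFTW.jl package installed

# Sys.CPU_CORES from this thread was renamed Sys.CPU_THREADS in Julia 0.7.
nthreads = Sys.CPU_THREADS
FFTW.set_num_threads(nthreads)   # affects plans created after this call

R = rand(1024, 1024)
rfft(R)                          # warm-up: compiles the call and builds a plan
t = @elapsed rfft(R)
println("rfft with ", nthreads, " FFTW threads: ", t, " s")
```

Whether the logical CPU count is the right default is exactly the open question here, since FFTW threads compete with BLAS threads and with Julia's own parallelism.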
Mentioning #17429 here. FFTW threading should be enabled by default only on the master.
Why not just put FFTW.set_num_threads(Sys.CPU_CORES) in the julia boot?
Can complicate things for parallel. Please see #17726 (comment) and comment if the proposal is fine.
The parallel stuff is already complicated by BLAS and Julia's own threads now. I did see #17726 but did not have a well formed opinion on the topic.
FFTW should be moved to a package during 0.6 anyway.
Somehow this issue got out of hand. Title has been changed, but the original issue has not been addressed.
I can confirm the observation by @loganwilliams that copying libfftw* from Octave speeds up Julia performance in my case by a factor of 4. This is for both 0.5.0-rc3+0 and 0.4.5 obtained from http://julialang.org/downloads/ . Octave was from https://sourceforge.net/projects/octave/files/Octave%20MacOSX%20Binary/2014-09-25-Binary-of-GNU-Octave-3.8.2-for-OSX-10.9.5/ .
julia> Sys.CPU_CORES
2
julia> FFTW.set_num_threads(Sys.CPU_CORES)
julia> R=rand(5000,5000);@time a=rfft(R);
4.953068 seconds (389.04 k allocations: 208.333 MB, 1.36% gc time)
julia> R=rand(5000,5000);@time a=rfft(R);
3.971006 seconds (99 allocations: 190.816 MB, 0.64% gc time)
julia> R=rand(5000,5000);@time a=rfft(R);
3.925199 seconds (99 allocations: 190.816 MB, 0.48% gc time)
# cp /usr/local/octave/3.8.2/lib/libfft* /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia
julia> FFTW.set_num_threads(Sys.CPU_CORES)
julia> R=rand(5000,5000);@time a=rfft(R);
1.781723 seconds (389.98 k allocations: 208.876 MB, 3.56% gc time)
julia> R=rand(5000,5000);@time a=rfft(R);
0.986160 seconds (99 allocations: 190.816 MB, 2.79% gc time)
julia> R=rand(5000,5000);@time a=rfft(R);
0.986806 seconds (99 allocations: 190.816 MB, 1.94% gc time)
julia> versioninfo()
Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin13.4.0)
CPU: Intel(R) Core(TM)2 Duo CPU P7350 @ 2.00GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.3
Similar numbers for the 0.5.0-rc3. The numbers improve 4x if Octave libfftw* files are copied, just like for version 0.4.5.
julia> versioninfo()
Julia Version 0.5.0-rc3+0
Commit e6f843b (2016-08-22 23:43 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin13.4.0)
CPU: Intel(R) Core(TM)2 Duo CPU P7350 @ 2.00GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, penryn)
julia> FFTW.set_num_threads(Sys.CPU_CORES)
julia> R=rand(5000,5000);@time a=rfft(R);
5.110154 seconds (482.33 k allocations: 211.897 MB, 2.41% gc time)
julia> R=rand(5000,5000);@time a=rfft(R);
4.041525 seconds (95 allocations: 190.816 MB, 1.02% gc time)
julia> R=rand(5000,5000);@time a=rfft(R);
4.047888 seconds (95 allocations: 190.816 MB, 0.93% gc time)
Well, we can look into what octave is doing differently than us when they build FFTW (keeping in mind the difference in license, I assume their makefiles are also GPL licensed). Are they building binaries less conservatively in terms of the hardware instruction sets they support?
For the comparison I used Octave-3.8.2, that is the last version of the binaries that supports my cpu.
FWIW, just by looking at the strings of the dylib-files, one finds for Octave versions
fftw-3.3.4-sse2
fftw-3.3.4
/usr/local/octave/3.8.2/bin/gcc-mp-4.9 -std=gnu99 -pipe -Os -fno-common -O3 -fomit-frame-pointer -fstrict-aliasing
and for Julia versions
fftw-3.3.4-fma-sse2-avx
clang -stdlib=libc++ -mmacosx-version-min=10.7 -march=core2 -integrated-as -m64 -I/usr/local/include
Ah, is this another case where CFLAGS from the buildbot environment are resulting in no -O flags getting sent to dependency libraries? cc @staticfloat I'd really like to get rid of all your profile environment variables from the buildbots, since these can be tough to track down.
Ouch. Yes, FFTW stores the compiler and flags, and they can be retrieved from the global variables fftw_cc and fftw_codelet_optim, both of which are C strings (const char).
(cc is the compiler and default flags used for most of FFTW, and codelet_optim is any changes to the flags for the computational kernels.)
On MacOS with Julia master built from source, I get
julia> unsafe_string(cglobal((:fftw_cc, FFTW.libfftw), UInt8))
"clang -stdlib=libc++ -mmacosx-version-min=10.7 -m64 -O3 -fomit-frame-pointer -mtune=native -fstrict-aliasing -fno-schedule-insns -ffast-math"
julia> unsafe_string(cglobal((:fftw_codelet_optim, FFTW.libfftw), UInt8))
""
which looks okay.
With the official Julia 0.4.6 binary on MacOS, I get
julia> bytestring(cglobal((:fftw_cc, FFTW.libfftw), UInt8))
"clang -stdlib=libc++ -mmacosx-version-min=10.7 -march=core2 -integrated-as -m64 -I/usr/local/include"
julia> bytestring(cglobal((:fftw_codelet_optim, FFTW.libfftw), UInt8))
""
which looks like zero compiler optimizations, which will definitely hurt performance.
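The check being done by eye here can be sketched as a one-line test on the flag string. `has_opt_flag` is a hypothetical helper written for this illustration, and it only looks for a token starting with `-O`, which is a crude proxy for "optimizations enabled".

```julia
# Hypothetical helper: true if any whitespace-separated token is an -O flag
has_opt_flag(flags::AbstractString) = any(tok -> startswith(tok, "-O"), split(flags))

# Flag strings copied from the two outputs quoted above
source_build = "clang -stdlib=libc++ -mmacosx-version-min=10.7 -m64 -O3 -fomit-frame-pointer -mtune=native -fstrict-aliasing -fno-schedule-insns -ffast-math"
official_046 = "clang -stdlib=libc++ -mmacosx-version-min=10.7 -march=core2 -integrated-as -m64 -I/usr/local/include"

println(has_opt_flag(source_build))
println(has_opt_flag(official_046))
```

The same function could be applied to the string returned by the `cglobal` call on any installed binary to spot an unoptimized FFTW build.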
This is the same issue as #17751 (comment). The mac buildbot has a whole bunch of profile scripts that set CFLAGS and a number of other environment variables that are overriding things. It's potentially a fixable bug in various upstream build systems as in JuliaMath/openlibm#142, but that's way messier to do for all of our upstreams than to remove the profile scripts.
(Here is how FFTW picks its cflags: https://github.com/FFTW/fftw3/blob/master/m4/ax_cc_maxopt.m4)
Though the lack of optimization flags due to buildbot misconfiguration is a separate issue than the current title of
Enable FFTW threading by default
it is contributing to the difference in performance even when using the same number of threads.
So a second issue should be created, like "Lack of optimization of FFTW due to buildbot misconfiguration"
Or just change the title of this issue?
Enable FFTW threading by default
Is a valid, but separate, issue to the buildbot flags problem. The latter should now be resolved, I believe.
It is not only MacOS.
Both Win64 and generic Linux precompiled binaries (version 0.4.5) return:
julia> bytestring(cglobal((:fftw_codelet_optim, FFTW.libfftw), UInt8))
""
@GaborOszlanyi, that just means that the codelets are compiled with the same flags as the rest of FFTW, which is not necessarily a problem. Look at bytestring(cglobal((:fftw_cc, FFTW.libfftw), UInt8)).
fftw_cc is OK in both cases.
Sorry for the misunderstanding.
I would still like to create a second issue, because these are two different issues:
a) enabling multi threading by default; this needs a decision
b) Lack of optimization of FFTW due to buildbot misconfiguration; just needs to be fixed, and perhaps backported to 0.4 and 0.5
These are two different issues. If nobody opposes my proposal, I will open a second issue.
@ufechner7 the second issue was already fixed, and does not require any changes in this repository.
So issue b) will be fixed in the next binary releases of 0.4 and 0.5? That would be nice.
You write that fixing the buildbots "does not require any changes in this repository". Is there a separate repository for the buildbots?
Yep, the main one is wittingly named julia-buildbot
issue b) will be fixed in the next binary releases of 0.4 and 0.5
Should be, if the fix was complete and correct. Once we resolve #18079 it should also be testable with 0.6-dev nightlies.
As of yesterday the 0.4.6 Linux 64 binaries (julia-2e358ce975) shipped with the slow fftw libraries, which looked the same as the ones in 0.5 binaries.
As I noted before, the binaries from 0.4.5 (2ac304d) are good, using those in 0.4.6 buys you a factor 5 speedup.
Cheers!
@tkelman Is this something we can fix in 0.5.x?
Are you asking about multithreading, or are you asking about the other issue which now has its own #18245?
Enabling threading by default is a bit much of a behavior change to backport I think.
I thought #18245 was referring here for the fix for single-threaded perf.
I don't think we should backport threading by default, but we should probably do it on master sooner rather than later for 0.6.
Let's keep the discussions separate from now on. This issue is titled
Enable FFTW threading by default
and should stay focused on that going forward if we can.
Just a side question to @tkelman out of curiosity: will FFTW move to a package in favor of another FFT implementation like MKL? Is it related to the GPL license of FFTW? Thanks.
This issue should be reopened on FFTW.jl
FFTW integration with julia's partr threads: JuliaMath/FFTW.jl#105