ABI conflicts due to 64-bit libopenblas.so
Closed this issue · 106 comments
Julia compiles OpenBLAS to libopenblas.so. This may be a problem for calling libraries that link to a system libopenblas.so, because the runtime linker may substitute Julia's version instead. The problem is that Julia's version is compiled with a 64-bit interface, which is not the default, and so if an external library calls it expecting a 32-bit interface, a crash may result.
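To see why the integer width matters, recall that Fortran-style BLAS routines receive their integer arguments by reference. The following Python sketch is only an illustration, not code from Julia or OpenBLAS: the values 5 and 7 are made up, and little-endian byte order is assumed. It simulates a caller that stores a 32-bit n next to unrelated data, and a callee that reads either 4 or 8 bytes from that address.

```python
import struct

# Hypothetical caller memory: a 32-bit n = 5 followed by 4 unrelated bytes
# (the 32-bit value 7 stands in for whatever happens to sit next to n).
caller_memory = struct.pack("<ii", 5, 7)

# A callee built with 32-bit integers reads 4 bytes and sees the intended value.
(n_32,) = struct.unpack_from("<i", caller_memory, 0)

# A callee built with 64-bit integers reads 8 bytes from the same address.
(n_64,) = struct.unpack_from("<q", caller_memory, 0)

print(n_32)  # 5
print(n_64)  # 30064771077, i.e. 5 + 7 * 2**32
```

The 64-bit reader picks up the neighbouring bytes as the high half of the integer, which is consistent with the "uninitialised value of size 8" that valgrind reports in this issue.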
We encountered what appears to have been this problem on @alanedelman's machine (julia.mit.edu). He recently started experiencing crashes in PyPlot.plot that, with the help of valgrind, I tracked down to apparently:
==17855== Use of uninitialised value of size 8
==17855== at 0xA8B6890: dgemm_beta_NEHALEM (in /home/edelman/julia/usr/lib/libopenblas.so)
==17855== by 0xA082D72: dgemm_nn (in /home/edelman/julia/usr/lib/libopenblas.so)
==17855== by 0x9F558C8: cblas_dgemm (in /home/edelman/julia/usr/lib/libopenblas.so)
==17855== by 0x16430CA5: dotblas_matrixproduct (_dotblas.c:809)
==17855== by 0x14BAB5D4: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
Apparently, Matplotlib is calling OpenBLAS (via NumPy: _dotblas.c is a NumPy file) with the 32-bit interface, but is getting linked at runtime into Julia's openblas library, which is compiled with a 64-bit interface. Recompiling Julia and openblas with USE_BLAS64=0 worked around the problem, but it would be better to avoid the conflict.
Can we just rename our libopenblas.so file to avoid any possible conflict in the runtime linker?
Or is the problem worse than that? If I ccall a library that in turn calls cblas_dgemm, will it end up calling our OpenBLAS version even if it was originally linked to a completely different BLAS library (e.g. libblas.so)?
In that case, we might have to hack OpenBLAS to rename its exported functions (e.g. cblas_dgemm64 etcetera) since we changed the ABI.
@xianyi, is there a way to tell OpenBLAS to add a prefix or suffix (e.g. 64) to all its exported symbols, to make it possible to link both the 32-bit and 64-bit ABI in the same executable?
See also numpy/numpy#3916
Wouldn't it make more sense to put the 64 after the cblas part, as in cblas64_dgemm?
The ideal solution would be to have a separate 64-bit ABI and build both 32 and 64 bit versions in the same library.
@ViralBShah that is actually the best solution here. That would be wonderful!
@StefanKarpinski, note that there is a Fortran dgemm ABI too, and to avoid conflicts you need to rename both C and Fortran (unless we are not linking the Fortran ABI?). But I don't think it really matters what the name looks like, as long as there is a simple deterministic rule and it can be implemented as automatically as possible in the openblas source code. I was just thinking that a suffix might be easier to automate for both C and Fortran ABIs.
Currently we use the fortran abi only.
I wonder if we can somehow make matplotlib use its own blas. While we may be able to do all sorts of gymnastics with openblas, it will be difficult to do the same with vendor provided BLAS.
The other alternative would be to recompile our own numpy, but that makes installing PyCall much more of a pain.
@ViralBShah, does MKL provide the 64-bit ABI?
The other alternative would be to recompile our own numpy, but that makes installing PyCall much more of a pain.
Not to mention that the amount of stuff we compile ourselves is getting slightly ridiculous. But it's hard to avoid.
I believe MKL does have a 64-bit ABI - but not 100% sure. @andreasnoackjensen ?
I thought about recompiling numpy, but that is even more inconvenient.
I am not sure what exactly ABI means, but MKL has 32-bit integers in the *lp64 libraries and 64-bit integers in the *ilp64 libraries. The symbols have the same names.
It's easy to add a prefix or suffix for 64-bit (ilp64) ABI. However, I am not sure OpenBLAS can support lp64 and ilp64 in one binary.
For MKL, you need to link the application with a different interface layer library, e.g. libmkl_intel_lp64.so or libmkl_intel_ilp64.so.
I think adding a prefix or suffix to the ilp64 OpenBLAS interface would already be a big help. @xianyi, assuming that such a suffix were added, what would go wrong if both the 32- and 64-bit OpenBLAS libraries were linked simultaneously?
Would naming the 64-bit version something like libopenblas_ilp64.so solve this?
@ViralBShah, I'm not sure, but I doubt it. If you load two shared libraries which export the same symbol (e.g. dgemm_) but with a different ABI, aren't there still going to be conflicts even if the libraries have different names? (At least if the libraries are loaded with RTLD_GLOBAL?)
The easier thing then for now would be to just use the 32-bit version of openblas with IJulia, if that works.
Nassty 32-bit limits, we hates them forever!
Anyway, it's not just IJulia, since PyCall and Numpy can be used anywhere. And 32-bit vector size limits cause their own problems.
+1, we ran into a very similar issue here too: jump-dev/Ipopt.jl#1 (comment)
This was an instance of the following (here dcopy_ instead of cblas_dgemm, but same idea):
If I ccall a library that in turn calls cblas_dgemm, will it end up calling our OpenBLAS version even if it was originally linked to a completely different BLAS library (e.g. libblas.so)?
Any library linking to any LP64 shared library Blas/Lapack/etc can run into name shadowing and segfaults or other incorrect behavior when ccalled by Julia due to ILP64 openblas. Statically linking LP64 reference blas/lapack into the dependency library solves the issue in the case of Ipopt, but is not an ideal solution.
Since #5291 was merged there are now a handful of calls to cblas functions, otherwise I was going to suggest we could try co-opting OpenBlas' mechanism for handling trailing underscores as a potential way of attempting this.
We could always just patch the openblas source with a global s/cblas/jl_cblas/ substitution.
Isn't this mostly a visibility issue? Can we restrict openblas's symbols to not be visible to dlopen'ed shared libraries?
@mlubin, you're right that this would be the simplest option, if we can do it on all the relevant platforms. Is there a magic linker flag for this (analogous to RTLD_LOCAL in dlopen)?
Looks like if you want to avoid patching you need to use a linker script.
@pao, it looks like the link you found is for preventing some symbols from being exported at all. That's not what we want here. We want to export symbols to Julia, but not re-export them to other shared libraries.
Ah, sorry, I didn't catch that subtlety from @mlubin's comment; I see it now. I'm not deep enough on visibility to know whether that's even possible, though a cursory search didn't turn anything up.
This looks relevant. Some combination of -Bsymbolic or -Bsymbolic-functions, and/or creating wrappers ourselves with a prefix/suffix on the function names may work, if OpenBlas' build system can't easily be made to do what we want.
We could always just patch the openblas source with a global s/cblas/jl_cblas/ substitution.
If only. OpenBlas is full of preprocessor defines (and some perl? https://github.com/xianyi/OpenBLAS/blob/develop/exports/gensymbol looks promising) that obfuscate function naming (in particular NAME and CNAME), I'm having a hard time figuring out how it works.
Aha, looks like https://github.com/xianyi/OpenBLAS/blob/develop/Makefile.system#L776 is where NAME and CNAME are getting set.
I was just discussing this with @jiahao, and the easiest solution seems to be to use the GNU objcopy utility to just add a prefix jl_ to all exported symbols from libopenblas after it is compiled.
That way, we don't need to hack the OpenBLAS source.
The only downside is that using Julia with MKL might be a pain, but there are probably ways around this with a @blas macro to generate the ccalls with or without the prefix.
That sounds easier - would renaming dgemm_ to jl_dgemm_ then cause a problem for any Lapack routines that try to call dgemm_, or would objcopy fix the reference too?
there are probably ways around this with a @blas macro to generate the ccalls with or without the prefix
See also #2167 (will be needed if anyone ever wants to use MKL on Windows or Intel Fortran anywhere) and #4290. It's not very well-documented, but Matlab lets you switch Blas and Lapack via environment variables. Putting that runtime-switching (or startup, or sysimg-build-time) abstraction layer into Julia will be useful as long as it doesn't introduce a noticeable performance penalty.
I don't think runtime switching will be possible since MKL's libraries would not have the jl_ prefixes that the compiled Julia wrapper functions would be conditioned to expect.
@tkelman, objcopy will rename both the exported symbols and all references to them within the object code, so BLAS calls within LAPACK should not be a problem since libopenblas includes both LAPACK and BLAS. (I just double-checked this. It pretty much has to work this way, of course, for symbol renaming to be usable.)
Another likely instance of this: https://github.com/lruthotto/MUMPS.jl/issues/2
Having to rebuild the system image to change Julia's Blas backend wouldn't be too bad.
The number of library wrapper packages that depend on Blas and Lapack is already pretty high and will continue to grow. Most of these libraries should have decent facilities for configuring them with different Blas libraries at compile time. It'll be good to standardize an approach for providing a Blas library from Julia to library packages, for performance, reducing duplication, and cross-platform uniformity (no such thing as "system Blas" on Windows, and we want our library packages to work on Windows don't we?). The LP64 vs ILP64 issue is part of this, and it may require providing an LP64 Blas library with the default function names for packages, while Julia itself uses an ILP64 Blas with prefixed function names.
So is "using the GNU objcopy utility to just add a prefix jl_ to all exported symbols from libopenblas after it is compiled" a good solution? If so, what needs to be done to make it work?
@ufechner7, two things: (a) the Makefile needs to be updated to make the requisite call to objcopy and (b) base/linalg/blas.jl etcetera need to be updated to change all ccalls to BLAS and LAPACK routines with e.g. a @blascall(...) macro that prepends the jl_ prefix to the symbol (we want a macro here so that it can be easily changed, e.g. to call MKL).
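The naming rule behind such a macro is small enough to sketch. The snippet below is written in Python rather than as a Julia macro, and the function name blas_symbol and its prefix parameter are hypothetical, chosen only for illustration. It shows how a single configuration value could switch between prefixed and standard symbol names.

```python
def blas_symbol(name: str, prefix: str = "") -> str:
    """Return the symbol to look up for a Fortran-style BLAS routine.

    name is the base routine name without the trailing underscore, e.g. "dgemm".
    prefix is hypothetical build-time configuration: "jl_" for a renamed
    ILP64 OpenBLAS, "" for a library that keeps the standard names (e.g. MKL).
    """
    return f"{prefix}{name}_"


print(blas_symbol("dgemm", prefix="jl_"))  # jl_dgemm_
print(blas_symbol("dgemm"))                # dgemm_
```

Keeping the rule in one place is what would make it cheap to change later, which is the reason given above for wanting a macro instead of hard-coded names.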
Did anyone start experimenting with this to see how feasible it is?
Not yet, as far as I know. I only tried out objcopy to verify that it could rename the symbols.
I tried cp libopenblas.so libjlopenblas.so; objcopy --prefix-symbols=jl_ libjlopenblas.so
then
julia> n = 5; a = rand(n); b = rand(n); inca = 1; incb = 1;
julia> y = ccall((:jl_ddot_, "libjlopenblas"), Float64, (Ptr{Int}, Ptr{Float64}, Ptr{Int}, Ptr{Float64}, Ptr{Int}), &n, a, &inca, b, &incb)
ERROR: ccall: could not find function jl_ddot_ in library libjlopenblas
in anonymous at no file
So something's missing. nm libjlopenblas.so | grep ddot does return the expected
00000000000f47b0 T jl_cblas_ddot
00000000000f3aa0 T jl_ddot_
0000000000f29200 T jl_ddot_k_ATOM
0000000000c1ce00 T jl_ddot_k_BARCELONA
0000000000dbf200 T jl_ddot_k_BOBCAT
0000000001299e00 T jl_ddot_k_BULLDOZER
00000000004b4e00 T jl_ddot_k_CORE2
0000000000703a00 T jl_ddot_k_DUNNINGTON
0000000001013400 T jl_ddot_k_NANO
0000000000808c00 T jl_ddot_k_NEHALEM
0000000000932800 T jl_ddot_k_OPTERON
0000000000aa7e00 T jl_ddot_k_OPTERON_SSE3
00000000005de000 T jl_ddot_k_PENRYN
00000000013d8000 T jl_ddot_k_PILEDRIVER
0000000000320600 T jl_ddot_k_PRESCOTT
000000000113c200 T jl_ddot_k_SANDYBRIDGE
so maybe some additional steps are required?
On Windows there is a not-that-hard option that works, by making the following change to this file in OpenBLAS
--- exports/gensymbol 2014-08-11 20:56:12.014049400 -0700
+++ exports/jl_gensymbol 2014-08-11 20:55:22.566221200 -0700
@@ -2833,22 +2833,22 @@
foreach $objs (@underscore_objs) {
$uppercase = $objs;
$uppercase =~ tr/[a-z]/[A-Z]/;
- print "\t$objs=$objs","_ \@", $count, "\n";
+ print "\tjl_$objs=$objs","_ \@", $count, "\n";
$count ++;
- print "\t",$objs, "_=$objs","_ \@", $count, "\n";
+ print "\tjl_",$objs, "_=$objs","_ \@", $count, "\n";
$count ++;
- print "\t$uppercase=$objs", "_ \@", $count, "\n";
+ print "\tjl_$uppercase=$objs", "_ \@", $count, "\n";
$count ++;
}
foreach $objs (@need_2underscore_objs) {
$uppercase = $objs;
$uppercase =~ tr/[a-z]/[A-Z]/;
- print "\t$objs=$objs","__ \@", $count, "\n";
+ print "\tjl_$objs=$objs","__ \@", $count, "\n";
$count ++;
- print "\t",$objs, "__=$objs","__ \@", $count, "\n";
+ print "\tjl_",$objs, "__=$objs","__ \@", $count, "\n";
$count ++;
- print "\t$uppercase=$objs", "__ \@", $count, "\n";
+ print "\tjl_$uppercase=$objs", "__ \@", $count, "\n";
$count ++;
}
@@ -2857,15 +2857,15 @@
$uppercase = $objs;
$uppercase =~ tr/[a-z]/[A-Z]/;
- print "\t",$objs, "_=$objs","_ \@", $count, "\n";
+ print "\tjl_",$objs, "_=$objs","_ \@", $count, "\n";
$count ++;
- print "\t$uppercase=$objs", "_ \@", $count, "\n";
+ print "\tjl_$uppercase=$objs", "_ \@", $count, "\n";
$count ++;
}
foreach $objs (@no_underscore_objs) {
- print "\t",$objs,"=$objs"," \@", $count, "\n";
+ print "\tjl_",$objs,"=$objs"," \@", $count, "\n";
$count ++;
}
My ccall test with a prefixed jl_ddot_ works with a libopenblas.dll generated based on this modification.
@tkelman, does that rename all of the functions or just the generated ones? e.g. we also want to rename functions like openblas_set_num_threads.
@stevengj it renames everything that's exported from the dll, including openblas_set_num_threads.
I figured out why objcopy isn't working. It evidently can't rename dynamic symbols, unless it has learned some new tricks since http://sourceware-org.1504.n7.nabble.com/objcopy-redefine-sym-on-dynsym-section-td119610.html
[tkelman@static-host lib]$ objdump -T libjlopenblas.so | grep ddot
0000000000dbf200 g DF .text 0000000000000591 Base ddot_k_BOBCAT
0000000000aa7e00 g DF .text 0000000000000569 Base ddot_k_OPTERON_SSE3
00000000005de000 g DF .text 0000000000000559 Base ddot_k_PENRYN
0000000001299e00 g DF .text 0000000000000341 Base ddot_k_BULLDOZER
00000000004b4e00 g DF .text 0000000000000551 Base ddot_k_CORE2
0000000000f29200 g DF .text 0000000000000325 Base ddot_k_ATOM
0000000000320600 g DF .text 0000000000000581 Base ddot_k_PRESCOTT
0000000000808c00 g DF .text 0000000000000591 Base ddot_k_NEHALEM
0000000000703a00 g DF .text 0000000000000529 Base ddot_k_DUNNINGTON
00000000000f3aa0 g DF .text 000000000000005d Base ddot_
0000000000932800 g DF .text 000000000000056e Base ddot_k_OPTERON
0000000001013400 g DF .text 0000000000000591 Base ddot_k_NANO
00000000013d8000 g DF .text 0000000000000341 Base ddot_k_PILEDRIVER
0000000000c1ce00 g DF .text 0000000000000591 Base ddot_k_BARCELONA
000000000113c200 g DF .text 0000000000000591 Base ddot_k_SANDYBRIDGE
00000000000f47b0 g DF .text 0000000000000055 Base cblas_ddot
Anyone have any suggestions? I tried messing with some of the CNAME definitions in OpenBLAS' Makefile.system but that led to several undefined symbols, a bad mix of renamed and not-renamed functions. @xianyi any suggestions for applying a global prefix (or suffix, if that's easier) to all functions exported from the openblas shared library, on Linux and OSX?
Would loading with RTLD_LOCAL help?
@nbecker, this was discussed above. One obstacle to RTLD_LOCAL seems to be that we are not loading OpenBLAS with dlopen, but are rather linking libopenblas.so directly to the julia executable, so we have to figure out if there is a corresponding linker flag. I did a quick search through the man page of GNU ld and didn't see anything, but it has a zillion options and it's possible I missed something.
(This problem mainly seems to show up on GNU/Linux, so I think we need something that works with GNU ld.)
@stevengj I believe we are dlopen'ing OpenBLAS, albeit implicitly just by ccall'ing some BLAS function and passing Base.libblas_name in as the library handle. We could probably explicitly dlopen libblas in an initialization function somewhere and pass in RTLD_LOCAL if we want to.
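For readers unfamiliar with the flag, here is a minimal sketch of an explicit load with RTLD_LOCAL, using Python's ctypes instead of Julia. The helper name load_private is hypothetical, and the sketch assumes a Unix-like system where os.RTLD_LOCAL and os.RTLD_NOW are defined. It is not a claim about how Julia loads OpenBLAS.

```python
import ctypes
import os

# RTLD_LOCAL keeps the library's symbols out of the global lookup scope, so
# libraries loaded later do not resolve their own dgemm_ references against it.
LOCAL_MODE = os.RTLD_NOW | os.RTLD_LOCAL


def load_private(path: str) -> ctypes.CDLL:
    """Open a shared library without making its symbols globally visible."""
    return ctypes.CDLL(path, mode=LOCAL_MODE)
```

A caller would pass the path of its own BLAS build to load_private and look up routines on the returned handle. Whether this fully avoids the conflict when the library is also linked directly into the executable is exactly the open question in this thread.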
It's definitely been a problem in packages on Macs too. There's an osx.def file in OpenBLAS which gets created by the same Perl script gensymbol then linked using -Wl,-exported_symbols_list,osx.def, I can't really test that though as I don't have a Mac.
I think I found a solution. We can't use objcopy on the shared library because it can't rename dynamic symbols, but I just tried it on the static library right before linking the .so and that works. It passes my jl_ddot_ test, anyway:
--- exports/Makefile-old 2014-08-20 20:47:51.000000000 -0700
+++ exports/Makefile 2014-08-20 20:45:16.000000000 -0700
@@ -103,7 +103,10 @@
so : ../$(LIBSONAME)
-../$(LIBSONAME) : ../$(LIBNAME) linktest.c
+../$(LIBSONAME) : ../$(LIBNAME) linktest.c aix.def
+ rm -f prefix.def
+ for i in `cat aix.def`; do echo "$$i jl_$$i" >> prefix.def; done
+ objcopy --redefine-syms prefix.def ../$(LIBNAME)
ifneq ($(C_COMPILER), LSB)
$(CC) $(CFLAGS) $(LDFLAGS) -shared -o ../$(LIBSONAME) \
-Wl,--whole-archive ../$(LIBNAME) -Wl,--no-whole-archive \
I'm using aix.def as a simple list of exported symbols. objcopy --prefix-symbols=jl_ ../$(LIBNAME) went a little overboard renaming everything in the static library (including things from libm, pthreads, libgfortran, etc), it couldn't link the .so from it afterwards.
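The shell loop in the patch above writes one "old new" pair per line, which is the format objcopy --redefine-syms reads. As a rough illustration, the same mapping can be generated in Python. The function name make_redefine_map and the three sample symbols are assumptions made for this sketch, not the real contents of aix.def.

```python
def make_redefine_map(symbols, prefix="jl_"):
    """Build the text of an objcopy --redefine-syms file.

    Each output line has the form "old new", one symbol pair per line.
    """
    return "".join(f"{sym} {prefix}{sym}\n" for sym in symbols)


# Hypothetical excerpt of an export list such as aix.def.
exported = ["ddot_", "dgemm_", "cblas_ddot"]
print(make_redefine_map(exported), end="")
```

Because only the listed names are renamed, symbols that the static library imports from libm, pthreads or libgfortran are left alone, unlike with --prefix-symbols.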
Great!
If it is true that libopenblas is linked via dlopen (and I believe that is a correct statement), then in my opinion using RTLD_LOCAL is a much cleaner solution.
We don't have an explicit call to dlopen in Julia. My recollection was that @JeffBezanson wanted to avoid replacing ccalls with explicit dlopen calls in order to ease eventual static compilation. Jeff, do you have an opinion here?
Well here's something to think about: if openblas were statically compiled into julia, is it possible to hide the symbols like with RTLD_LOCAL? If not, then there's really no choice but to rename the symbols.
Maybe this ld option?
--exclude-libs lib,lib,...
Specifies a list of archive libraries from which symbols should not be automatically exported. The library names may be delimited by commas or colons. Specifying "--exclude-libs ALL" excludes symbols in all archive libraries from automatic export. This option is available only for the i386 PE targeted port of the linker and for ELF targeted ports. For i386 PE, symbols explicitly listed in a .def file are still exported, regardless of this option. For ELF targeted ports, symbols affected by this option will be treated as hidden.
From a brief look at the source code, it looks like openblas is loaded via ccall.
I was thinking perhaps an optional flag to ccall to pass RTLD_LOCAL?
If Jeff wants to cut down on dlopen and just use ccall to implicitly open them, I believe that's just so that when we do have the infrastructure, we can forgo dlopening at all, as everything will be statically linked together, and the dlopen call itself could fail.
In that case, one could imagine special flags to be passed to dlopen() to hint to the static compiler that this dlopen() call can be ignored during static compilation, or even better, to hint to the static compiler that the symbols being imported from this library should not be exported, since during static compilation we could reintroduce the problem without that knowledge. In either case, until we get static compilation and know the requirement, I don't think we should take using dlopen off the table. (Unless of course, Jeff shows up in this thread and proves me wrong!)
@nbecker, I think the --exclude-libs option might prevent us from calling the functions dynamically (hence from Julia) at all, unless each and every ccall to BLAS is compiled statically.
I mentioned --exclude-libs to address static linking, which I thought Miles Lubin had suggested. That is, I thought the suggestion was to build julia statically linking to libopenblas.a. In that case, it sounded like --exclude-libs might be useful.
Statically linking openblas will make the julia binaries huge, I don't think that's a serious option. I'd prefer doing something that works the same way across all platforms. Using 64-bit integers in openblas by default was a bit of a cavalier choice in terms of compatibility with other libraries, and unless we want to reverse that choice we need to do something to mitigate the compatibility problems. (Yes Matlab made the same choice, but ask anyone who builds mex files that depend on blas, this same issue is a big problem there too.)
So can someone test if tweaking gensymbol and/or osx.def to add prefixes on the exported symbols works on Mac too?
@tkelman On OSX, the generated .def file isn't a mapping, it's just a list of symbols. So changing the list of symbols doesn't change much, unfortunately. Unless I'm misunderstanding something.
Oh. carp. Can you try a similar patch to the above #4923 (comment), but on these OSX lines instead? https://github.com/xianyi/OpenBLAS/blob/a69dd3fbc5c38f7098d1539a69963c0d2bd3163a/exports/Makefile#L96-L97
I'm not sure whether you should use osx.def as a base (with the leading underscore on everything), or aix.def (without leading underscores).
We don't have objcopy on OSX. :P Le sigh.
On the plus side, I found a really neat tiny utility called objconv that I think will make our lives easier. It can manipulate PE, ELF and Mach-O, and it even has a method to replace the prefix of all symbols with a different prefix. After compiling it, (which was a refreshing exercise in simplicity), I applied this patch to OpenBLAS:
diff --git a/exports/Makefile b/exports/Makefile
index c798bc7..08f413a 100644
--- a/exports/Makefile
+++ b/exports/Makefile
@@ -93,8 +93,18 @@ libopenblas.def : gensymbol
libgoto_hpl.def : gensymbol
perl ./gensymbol win2khpl $(ARCH) dummy $(EXPRECISION) $(NO_CBLAS) $(NO_LAPACK) $(NO_LAPACKE) $(NEED2UNDERSCORES) $(ONLY_CBLAS)
-$(LIBDYNNAME) : ../$(LIBNAME) osx.def
- $(FC) $(FFLAGS) -all_load -headerpad_max_install_names -install_name $(CURDIR)/../$(LIBDYNNAME) -dynamiclib -o ../$(LIBDYNNAME)
+../$(LIBNAME).patched: ../$(LIBNAME) osx.def
+ # Build parameter file for objconv
+ rm -f objconf.params
+ for i in `cat osx.def`; do \
+ echo "-nr:$$i:_jl$$i" >> objconf.params; \
+ done
+ objconv @objconf.params ../$(LIBNAME) ../$(LIBNAME).patched
+
+$(LIBDYNNAME) : ../$(LIBNAME).patched osx.def
+ # We want to avoid the LAPACK symbols stuff
+ sed -e 's/.*/_jl&/' osx.def | grep -v LAPACK > osx.def.patched
+ $(FC) $(FFLAGS) -all_load -headerpad_max_install_names -install_name $(CURDIR)/../$(LIBDYNNAME) -dynamiclib -o ../$(LIBDYNNAME)
dllinit.$(SUFFIX) : dllinit.c
$(CC) $(CFLAGS) -c -o $(@F) -s $<
That seems workable; I guess objconv could be added to deps as a build dependency. (It is GPL, but that is irrelevant here since we aren't actually linking objconv into Julia, just using its output.)
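The parameter file built in the patch above holds one -nr:old:new rename argument per exported symbol. A small Python sketch of that generation step follows. The function name objconv_rename_args and the two sample symbols are hypothetical, and the real osx.def contents are not reproduced here.

```python
def objconv_rename_args(symbols, prefix="_jl"):
    """Return one objconv -nr:old:new argument per exported symbol.

    Mach-O symbols carry a leading underscore, so "_dgemm_" becomes
    "_jl_dgemm_" when the prefix "_jl" is placed in front of the old name.
    """
    return [f"-nr:{sym}:{prefix}{sym}" for sym in symbols]


# Hypothetical excerpt of osx.def.
for arg in objconv_rename_args(["_dgemm_", "_ddot_"]):
    print(arg)
```

Prepending "_jl" to a name that already starts with an underscore mirrors the echo line in the patch, so the renamed symbol appears as jl_dgemm_ once the platform's leading underscore is stripped.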
Yeah, it's just like patchelf.
Okay, so have we determined the way to move forward here?
- Incorporate objconv as an osx-only dependency
- Patch openblas using some combination of the above snippets to add prefixes on all symbols in the BLAS64 case (I think we want to prefix even the lapacke stuff too - we may not be using those but someone will eventually want to ccall some library that does, expecting 32-bit-ints)
- Write a macro to prefix all blas and lapack symbols used in ccalls in Base, but only when we're using a 64-bit-int openblas that we know we built from source
That seems like the way to move forward here. We can use this trick for openlibm too.
Why osx-only? We need to rename the symbols on Linux too.
On Linux, we already have patchelf. So step 1 is done. What about windows?
Not patchelf, we use objcopy here for Linux. On Windows it was sufficient to patch the gensymbol perl script.
As if this thread was not complex enough yet, I'd like to add the use case of distribution packages to the list. :-)
On Fedora for example, ILP64 OpenBLAS is in a separate library called libopenblas64.so. But there's little chance the symbols in this file will be given a prefix to distinguish them from their LP64 counterparts, as Julia is not the only user of that package. A standard prefix (like 64) could be applied upstream, but then it would mean programs could not easily switch between OpenBLAS and MKL (not all languages have macros as flexible as Julia). A solution would be to build two versions of ILP64 OpenBLAS, one with standard names, and one with the prefix, so that all programs are happy, but this would entail a large amount of duplication (and there are already 2 x 3 copies of OpenBLAS, for 32/64-bit, and for serial, OpenMP and pthreads).
Admittedly, this is also upstream's and distributions' task to make sure ILP64 and LP64 libraries can happily cohabit. Since this issue does not only affect Julia, shouldn't something be done in coordination with upstream and distributors?
(That doesn't mean the fix suggested above isn't useful for other contexts.)
Are any distribution packages for Julia using ILP64 openblas? I wasn't aware that any distributions had ILP64 blas packages. Is there a way to do a reverse-dep search to get a rough survey, within the distributions that have ILP64 blas implementations packaged and available, of what other client packages are making use of them?
More coordination absolutely makes sense. This is a major problem that cuts across multiple distributions, operating systems other than Linux, programming languages, and use cases. I don't think there's any sane way for ILP64 and LP64 to coexist that satisfies every possible combination here - it's almost impossible to know ahead of time that there will never be someone who wants to combine functionality from a library that decided to use ILP64 with a library that didn't. Unless you want to introduce the burden of requiring that every ILP64 library also provide a separate LP64 implementation (my guess is most of them are familiar enough with this issue that they already are, even if not required to...).
I've personally been starting to think that ILP64 BLAS is more trouble than it's worth. Once you get up to multiple gigabytes of data and you want to do dense linear algebra, even just BLAS1 (which can mostly be done equivalently in Julia anyway) on a huge vector, you're probably better off working in distributed memory and figuring out how to partition your data more sanely so you don't have to think about all of it at once. Is there such a crazy person doing BLAS2, BLAS3, or LAPACK in a single shared memory space with arrays whose dimensions are larger than 32 bits?
It is common for libraries to export their functions under multiple names. This is usually done via "weak symbols" or so, and does not require any code duplication. For example, name mangling for Fortran is not standardized, and many Fortran libraries export 2 or 3 different names for each function (e.g. "DGEMM", "dgemm_" and "dgemm__").
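As a small illustration of the multiple-names convention, the Python sketch below lists the manglings mentioned above plus the bare lowercase name for one routine. The function name fortran_name_variants is hypothetical, and the sketch does not claim which variant any particular compiler emits.

```python
def fortran_name_variants(routine: str):
    """List common Fortran manglings of one routine name."""
    lower = routine.lower()
    return [lower.upper(), lower, lower + "_", lower + "__"]


print(fortran_name_variants("dgemm"))
# ['DGEMM', 'dgemm', 'dgemm_', 'dgemm__']
```

A library exporting all of these as aliases of one implementation would satisfy callers built with different mangling rules, but every alias would still share a single integer width.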
@tkelman ILP64 OpenBLAS has been added recently to Fedora, in part on my request. Apparently no package uses it yet, so it may still be time to fix things. I think other distributions do not provide it.
We can probably find a solution with @xianyi, but the problem is that MKL already seems to use identical symbols for LP64 and ILP64 (is that right?), and I guess it will be hard to get them to release an additional version with modified symbols. Though they may accept @eschnett's "weak symbols" solution if it's considered standard enough.
Regarding the need for ILP64, using it by default might not be the utmost priority, but I've seen on the Web several people requiring ILP64 BLAS, for example SuiteSparse's author Tim Davis here:
You might wonder if I would be insane enough to contemplate a matrix larger
than 2^32 by 2^32. I'm not. When using the BLAS in an unsymmetric sparse
factorization code, you can get very tall and thin (or short and squat) dense
submatrices, where just one of the dimensions m or n is larger than 2^31 (k,
in my case, is limited to a small constant in dgemm). The total problem size
could still be just a handful of GBytes (but more than 4GB), even if one of m
or n (but not both) in a call to dgemm is larger than 2^31.
The more important part of that quote (from a surprisingly long time ago, 2001) is this:
in the Sun Performance Library, there is a 64-bit routine: void dgemm_64
Someone at Sun was thinking ahead. Shame they probably weren't the same ones running the business side of things, but I digress.
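To put rough numbers on the tall-and-thin case from the Tim Davis quote, here is a short worked example in Python. The dimensions are made up (the quote gives no concrete sizes) and double precision is assumed.

```python
INT32_MAX = 2**31 - 1

# Hypothetical tall, thin dense block: one dimension just over 2^31,
# the other dimension 1.
m, n = 2**31 + 1, 1
bytes_needed = m * n * 8  # 8 bytes per double-precision entry

print(m > INT32_MAX)         # True: m does not fit in a signed 32-bit int
print(bytes_needed / 2**30)  # size in GiB, about 16
```

So a single dimension can overflow a 32-bit integer while the whole block still fits in the memory of one large machine, which is the situation an ILP64 interface is meant to handle.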
MKL already seems to use identical symbols for LP64 and ILP64 (is that right?)
Looking at nm /opt/intel/composer_xe_2011_sp1.11.339/mkl/lib/intel64/libmkl_intel_ilp64.a | grep 64, the Pardiso and FFTW symbol names have either _64 or _ilp64 suffixes on them, the typical blas/lapack symbols do not. (Edit: misread the output of nm, Pardiso symbols have the same suffix in the _lp64.a library - the object names also have suffixes, but only a few of the symbols do) However MKL is not included in any Linux distributions, the general assumption when you use MKL is that you have to recompile everything against MKL. Leading to ridiculous situations like this, in Python land: http://www.lfd.uci.edu/~gohlke/pythonlibs/
I don't know whether weak symbols will work cross-platform. What happens to the original names when you use weak symbols? If it's still possible to get name conflicts if the original names are still exported, I'm not sure if that approach helps.
Good to hear! So if Intel has already found a naming convention (or even two) for ILP64 MKL, then upstream OpenBLAS could simply use the same names, or advise packagers to do so -- either with weak symbols (if that works), or by making it easy to completely rename functions with a compile-time flag. @xianyi How do you feel about that?
No, I was misreading the output of nm and I think you misinterpreted what I said. MKL does not have separate naming conventions, by and large. Sun did, but Sun has gone the way of the dodo - apparently you can still buy SunPerf from Oracle, but does anyone? http://docs.oracle.com/cd/E24457_01/html/E21987/gkezy.html. In a few small places MKL has naming conventions, but only in spblas or other features that OpenBLAS does not provide. For dense blas and lapack, MKL allows you to switch between ILP64 and LP64 without having to change function names, only integer types (you'll need to recompile and re-link). So the symbols do conflict and you should absolutely never have an application (or more likely two separate unrelated parts that some other application wants to use in combination) that tries to use ILP64 and LP64 at the same time.
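One way to see such a conflict is to compare the defined symbols of two libraries. The following is a minimal Python sketch that works on the text nm prints for each library. The two listings and the helper names (defined_symbols, conflicting_symbols, some_ilp64_only_helper) are hypothetical and heavily shortened; they are not real MKL or OpenBLAS output.

```python
def defined_symbols(nm_output):
    """Return names that an nm listing marks as defined code (type T or W)."""
    names = set()
    for line in nm_output.splitlines():
        fields = line.split()
        # Defined symbols print as "<address> <type> <name>";
        # undefined ones ("U name") have no address column and are skipped.
        if len(fields) == 3 and fields[1] in ("T", "W"):
            names.add(fields[2])
    return names


def conflicting_symbols(nm_a, nm_b):
    """Symbols defined by both libraries, i.e. candidates for a runtime clash."""
    return sorted(defined_symbols(nm_a) & defined_symbols(nm_b))


# Hypothetical listings for an LP64 build and an ILP64 build (assumption).
LP64_NM = """\
0000000000001000 T dgemm_
0000000000002000 T cblas_dgemm
                 U malloc
"""
ILP64_NM = """\
0000000000001000 T dgemm_
0000000000002000 T cblas_dgemm
0000000000003000 T some_ilp64_only_helper
"""

print(conflicting_symbols(LP64_NM, ILP64_NM))
```

Any name that shows up in the result is exported by both builds under the same spelling, so the runtime linker is free to resolve a call to either one.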
A compile-time flag for adding a prefix or suffix would be ideal, that's what we asked for, but it's looking like we have to figure out how to do it ourselves across all platforms we care about. We mostly have figured out a hacky way that should work, but it requires adding extra dependencies that only apply on a system that I don't have (why you gotta suck, osx binutils?) so it's tough for me to make much progress there.
Ah, sorry, I thought that by "typical" you meant LP64. So the problem (only considering the case of distribution packages, and even ignoring technical issues on OS X) is that a compile-time prefix would make ILP64 OpenBLAS completely incompatible with all other BLASes, meaning that very few programs will switch to it. Or it will force shipping two copies with different symbol names.
Let's ask @susilehtola, the maintainer of OpenBLAS in Fedora.
a compile-time prefix would make ILP64 OpenBLAS completely incompatible with all other BLASes
Yes, a compile-time prefix would introduce an API incompatibility in the function names, not just their types. So for pieces of software that are set up to easily switch their integer types (which is far from all pieces of software...) but not set up to easily change the function names with which they call blas, this would make it harder to use the ILP64 blas. However my argument is this is a feature not a bug, since changing the integer type without changing the function name introduces far more subtle ABI incompatibilities, which may only be exhibited at runtime by some entirely separate application.
Julia works just fine with ILP64 openblas when all integers are internally 64 bit, but what happens when you then call Ipopt from that same process, with Ipopt linked to a conventional shared-library LP64 blas? Segfaults. The exact same thing happens in Matlab, which uses ILP64 MKL without changing the symbol names.
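A rough illustration of why the mix goes wrong, as a Python sketch rather than a real BLAS call: Fortran-style BLAS routines receive their integer arguments by reference, so an ILP64 routine reads 8 bytes at an address where an LP64 caller stored only 4. The byte layout below (little-endian, with an unrelated 32-bit value of 7 sitting next to n) and the function name are assumptions chosen for the example.

```python
import struct


def value_seen_by_ilp64_callee(memory, offset=0):
    """Read 8 bytes as a little-endian signed 64-bit integer, as an ILP64 routine would."""
    return struct.unpack_from("<q", memory, offset)[0]


# An LP64 caller stores n = 3 as a 32-bit integer.
# The next 4 bytes hold unrelated data (here the value 7, an assumption).
caller_memory = struct.pack("<ii", 3, 7)

n_intended = struct.unpack_from("<i", caller_memory, 0)[0]
n_seen = value_seen_by_ilp64_callee(caller_memory)

print(n_intended)  # the dimension the caller meant to pass
print(n_seen)      # the dimension the ILP64 routine reads: 3 + 7 * 2**32
```

With a dimension that large the routine walks far outside the caller's arrays, which is consistent with the crashes described here. Because the function names are identical, nothing flags the mismatch at link time.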
Or it will force shipping two copies with different symbol names.
I don't think this is a good idea. People who use an ILP64 blas should really know exactly what they're doing and be well aware that leaving the symbol names alone makes their library impossible to use in combination with a huge amount of pre-existing numerical code (unless things are carefully statically linked, which is not how Julia works).
Let's ask @susilehtola, the maintainer of OpenBLAS in Fedora.
Yes, getting more input would be useful, sorry for the walls of text.
Uhm.. What do you want to ask?
@susilehtola Sorry, the thread is quite long. The relevant part starts at #4923 (comment). Basically, we're wondering whether it would make sense to rename all symbols in the ILP64 OpenBLAS library (e.g. adding a 64 suffix), so that programs which link to it do not crash if they also use a library which links to the LP64 version.
Duplicating system libraries is nasty stuff and forbidden in linux distributions.
So if you were to do that, it would be reverted at least in Fedora.
@susilehtola, if we linked with a system OpenBLAS, or some other BLAS implementation (e.g. MKL), we would not add the 64 suffix (via a macro in the Julia code). So, this wouldn't affect Fedora packages.
(For a Julia in Fedora distro and linked to the Fedora BLAS, this is not really an issue because presumably on that system all libraries are linked to the same BLAS. The problem arises when people have multiple ABI-incompatible BLAS implementations on the same machine, e.g. one from the julialang.org binary download and one from their distro, and then libraries get confused.)
I think this discussion has gotten a bit off-track: we really only need the suffix for the case when we are distributing/compiling an ILP64 OpenBLAS ourselves.
Well, the issue is still in distributions as well. For instance in Fedora you have reference BLAS/LAPACK, ATLAS and OpenBLAS, all of which ship the same symbols for API compatibility. And, for reference BLAS/LAPACK and OpenBLAS, also 64-bit interface versions exist.
All of these can cause unpredictable behavior and crashes if mixed together.
This is just an unfortunate issue with the numerical libraries. The calls to BLAS/LAPACK functions really should be translated at compile time to calls to implementation specific functions, as has been suggested above. But, this is really wishing for too much.
Is there some approach that you can suggest that would help to resolve this problem?
My feeling is that we should rename the symbols when making our own binaries, and not worry too much about Fedora etcetera (where we will use whatever ABI they want).
(We have to support both suffixed and non-suffixed ABIs anyway because people want to use MKL, so there will be a Makefile switch and a corresponding macro in Base.)
Making sure that programs are linked to consistent BLAS libraries really seems like a distro issue to me. Fedora should make a decision about which BLAS ABI they want their scientific libraries (including NumPy) linked to, and be consistent. Then whatever choice they adopt can be used for their Julia package too.
@stevengj See https://fedorahosted.org/fpc/ticket/352 for a debate about how to handle the BLAS/LAPACK mess in Fedora. tl;dr: it's already hard enough to get everybody to agree on a scheme for fully-compatible LP64 BLAS that I don't think it would be easy to move all packages to ILP64 (which would break ABI...). So I agree Julia should find a solution for when BLAS is bundled, while we try to find another solution for distributions.
@susilehtola Do you see any path forward if we intend to move as many libraries as possible to ILP64 in Fedora? And to make them cohabit without crashing?
for reference BLAS/LAPACK and OpenBLAS, also 64-bit interface versions exist
If Fedora is starting to distribute ILP64 openblas and/or reference blas and lapack, this ABI issue is a major problem and should be worked out earlier rather than later. If you aren't going to change the symbol names, in my opinion the responsible thing for distributions to do is to mark ILP64 blas/lapack (any implementation) as conflicting with LP64 blas/lapack, so they cannot be simultaneously installed.
if we intend to move as many libraries as possible to ILP64
I think that's overly optimistic. Julia's blas and lapack interface code is nicely modularized and easily configurable to use different integer sizes, but that's not the case everywhere (maybe for code that is 100% Fortran, but is there such a thing any more?). Ipopt for example interfaces to external code written by a variety of authors in C, C++, Fortran 77, and Fortran 90. I'm not aware of any demand (except perhaps from Julia) to add a configuration option for using ILP64 blas with Ipopt, certainly not to the extent of anyone stepping up to write the very long invasive patch that would require.
Has anybody considered using symbol versioning? It's precisely made to allow loading incompatible ABIs in the same process, without changing the symbol names at all. The default ABI version could be called lp64 or 32, and another version would be ilp64 or 64; thus, applications designed for the LP64 BLAS would work fine. Julia would use dlvsym(handle, symbol, "ilp64") to get the interface it wants.
That would require building LP64 and ILP64 versions of OpenBLAS in the same library, not sure how hard that would be. One drawback is that only the GNU, BSD and Solaris linkers support symbol versioning.
@nalimilan symbol versioning might be an okay solution for linux distributions with this issue. How does Julia's ccall interact with symbol versioning though? And as I've mentioned this is an issue on Mac as well, and I'd prefer whatever choice we make in Julia to be as uniform across platforms as we can (despite differences in patches / build process to get there).
@nalimilan symbol versioning might be an okay solution for linux distributions with this issue. How does Julia's ccall interact with symbol versioning though?
I'm not very clear on how ccall works, but as I said above it should be possible to request a non-default interface (here, ILP64) using dlvsym.
And as I've mentioned this is an issue on Mac as well, and I'd prefer whatever choice we make in Julia to be as uniform across platforms as we can (despite differences in patches / build process to get there).
Yes, that would be much better, but so far I don't see a completely portable solution. The solution of adding 64 to all symbols could be used on all platforms as a fallback. And Linux distribution packages could use the symbol versioning approach: since it does not require adding the prefix to all function calls (just mentioning the ABI version you want once, for compiled calls), it will be much more suitable if programs other than Julia want to link to the ILP64 BLAS.
@susilehtola Any thoughts on this scenario?
I'm having a look at implementing this. objconv (http://www.agner.org/optimize/#objconv) apparently doesn't upload versioned source files, and it's in the wonderful format of a zip file within another zip file. The last change was just a couple weeks ago, so using an unversioned url would constantly flag checksum mismatches. Should we just rehost the source? (edit: nevermind, checksumming is nice to do when possible but it's not really mandatory) There might also be a way to use ld to achieve something similar via aliases? See http://stackoverflow.com/a/11951756 - can someone with a mac try that out?
I also checked where SuiteSparse links to BLAS/LAPACK functions, and there's already a bit of code there for the 64-bit SunPerf BLAS with functions suffixed by _64. See deps/SuiteSparse-4.3.1/CHOLMOD/Include/cholmod_blas.h. Looks like if we adopt that suffix, or patch that section of code (also in umfpack and spqr) to use our own prefix/suffix, we can just set -DSUN64 (or patch it to use our own define) and SuiteSparse should work. Arpack will be much messier, unfortunately. How's the replacing arpack effort coming along?
@jiahao can say when we can replace ARPACK. We have added lots of functionality around ARPACK, and fixed all the issues. We will soon have svds too, and a few other things that will make it feature complete.
It is not too tough to patch ARPACK to use 64-bit BLAS/LAPACK with a _64 suffix.
While I work on writing the patch for this, which approach would folks prefer?
1. Suffix all symbols by 64_, so we can use -DSUN64 without patching SuiteSparse, and make the macro Julia-side more uniform.
2. Suffix all symbols by _64, but before any trailing underscores in the function name. Can use SuiteSparse unpatched here, but the Julia-side macro would have to be a little more complicated.
3. Use a prefix like ilp64_ or jl_, would have to patch SuiteSparse for this.
In the absence of arguments in favor of 2 and 3, why not use the -DSUN64 convention? If it can help establish a standard, it would be a good thing, and other projects are more likely to accept supporting this if Julia is not the only project using this convention.
The only argument in favor of 2 or 3 would be that cblas_ddot64_ or openblas_set_num_threads64_ look a little funny.
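To compare the three naming options side by side, here is a small Python sketch of how each would rename a symbol. The function names are made up for this illustration, and the input dgemm_ assumes the common Fortran mangling with a single trailing underscore; this is not the actual build patch.

```python
def suffix_whole_symbol(symbol):
    """Option 1: append 64_ to the exported symbol as-is."""
    return symbol + "64_"


def suffix_before_trailing_underscores(symbol):
    """Option 2: insert _64 before any trailing underscores."""
    stem = symbol.rstrip("_")
    trailing = symbol[len(stem):]
    return stem + "_64" + trailing


def prefix_symbol(symbol, prefix="ilp64_"):
    """Option 3: prepend a prefix such as ilp64_ or jl_."""
    return prefix + symbol


for name in ("dgemm_", "cblas_ddot", "openblas_set_num_threads"):
    print(
        name,
        suffix_whole_symbol(name),
        suffix_before_trailing_underscores(name),
        prefix_symbol(name),
    )
```

Options 1 and 2 agree on Fortran-style names (both turn dgemm_ into dgemm_64_, which is the spelling the SUN64 code path in SuiteSparse expects) and differ only for names without a trailing underscore, where option 1 yields cblas_ddot64_ and option 2 yields cblas_ddot_64.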
Yeah, weird idea... It's also true that searching for e.g. dgemv_64_ doesn't give results other than the SuiteSparse file, so it doesn't look so popular.
And people using ILP64 BLAS libraries from Fortran have to worry about compiler-dependent name mangling.
Anyway, I've got step 2 from #4923 (comment) mostly done, I think I'll post a WIP PR soon so people can look at it.
Since there is no technical reason to prefer one suffix over another as far as I can see, any little thing tips the scales, so I would go with the SUN64 convention.
A good read on usage of versioned ELF shared libs:
http://www.akkadia.org/drepper/dsohowto.pdf
On Sun, Oct 19, 2014 at 10:56 AM, Steven G. Johnson <notifications@github.com> wrote:
Since there is no technical reason to prefer one suffix over another as far as I can see, any little thing tips the scales, so I would go with the SUN64 convention.
Those who don't understand recursion are doomed to repeat it