blas: Existence of complex blas?
Closed this issue · 18 comments
Hi, is there a plan to implement complex versions of the level 1 through 3 BLAS functionality? I guess both complex conjugate and Hermitian operations would need to be implemented.
Thoughts on this?
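For concreteness, the conjugation question shows up already at Level 1: the reference BLAS distinguishes zdotu (unconjugated dot product) from zdotc (conjugated, i.e. Hermitian, dot product). A minimal pure-Go sketch of the pair — illustrative only, not any existing gonum code:

```go
package main

import (
	"fmt"
	"math/cmplx"
)

// Zdotu computes the unconjugated dot product xᵀy.
func Zdotu(x, y []complex128) complex128 {
	var sum complex128
	for i, v := range x {
		sum += v * y[i]
	}
	return sum
}

// Zdotc computes the conjugated (Hermitian) dot product xᴴy,
// conjugating the elements of x, as the reference BLAS does.
func Zdotc(x, y []complex128) complex128 {
	var sum complex128
	for i, v := range x {
		sum += cmplx.Conj(v) * y[i]
	}
	return sum
}

func main() {
	x := []complex128{1 + 2i, 3}
	y := []complex128{2, 1i}
	fmt.Println(Zdotu(x, y)) // (2+7i)
	fmt.Println(Zdotc(x, y)) // (2-1i)
}
```

The two routines differ only in the conjugation of x, but both are needed: a real-valued ddot has no such distinction, which is one reason a float64→complex128 translation is not fully mechanical.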
I believe the eventual goal is to support complex matrices (in what would be matrix/mat128 or something). When this happens, having a complex128 BLAS is desirable. However, I have no timeline in mind for this (not sure about others). I imagine we would accept PRs for Complex128 implementations in native (assuming they were well implemented and had tests). Any work on (object-oriented) complex matrices should be discussed on gonum-dev first.
Please note that the C-wrappers to the complex BLAS are functional.
A sensible approach would be to leverage the effort @btracey has put into the native float64 implementation. To this end, the most desirable contribution would be a code-gen tool that works on native to generate the complex code from the float code.
Agreed. It would be especially nice in that the existing code is likely to be improved (for example, concurrency), and being able to modify complex128 with the new change automatically would be an important part of the development story.
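As a toy illustration of the idea only — an actual generator would operate on the Go AST and need many more rules (conjugation, naming conventions, etc.) — a purely textual float64→complex128 rewrite might start as:

```go
package main

import (
	"fmt"
	"strings"
)

// toyTranslate is a deliberately naive float64→complex128 source
// translator. It rewrites a few tokens textually; a real tool would
// parse the Go source and rename routines systematically.
func toyTranslate(src string) string {
	r := strings.NewReplacer(
		"float64", "complex128",
		"Dscal", "Zscal", // BLAS naming: D (double) → Z (double complex)
	)
	return r.Replace(src)
}

func main() {
	src := "func Dscal(n int, alpha float64, x []float64, incX int)"
	fmt.Println(toyTranslate(src))
	// func Zscal(n int, alpha complex128, x []complex128, incX int)
}
```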
Okay, that sounds reasonable. Historically there have been separate parallel and sequential implementations (PBLAS for BLAS and ScaLAPACK for LAPACK). So an incremental process would be to do a sequential implementation first, then a concurrent implementation behind a check on whether the maximum number of goroutines available to the user is > 1. If so, switch to the concurrent implementation automatically.
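A minimal sketch of that dispatch idea, assuming hypothetical dgemmSerial/dgemmConcurrent helpers for n×n row-major matrices (this is not the gonum implementation):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// dgemmSerial computes C += A*B for n×n row-major matrices.
func dgemmSerial(n int, a, b, c []float64) {
	for i := 0; i < n; i++ {
		for k := 0; k < n; k++ {
			aik := a[i*n+k]
			for j := 0; j < n; j++ {
				c[i*n+j] += aik * b[k*n+j]
			}
		}
	}
}

// dgemmConcurrent splits the rows of C across workers; each worker
// writes a disjoint block of rows, so no locking is needed.
func dgemmConcurrent(n int, a, b, c []float64, workers int) {
	var wg sync.WaitGroup
	rows := (n + workers - 1) / workers
	for w := 0; w < workers; w++ {
		lo, hi := w*rows, (w+1)*rows
		if hi > n {
			hi = n
		}
		if lo >= hi {
			continue
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				for k := 0; k < n; k++ {
					aik := a[i*n+k]
					for j := 0; j < n; j++ {
						c[i*n+j] += aik * b[k*n+j]
					}
				}
			}
		}(lo, hi)
	}
	wg.Wait()
}

// Dgemm dispatches: go concurrent only when more than one proc is available.
func Dgemm(n int, a, b, c []float64) {
	if p := runtime.GOMAXPROCS(0); p > 1 {
		dgemmConcurrent(n, a, b, c, p)
	} else {
		dgemmSerial(n, a, b, c)
	}
}

func main() {
	a := []float64{1, 2, 3, 4}
	b := []float64{5, 6, 7, 8}
	c := make([]float64, 4)
	Dgemm(2, a, b, c)
	fmt.Println(c) // [19 22 43 50]
}
```

Note that `runtime.GOMAXPROCS(0)` queries the current setting without changing it, so the check is cheap and race-free.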
ScaLAPACK is designed for use with MPI. In order to use ScaLAPACK a special compiler is needed, and the program needs to be designed with the single-source, multiple-run (or whatever the right terminology is) model in mind. Go already has built-in communication primitives that are very cheap. native.Dgemm is already implemented concurrently. Of course, the Go version only works on shared-memory machines, whereas ScaLAPACK can work on large clusters. Distributed-memory computation is outside the scope of gonum (though it is of course possible, and our current implementations provide a basis for that).
Agreed. ScaLAPACK was just an example. Basically my concern is that if there are goroutines being spawned and the user has a single-core, single-threaded processor, they will pay the overhead of the goroutines without any benefit.
Of course, with today's computers being what they are, this may not be a concern.
Nope. If you look at the implementation for Dgemm the number of workers is equal to GOMAXPROCS.
Okay, GREAT! This is what I was hoping for. Sorry, I haven't got to Dgemm yet...
@btracey, we might want to have a blas global that is used to control parallelism in Dgemm, something that is ignored when it's zero. What do you think?
Right now we allocate one goroutine per GOMAXPROCS, and perform a block-based matrix multiplication. From what I've read, this block-based approach is faster than the naive algorithm even in serial. If Issue 9129 is fixed, then cache-oblivious algorithms will be supported, and at least on Dmitry's blog, that implementation is also faster in serial than the naive algorithm. At that point, I'm not sure what such a global would gain us, although I suppose we could implement both a goroutine-based recursive version and a serial recursive version.
Clients may not want to allocate all GOMAXPROCS threads to Dgemm. The global would allow that.
That I understand. The Accelerate framework on OS X provides that ability, though it limits the number of threads, which are more expensive than goroutines.
If the global flag is a binary switch on concurrency (allow/don't allow), then this is something we can implement. With the current implementation of Dgemm, we can also respect such a limit by limiting the number of workers. However, it seems like the future is for gc to support efficient cache-oblivious parallel code. If we move to such an implementation (you can see a sketch of one at https://groups.google.com/forum/#!searchin/golang-nuts/matrix$20multiplication/golang-nuts/RVX-BhHcugM/Pi6Jlj2OPNgJ), then we either cannot enforce such a limit, or we'll hurt scalability by introducing a central mechanism. Additionally, as programs grow in size, such a limit seems meaningless in practice (there may be all sorts of concurrent calls below you). As the user, all you can really trust is that the Go scheduler will try its best.
That sketch has exactly a threshold on the number of workers.
No, Threshold specifies the smallest matrix you divide. The base case for the recursion is when all of the dimensions are small (< Threshold) so you just do the matrix multiplication.
Yes, but you can use a threshold on recursion depth for concurrent operation.
That's true.
It would be interesting to try that for Dgemm. I tried the cache-oblivious version, but it was way slower than the implementation we have now. I did not try it with a recursion-depth limit, which would reduce the cost significantly.
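A sketch of the recursion-depth idea: a cache-oblivious multiply that halves the largest dimension, spawning a goroutine for independent halves only while below a depth limit, then recursing serially. The names, the tiny threshold, and the structure here are illustrative, not the sketch from the mailing-list thread:

```go
package main

import (
	"fmt"
	"sync"
)

const threshold = 2 // tiny base case for demonstration; a real kernel would be larger

// mul computes C += A*B, where A is m×k, B is k×n, and C is m×n, all
// row-major with the given strides. It recursively halves the largest
// dimension; only row splits (which touch disjoint parts of C) spawn
// goroutines, and only while depth < maxDepth.
func mul(m, n, k int, a []float64, lda int, b []float64, ldb int, c []float64, ldc int, depth, maxDepth int) {
	if m <= threshold && n <= threshold && k <= threshold {
		for i := 0; i < m; i++ {
			for p := 0; p < k; p++ {
				aip := a[i*lda+p]
				for j := 0; j < n; j++ {
					c[i*ldc+j] += aip * b[p*ldb+j]
				}
			}
		}
		return
	}
	switch {
	case m >= n && m >= k: // split the rows of A and C
		h := m / 2
		if depth < maxDepth {
			var wg sync.WaitGroup
			wg.Add(1)
			go func() {
				defer wg.Done()
				mul(h, n, k, a, lda, b, ldb, c, ldc, depth+1, maxDepth)
			}()
			mul(m-h, n, k, a[h*lda:], lda, b, ldb, c[h*ldc:], ldc, depth+1, maxDepth)
			wg.Wait()
			return
		}
		mul(h, n, k, a, lda, b, ldb, c, ldc, depth, maxDepth)
		mul(m-h, n, k, a[h*lda:], lda, b, ldb, c[h*ldc:], ldc, depth, maxDepth)
	case n >= k: // split the columns of B and C (kept serial in this sketch)
		h := n / 2
		mul(m, h, k, a, lda, b, ldb, c, ldc, depth, maxDepth)
		mul(m, n-h, k, a, lda, b[h:], ldb, c[h:], ldc, depth, maxDepth)
	default: // split the shared dimension; both halves update C, so run serially
		h := k / 2
		mul(m, n, h, a, lda, b, ldb, c, ldc, depth, maxDepth)
		mul(m, n, k-h, a[h:], lda, b[h*ldb:], ldb, c, ldc, depth, maxDepth)
	}
}

func main() {
	const n = 4
	a := make([]float64, n*n)
	b := make([]float64, n*n)
	for i := range a {
		a[i] = float64(i % 7)
		b[i] = float64((i * 3) % 5)
	}
	got := make([]float64, n*n)
	mul(n, n, n, a, n, b, n, got, n, 0, 2)

	want := make([]float64, n*n) // naive reference for comparison
	for i := 0; i < n; i++ {
		for p := 0; p < n; p++ {
			for j := 0; j < n; j++ {
				want[i*n+j] += a[i*n+p] * b[p*n+j]
			}
		}
	}
	fmt.Println(fmt.Sprint(got) == fmt.Sprint(want)) // true
}
```

With maxDepth = 0 this degenerates to a purely serial cache-oblivious multiply, which matches the incremental plan discussed above: get the sequential version right first, then bound the goroutine count by depth.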
Closing because this repository is no longer maintained. Development has moved to https://github.com/gonum/gonum.
Successor issues in the new repository:
gonum/gonum#184
gonum/gonum#185