Using Boost.Compute in clSPARSE
jpola opened this issue · 43 comments
From my experience in GPU based sparse algorithms there is need to use many advanced parallel algorithms (I usually call them parallel primitives) like reduce by key, sort by key, gather, scatter, scatter_if and other. As you saw my pull #135 brings some of them (initially provided by @shuaiche).
What is your opinion about introducing Bolt library in clSPARSE and use more mature version of its algorithms instead of rewriting them from Bolt? I see that introduction of Bolt have drawbacks but it can save us a lot of time.
I'm considering your question/request. Normally, I would say developers can get frustrated with too many dependencies in a shared library (issue #23 are my ramblings where I'm justifying boost to be an exception). However, with our superbuild infrastructure the extra dependency might not be so bad to manage for our users (I'm assuming we want to stay away from git-submodules). I only hesitate, because I don't yet have a lot of feedback from end users that they like and use our superbuild, but at the same time i don't want to throw away a productivity or performance boost. Perhaps both projects would benefit with clsparse being a client of Bolt(e.g. we could update device_vector class to support OpenCL 2.0 in Bolt). Btw, Bolt definitely brings in a dependency on Boost, as Bolt uses Boost to help it implement proper iterators, which brings our conversation full circle back to #23 ♻️ .
I would love @kvaragan, @shuaiche, @jlgreathouse to weigh in with their opinions.
Ah yes, I forgot about the Boost dependency in BOLT. I remember that we once had a big issue with the Boost.
Let's see what others will say.
Could you share the big issue you had? Would be nice to know what the 'real-world' problems people have, rather than just my experience on my personal development box.
I only hesitate, because I don't yet have a lot of feedback from end users that they like and use our superbuild
I understand, but do you have any feedback? What is their opinion about the project? In addition I think that the clSPARSE project is not yet very attractive. The one fancy thing we have is csrmv adaptive algorithm. We must provide more usefull algorithms to get users attention.
Could you share the big issue you had? Would be nice to know what the 'real-world' problems people have, rather than just my experience on my personal development box.
It's not an issue, I just know what are the requirements for algorithms like sorting matrix by row, AMG coaresing or performing transposition of matrix in CSR format. I will bring some real examples later.
I have not received significant feedback yet. Most people say they will try it out, but they can only get to it later. It's probably expecting too much for people to immediately evaluate it.
It's not an issue, I just know what are the requirements for algorithms like sorting matrix by row, AMG coaresing or performing transposition of matrix in CSR format.
I only meant specifically with regard to using Boost. What 'real world' problems does using boost bring to users?
If HSA Bolt is being discussed, I want to throw in Boost Compute in there. From what I am seeing Boost Compute is being more actively developed than HSA Bolt. Compute also has the benefit of being a header only library that can be pulled down during build time.
Could you share the big issue you had? Would be nice to know what the 'real-world' problems people have, rather than just my experience on my personal development box.
As far as I remember Bolt library has designated us a version of Boost, which had to use.
Boost compute is an interesting alternative. In my opinion the feature that this is a header only library is very attractive. The question is:
Does it have all necessary algorithms.
What is the performance in comparison to Bolt.
I saw that this library already supports OpenCL 2.0, at least headers says that.
From my experiments they are pretty close to each other performance and feature wise.
Here are some results from boost compute http://boostorg.github.io/compute/boost_compute/performance.html
Looking at the Boost Compute vector class I don't see simple way to create vector object from preallocated buffer. This concept we are using very often. I know that Bolt have such capability.
@jpola This is what we are doing to achieve that:
https://github.com/arrayfire/arrayfire/blob/devel/src/backend/opencl/kernel/sort.hpp#L51
For reference (*val.data)()
returns a cl_mem
.
Ok, thanks I get it.
2015-09-02 19:46 GMT+02:00 Pavan Yalamanchili notifications@github.com:
@jpola https://github.com/jpola This is what we are doing to achieve
that:https://github.com/arrayfire/arrayfire/blob/devel/src/backend/opencl/kernel/sort.hpp#L51
For reference (*val.data)() returns a cl_mem.
—
Reply to this email directly or view it on GitHub
#138 (comment)
.
Thanks for your feedback, @pavanky . You are right; if Bolt is up for consideration, then Boost.Compute should be too. It's not actually a part of Boost yet (as of 1.59), so it should count as a separate dependency from boost itself (for now), which requires boost as a dependency (same as Bolt).
@jpola Regarding boost.compute, I think that I would love to see an example of its use inside of clsparse. Give us a chance to evaluate how it cleans up the code, and see if there is a performance impact. I'm especially interested in seeing how their vector class could be used (possibly replacing our own). Do you think that the CG solver routine would be a good use case, or is there something even simpler we could could use?
Do you have time to hack a demo routine (in a fork)?
Ok,
I will produce several exemples.
2015-09-08 7:36 GMT+02:00 Kent Knox notifications@github.com:
@jpola https://github.com/jpola Regarding boost.compute, I think that I
would love to see an example of its use inside of clsparse. Give us a
chance to evaluate how it cleans up the code, and see if there is a
performance impact. I'm especially interested in seeing how their vector
class could be used (possibly replacing our own). Do you think that the CG
solver routine would be a good use case, or is there something even simpler
we could could use?Do you have time to hack a demo routine (in a fork)?
—
Reply to this email directly or view it on GitHub
#138 (comment)
.
@jpola I would be most interested to know, if this would allow us to be rid of the contentious cl.hpp file. From my brief look, I don't think it will (no reference counting on command_queues or events), but I am interested to know if it does provide some functionality of the C++ interfaces.
Status: For several days I'm intensively testing boost/compute library. My tests revealed several bugs which were reported on the project page. When they will be fixed I would like to present you my research.
What I found untill now is:
- Blas 1 operations seems to be slightly faster using boost/compute
- Our code can be greatly reduced, we can get rid of reduce, scan, sort, operations etc. Wee need to just take care of the core sparse functions like SPMV SPMdM etc. Rest will be managed by boost/compute. In addition if we will decide to support CPUs it will be easier for us to focus on our functions.
- It is very easy to work with this library. No need to compile, I connected boost/compute to the our CMake in 10 minutes. It is clean and easy to understand.
- It will definitely help me with developing more advanced codes thanks to 'fancy iterators' (zip_iterator, counting_iterator)
- It seems to support OpenCL 2.0
- It have a nice functionality to manage platforms and devices which will be usefull for benchmarking or testing i.e:
compute::detail::is_amd_device(device)
- boost/compute is almost ready to be included as an official boost component.
- Many others I don't have currently in mind...
Cons: IDE is going crazy often when resolving functions names (ctrl + click on function)
@jpola This is great research. We are most definitely interested in being able to support CPU devices, which would allow us to be able to run basic unit smoke tests per-commit in Travis or Appveyor.
With regard to compute::detail::is_amd_device()
, I think we want to get away from platform checks, re:this complaint. We should only test/check devices for necessary features, such as 'double precision'.
When you say boost.compute supports cl2.0, do you mean the vector class allocates memory with clSVMalloc(), or is it using other features like device enqueue?
Is the IDE going crazy only on boost functions, or does it mess with functions in the clspare project? Does it only mess up QT creator, or is it also a problem with visual studio?
If we work on our multiply routines: spmv, spmdm, spmm, and algorithms traversing csr matrix in similar way it will be quite easy to support CPU devices if we rely on boost.
compute::detail::is_amd_device is just an example, with other functions we can query for many other features like appropriate extension very easy.
When It comes to cl2.0, they have some functionality but I don't think that it is using svm as allocator for containers, unfortunately.
I'm not developing under VS. I move to the Windows only when something is not working there, and I did not check how it behaves. The clsparse functions are resolved little bit slower but for boost/compute functions I have to wait like 1 or 4 seconds to be transferred to proper file. It is really frustrating sometimes!
If we are interested in future research I can move whole project to the boost in about two weeks. Then you will have full scope how it will look like. Tests and benchmarks will remain untouched, I will take care only what is under the hood of clSPARSE eliminating redundant files, kernels, containers etc.
I like the idea of relying on another library for these (usually dense) primitives. The one thing we may lose, however, is the ability to add the EXTENDED_PRECISION mechanism (compensated arithmetic) throughout the rest of our library.
This doesn't matter as much for e.g. sort operations. However, scan, reduce, and other operations that perform arithmetic may benefit from this technique. IMO, this is a feature that other libraries do not have, and may be a good selling point. By relying on a library out of our control, we lose the ability to add this feature throughout clSPARSE.
I suppose one way to handle this would be to rely on Boost in the normal case and fall back to custom implementations (as we have now) if we want to go through the process of adding EXTENDED_PRECISION to other parts of the library besides csrmv.
Those is an excellent point @jlgreathouse , and something to consider. @jpola , do you have a fork/branch with example code we could look over? Something that we can see the 'before' & 'after' boost.compute refactoring?
I agree @jlgreathouse, extended precision is a big benefit of clSPARSE. I have branch here: https://github.com/jpola/clSPARSE/tree/intro_boost_compute
search for "compute::" to find places where boost compute is present. New functionality I've added is transposition of csr matrix. In several places I reimplemented our algorithms or added new version with suffix "2". It is not well organized since I was just testing boost.
I'm waiting for your comments.
Hi @jpola
My overall impression is that I really like how boost.compute simplifies the library code. Zip-iterators are cool.... Using boost.compute should significantly increase library maintainability, and the fact that you were able to implement csr transpose so fast is significant (as long as performance is good, which all indications say is good).
Re: @jlgreathouse feedback about the ability to customize kernels to do additional functionality like our EXTENDED_PRECISION. I thought about it over the weekend, and while i think it is something to keep in mind, we have not yet investigated to determine if reductions or scans would ever need EXTENDED_PRECISION. Possibly, all the real-world vectors the library reduces end up being small. Also, perhaps (i have not tried) there is a clever way to do compensated arithmetic on reductions with zip iterators (to add running summation) and transform_reduce()
.
My code review items:
- For axpy, is alpha on host a limitation of boost.compute or is there a way around the most requirement? Is there an advantage to Compute Function over Compote Closure? I don't see you using the alpha parameter in the function version.
- For Nrm2, can sqrt be called as a transform() on a scalar value?
- Is this comment stale with what Pavan wrote above?
- Does alpha have to be on host for scale? Seems like that could be optimized out?
- Seems like we are having both functional and performance problems with uBLAS. I am starting to have doubts we should be using it.
- Is there any need for clsparse to export our L1 dense functions? Are they all redundant with the functionality in boost.compute, i.e. can we deprecate all of them? I consider reducing our API surface area to be equivalent to reducing maintenance costs and good code hygiene.
For axpy, is alpha on host a limitation of boost.compute or is there a way around the most requirement?
For now I don't see any solution for keeping alpha on device. My natural approach was to create constant iterator from the value from the buffer, then zip x, y and alpha (as constant iterator) and use COMPUTE FUNCTION. Unfortunatelly boost compute can't create constant iterator form device value. At least for now. There might be some other possibilities with usiong function iterators + lambda functionality.
Is there an advantage to Compute Function over Compote Closure? I don't see you using the alpha parameter in the function version.
I used compute closure because it seems that there is bug in amd opencl compiler please see here:
This bug does not allow me to use compute function in the format I need (transform + zip iterator + constant iterator) on AMD device. Therefore I moved to compute closure, however I don't know how the compute closure works in details. My worries are that with any change of the alpha value the new version of kernel is created. I've asked here: https://groups.google.com/forum/?hl=pl#!topic/boost-compute/EEJe2nx9EX8 but nobody answered yet.
I don't see you using the alpha parameter in the function version.
Alpha should be the third parameter of the tuple passed to function transforming the values into oputput, alpha should be passed as constant iterator. Correct implementation should be as follows:
BOOST_COMPUTE_FUNCTION(float, function_axpy, (boost::tuple<float, float, float> x),
{
return boost_tuple_get(t, 1) + boost_tuple_get(t, 0) * boost_tuple_get(t, 2);
});
For Nrm2, can sqrt be called as a transform() on a scalar value?
It can be as follows, I just didn't know what would be faster copy or call kernel on single value:
compute::transform( compute::make_buffer_iterator(sBuffer), //begin
compute::make_buffer_iterator(sBuffer), //end
compute::make_buffer_iterator(sBuffer), // output begin
compute::sqrt<T>(), //functor
b_queue);
Is this comment stale with what Pavan wrote above?
Pavan mention about compute::make_buffer_iterator which allows to use buffer variables as iterators
By that i just rather wanted to have something like this
try {
compute::vector<T> sBuffer (s->value, s->num_values);
compute::vector<T> xBuffer (x->values, x->num_values);
compute::command_queue b_queue(control->queue());
compute::reduce(xBuffer.begin(), xBuffer.end(),
sBuffer.begin(),
b_queue);
//we should enable exceptions to get something meaningfull here especially from OpenCL
} catch (...)
{
std::cerr << "Boost compute failed with reduce operation" << std::endl;
return clsparseInvalidValue;
}
I was just complaining that code would look more elegant comared to that where I have to use compute::make_buffer_iterator, but I think we could write some macro which may shorten the long description.
Does alpha have to be on host for scale? Seems like that could be optimized out?
As I wrote in the comment for this code in our current implementation we are also mapping alpha to check its value and call fill buffer or actual scale kernel, that why I allowed myself to do similarly here. Of course there should be a branch to call fill buffer or call scale kernel. I just didn't want to have if statement for alpha in the kernel.
Seems like we are having both functional and performance problems with uBLAS. I am starting to have doubts we should be using it.
Yep, for now it is giving me more troubles than helps. I spend a day fighting with uBLAS checking WTF is wrong with my code, but there are some implementation subtleness which may not have infuluence on the code which is fully written in uBLAS but there isn't also any good documentation except stackoverflow. I would check the EIgen instead of uBLAS.
Is there any need for clsparse to export our L1 dense functions?
I as a user would like to have them. I know that they are silly to implement but I can imagine that user which will use our library intensively should expect that such basic functionality will be present to build more advanced algorithms.
Here's another question for you:
Can we safely drop our dependence on cl.hpp? I see that most of your uses on cl.hpp are for calling get_info<> on various cl.hpp classes. It seems like boost.compute also provides functionality for this, but it doesn't appear to have everything. For instance, you have used CL_KERNEL_ARG_ADDRESS_QUALIFIER and CL_KERNEL_ARG_ADDRESS_LOCAL which is likely missing from boost.compute. You only use these from within an assert, so I think we can 'safely' drop those calls. This could increase our library portability, because various platforms and SDK's do not ship cl.hpp, and it appears that in the future cl.hpp -> cl2.hpp.
Would you agree?
OK, I agree.
2015-09-21 20:40 GMT+02:00 Kent Knox notifications@github.com:
Here's another question for you:
Can we safely drop our dependence on cl.hpp? I see that most of your uses
on cl.hpp are for calling get_info<> on various cl.hpp classes. It seems
like boost.compute also provides functionality for this
https://github.com/boostorg/compute/blob/master/include/boost/compute/device.hpp#L466,
but it doesn't appear to have everything. For instance, you have used
CL_KERNEL_ARG_ADDRESS_QUALIFIER and CL_KERNEL_ARG_ADDRESS_LOCAL which is
likely missing from boost.compute. You only use these from within an
assert, so I think we can 'safely' drop those calls. This could increase
our library portability, because various platforms and SDK's do not ship
cl.hpp, and it appears that in the future cl.hpp -> cl2.hpp.Would you agree?
—
Reply to this email directly or view it on GitHub
#138 (comment)
.
I think our biggest problem is that reductions have to be written to host. How badly would this affect the performance of your solvers? My gut says a lot; I think it might make sense to ask their mailing list if there is a workaround. Maybe the right answer is to not use boost.compute for the solver routines.
Of course there should be a branch to call fill buffer or call scale kernel. I just didn't want to have if statement for alpha in the kernel.
I may not have understand your statement, but why check alpha at all? I assume that multiplying by 0 is the same perf as filling 0's (your original algorithm would not read alpha to host at all, or transfer a pattern to device). It seems to me that boost.compute does have a limitation here (if constant iterators have to live on host).
I as a user would like to have them
We should keep them if they make sense as building blocks. Just want to explicitly ask the question, that our functions are not redundant with what users can get themselves with boost.compute.
I think our biggest problem is that reductions have to be written to host. How badly would this affect the performance of your solvers? My gut says a lot; I think it might make sense to ask their mailing list if there is a workaround. Maybe the right answer is to not use boost.compute for the solver routines.
Yeah, that is true. Give me several days, to see if maybe I can find something worth trying.
I may not have understand your statement, but why check alpha at all? I assume that multiplying by 0 is the same perf as filling 0's (your original algorithm would not read alpha to host at all, or transfer a pattern to device). It seems to me that boost.compute does have a limitation here (if constant iterators have to live on host).
This arise from here: https://github.com/clMathLibraries/clSPARSE/blob/develop/src/library/blas1/cldense-axpby.cpp#L42
we had a little chat whether I should call distinct function depending on the value of parameters, unfortunately I can't find it.
I think our biggest problem is that reductions have to be written to host. How badly would this affect the performance of your solvers? My gut says a lot; I think it might make sense to ask their mailing list if there is a workaround. Maybe the right answer is to not use boost.compute for the solver routines.
Silly me, there is no issue at all. Functions compute::transform and compute::reduce generates the results directly to device. There is no impact on the solver performance.
Am I looking at the right reduce? The example shows sum as a host variable; maybe that's tripping me up. Can the output iterator also work with compute::make_buffer_iterator
?
Yes, the output iterator can be a compute::buffer.
2015-09-22 0:26 GMT+02:00 Kent Knox notifications@github.com:
Am I looking at the right reduce
http://boostorg.github.io/compute/boost/compute/reduce.html? The
example shows sum as a host variable; maybe that's tripping me up. Can the
output iterator also work with compute::make_buffer_iterator?—
Reply to this email directly or view it on GitHub
#138 (comment)
.
👍
I think I'm convinced that using boost.compute would be an overall win for the library, given we introduce our first dependency in the library code proper.
I'm willing to listen to other opinions: @jlgreathouse @kvaragan
One more comment. I had to use develop version because there was
sort_by_key algorithm which was not yet present in the master branch.
2015-09-22 0:39 GMT+02:00 Kent Knox notifications@github.com:
[image: 👍]
I think I'm convinced that using boost.compute would be an overall win for
the library, given we introduce our first dependency in the library code
proper.I'm willing to listen to other opinions: @jlgreathouse
https://github.com/jlgreathouse @kvaragan https://github.com/kvaragan—
Reply to this email directly or view it on GitHub
#138 (comment)
.
@jpola I have no problems with using the develop branch (easily controllable by cmake). I assume that will be the version that would eventually be distributed in Boost.
Introducing external library dependencies will have their own advantages/disadvantages, but it boils down to whether customer of clSPARSE will be comfortable with such changes. We will be better able to decide if we can get any such customer feedback.
@kvaragan I can't talk for everyone but from ArrayFire's perspective going with Boost.Compute is probably better than re-implementing it from scratch. The only down side can be if the version available to the general public is different than the version being used by clSPARSE.
I think using the develop branch while clSPARSE is pre-release is OK as long as the final version can work with the latest release.
I did think about this some, and I believe even the version used by clsparse should not make much of a difference to users since it is a header only library.
@kknox This will not make much of a difference to a user who is building from source. But what we have realized is that we need to be a little bit more strict with the dependencies when the library is being packaged for various Linux distributions.
@pavanky I suppose there could be a possibility of symbol conflicts for users who statically link to clsparse.a, which might have used a different boost.compute version than their own. I believe dynamically linking to clsparse.so would protect against that; is that your experience?
package maintainers they prefer to use the version of a package already present in their repos
Our policy is to always link dependencies dynamically. If a dependency is not available in the archive, it has to be packaged first. For instance, the packaging of the OpenCL backend of ArrayFire has been delayed, since I had to get clBLAS
, clFFT
and Boost.Compute
in the archive first.
As long as clSPARSE can be built with the system's Boost.Compute
instead of a vendorized / downloaded version, it should be fine from a packaging standpoint. If a development version is needed due to an upstream bug (like the sort_by_key mentioned by @jpola), I am happy to update the packaged version with the corresponding upstream patch.