Integration with tiny-cnn
edgarriba opened this issue · 100 comments
I'm opening this ticket to discuss ideas for integrating libdnn into tiny-cnn.
Currently, I implemented a small interface to get native OpenCL context from tiny-cnn:
https://github.com/edgarriba/tiny-cnn/blob/f4d9e1d4f45ad8ac46824b5c007f5507fe8925eb/tiny_cnn/core/session.h
Things I think are needed:
- Implement a module for data transfer between devices
- Discuss the shape of a simplified libdnn interface, if one is needed.
BTW, @naibaf7 @hughperkins note that we are planning to migrate tiny-cnn to an organization account and rename the library itself, since it's now more a general DNN lib than just a CNN lib. Maybe you are interested in getting more involved in the development. tiny-dnn/tiny-dnn#235
@edgarriba
Ok, I'll look into it. What is your idea about the data transfer and simplified interface?
@naibaf7
For memory syncing I was thinking of an approach similar to TensorFlow's. What do you think is the best way to start?
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/copy_tensor.cc#L46
Regarding the simplified interface, I just want to make clear what types of data the libdnn interfaces will need.
@naibaf7 What do you think of this design?
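To make the data-transfer idea a bit more concrete, here is a minimal host-side sketch of the kind of copy dispatch that TensorFlow's copy_tensor.cc does on (source, destination) device pairs; none of these names exist in tiny-cnn, they are purely illustrative:

#include <cstddef>
#include <cstring>
#include <stdexcept>

// Purely illustrative: dispatch a copy on the (source, destination) device pair,
// as TensorFlow's copy_tensor.cc does; the real tiny-cnn types will differ.
enum class device_t { host, opencl };

struct buffer_view {
  void*    data;    // host pointer or backend memory handle
  size_t   bytes;
  device_t device;
};

void copy(const buffer_view& src, buffer_view& dst) {
  if (src.device == device_t::host && dst.device == device_t::host) {
    std::memcpy(dst.data, src.data, src.bytes);  // plain host-to-host copy
  } else {
    // host<->device and device<->device copies would go through the OpenCL
    // context held by the session (e.g. clEnqueueWriteBuffer/ReadBuffer).
    throw std::runtime_error("device transfer not implemented in this sketch");
  }
}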
@bhack
Not bad, but essentially no different from functions already existing in either ViennaCL or Caffe code.
It's really unfortunate that tiny-cnn can't handle any GPU related management itself at the moment, as both cuDNN and libDNN expect a proper GPU & memory initialization beforehand (and I oriented the libDNN interface closely to the cuDNN one).
What do you want our responsibilities to be? Device, memory and context? Kernel launch?
@bhack
If you want compatibility with typical GPU libraries, then optimally device, memory and context. Kernel launch is handled enclosed in both cuDNN and libDNN, because this needs tricky parameter selection.
But if you can wait 2 days, I will update libDNN with simplifications for the device, memory and context part (basically copied over from Caffe). Basically just wrappers for memory allocation, device listing & initialization & memory transfer around CUDA and ViennaCL.
We have no problem handling device, memory and context. If you think it would be useful to have these features here, OK; if not, we will implement them in tiny. We need to think of the largest audience for this library. So if you think callers generally benefit from handling context, device and memory themselves for other kinds of operations, it is fine for us to implement that in tiny and go ahead with porting coverage of the other kernels here.
@naibaf7
I leave here the error report I get: http://pastebin.com/yv6rmu21
@edgarriba
What steps do you use to test the integration (just so I can replicate)...?
The error looks familiar from early OpenCL Caffe tests, so it should be something I can fix.
with this branch: https://github.com/edgarriba/tiny-cnn/tree/libdnn
cmake -DBUILD_TESTS=ON -DUSE_OPENCL=ON -DUSE_LIBDNN=ON -DUSE_TBB=ON ..
make && ./test/tiny_cnn_test
important routines:
https://github.com/edgarriba/tiny-cnn/blob/libdnn/tiny_cnn/core/backend_dnn.h#L76
https://github.com/edgarriba/tiny-cnn/blob/libdnn/tiny_cnn/core/kernels/libdnn_conv2d_kernel.h#L54
@edgarriba
Ok I gained a good understanding of what you are trying to do and what issues the code has. I filed a pull request on your tinycnn/libdnn branch which fixes some of the issues and explains some details of the problems that still exist.
If you need more assistance in fixing these, or want to discuss it in more detail, we can schedule a Skype conversation with screen sharing and go over the code that way.
By the way, random observation, not sure if this is useful or not, but if you use the clew abstraction layer, then you can build for OpenCL without libOpenCL.so being present, and you can even run your program without libOpenCL.so being present. Your program can make a runtime decision about whether to try binding with OpenCL or not. Actually, I think it can even attempt to call clewInit(), and thus detect whether libOpenCL.so is present, and deal with it accordingly. (Not sure about this last sentence; it would be easy to fix if it's not actually the case.)
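For reference, a minimal sketch of that runtime decision (assuming, as the clew headers suggest, that clewInit() returns CLEW_SUCCESS on success):

#include <cstdio>
#include "clew.h"  // https://github.com/hughperkins/clew

// Try to bind OpenCL at runtime; fall back to a CPU path if libOpenCL.so is absent.
bool opencl_available() {
  return clewInit() == CLEW_SUCCESS;  // assumption: CLEW_SUCCESS (0) signals success
}

int main() {
  if (opencl_available()) {
    std::printf("OpenCL runtime found, using the GPU backend\n");
  } else {
    std::printf("no libOpenCL.so found, falling back to the CPU backend\n");
  }
  return 0;
}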
@hughperkins Intriguing!
Does that also work with CUDA?
Because it would be really useful if there could be a universal binary that has no fixed dependencies beyond C++11 :)
What about multi-location search for libOpenCL.so and nvrtc.so/cuda.so? Can it figure out these different locations?
Does that also work with CUDA?
No
Because it would be really useful if there could be a universal binary that has no fixed dependencies beyond C++11 :)
:-) Well, I guess someone could create something analogous for CUDA. Maybe... you? :-)
What about multi-location search for libOpenCL.so and nvrtc.so/cuda.so? Can it figure out these different locations?
For libOpenCL.so, it searches in a couple of hard-coded locations:
https://github.com/hughperkins/clew/blob/master/src/clew.c#L180-L184
module = CLEW_DYNLIB_OPEN("OpenCL.dll");
if(module == 0) module = CLEW_DYNLIB_OPEN("/Library/Frameworks/OpenCL.framework/OpenCL");
if(module == 0) module = CLEW_DYNLIB_OPEN("libOpenCL.so");
if(module == 0) module = CLEW_DYNLIB_OPEN("libOpenCL.so.1");
if(module == 0) module = CLEW_DYNLIB_OPEN("/usr/lib/libOpenCL.so");
if(module == 0) module = CLEW_DYNLIB_OPEN("/usr/lib/libOpenCL.so.1");
You can see that there is no reason why this couldn't be extended arbitrarily and/or read from some config file. But I think this covers almost all, or perhaps all, of the cases I've seen.
@hughperkins
Yeah, I see...
Yes, maybe you could add possible locations for SDKs, such as AMD's APP SDK libOpenCL.so and nVidia's libOpenCL.so within CUDA :)
This also works fine with the OpenCL ICD loader I assume?
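Such SDK locations could be covered by simply extending the list quoted above with more candidates, e.g. (suggested lines only, with typical install paths; these are not in clew upstream):

if(module == 0) module = CLEW_DYNLIB_OPEN("/opt/AMDAPP/lib/x86_64/libOpenCL.so");  /* AMD APP SDK */
if(module == 0) module = CLEW_DYNLIB_OPEN("/usr/local/cuda/lib64/libOpenCL.so");   /* CUDA toolkit */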
Wow, you are just a mine of useful information, bhack. It's amazing. Are you an AI? :-P
@bhack @hughperkins
More importantly, why does bhack know that much and take all the time to respond? :) amazing...
but yeah oh well, cuew and clew are two more items to integrate into libDNN then... endless list of work :)
This also works fine with the OpenCL ICD loader I assume?
yes. the sequence is:
clew => loads libOpenCL.so => reads the ICD => loads vendor drivers => loads device information, etc.
@edgarriba Can you post the UML rendered image of the integration proposal?
I don't think that Caffe is the right testbed for libdnn, because libdnn was in some sense designed around Caffe. So if you want to give some feedback on Edgar's design...
feedback is welcomed!
Another interesting roadmap to monitor is https://phabricator.pmoreau.org/w/mesa/opencl_through_spirv_current_status/
@CNugteren I think that you could be interested in the last messages.
As you can see, we have started to use CLCudaAPI for Tiny, but we are also integrating libdnn for the convolutional kernels. @CNugteren said he was interested in contributing to libdnn. Through this GSoC integration experiment I see a little bit of feature duplication: libdnn maintains its own tuning class while Cedric has CLTune; libdnn uses ViennaCL as a helper for OpenCL/CUDA while Cedric has CLCudaAPI; libdnn uses some algebra from ViennaCL and Cedric from CLBlast. Is there a bit of duplicated effort?
@bhack
Yes there certainly is a little bit of duplication, but I don't think that's an issue.
Inside Caffe I even use the duplication as an advantage and offer clBLAS, CLBlast and ViennaCL for the linear algebra part.
It's just that, especially with the algebra, different libraries perform well or badly on different devices.
Couldn't we share the tuning component? Also, I think CLCudaAPI is neutral enough to use both ViennaCL and CLBlast algebra.
@bhack
No, the way the code generation works in libDNN is too different from the CLTune approach, I think.
Yes, when you use CLCudaAPI you can also make use of multiple BLAS libraries, of course.
I think naibaf7 is right: libDNN generates its OpenCL kernels from C++ code, whereas CLTune assumes that you already wrote a complete and functioning OpenCL kernel with tuning parameters exposed through the pre-processor (#define and #if - #else - #endif). Re-writing libDNN in this way is of course possible, but if the tuning is already working as it is now, why bother?
Using CLCudaAPI is orthogonal to using CLTune of course. It is just a way to make using OpenCL from C++ a lot friendlier, with the added bonus of making porting to CUDA as easy as changing a header.
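For illustration, this is roughly what a CLTune-style kernel looks like: a complete OpenCL kernel whose tuning parameters are exposed through the pre-processor, so the tuner only recompiles it with different -D values. This is a toy sketch, not an actual CLBlast or libDNN kernel, and it assumes n is a multiple of VECTOR_WIDTH:

// Toy vector-add kernel with one tuning parameter (VECTOR_WIDTH), CLTune-style.
#ifndef VECTOR_WIDTH
  #define VECTOR_WIDTH 1  // the tuner compiles variants with -DVECTOR_WIDTH=1,2,4,...
#endif

#if VECTOR_WIDTH == 1
  typedef float floatX;
#elif VECTOR_WIDTH == 2
  typedef float2 floatX;
#elif VECTOR_WIDTH == 4
  typedef float4 floatX;
#endif

__kernel void vector_add(const int n,
                         __global const floatX* x,
                         __global const floatX* y,
                         __global floatX* z) {
  const int i = get_global_id(0);
  if (i < n / VECTOR_WIDTH) {
    z[i] = x[i] + y[i];  // each work-item handles VECTOR_WIDTH elements
  }
}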
OK, so I don't understand what your possible contribution to or interest in libdnn would be. Also, your kernels will not work with the tuning utility in libdnn.
What do you mean with "your kernel"?
If I remember correctly my original proposal was to extend CLBlast to provide any necessary (supporting) kernels for libDNN. This is what I said back then:
I am also willing to contribute on the kernel development and tuning. In particular if there is something that would be fit as an extension to my BLAS library such as a batched GEMM or some other special version of matrix-multiplication.
So let me know if something like that is required. And I can also support the project from a CLCudaAPI perspective if you decide to use it.
We are already using CLCudaAPI in tiny-cnn. What I still don't understand is how the internal libdnn tuning class and CLBlast (which instead uses CLTune) could work together if the tuning approaches are so different.
Perhaps we are not on the same page, I don't fully understand your comments. A normal library call is an option from libDNN, or not?
Most of the CNN-related functionality is quite close to BLAS, so I guess it would be useful to rely on a BLAS library for things like GEMV (fully-connected layer?) or GEMM (convolution layer?). Of course, in many cases you can do better than that, and that's why libDNN has its own kernels as well. But if I can provide a kernel in CLBlast that is quite BLAS-y, then I would be happy to implement it and libDNN (and others) can then use it. For example, I know that cuBLAS has a batched-GEMM implementation, which is used for CNNs.
@CNugteren
Yes that's spot-on. Intel MKL also has such operations: https://software.intel.com/en-us/articles/introducing-batch-gemm-operations
I found that for fully-connected (inner product) layers, it is still the best option to just use GEMM. Do you know how these layers/operations are working in-detail?
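For reference, the semantics of a batched GEMM are just many independent small GEMMs sharing one launch; a naive row-major reference (alpha = 1, beta = 0, no transposes), not an optimized GPU kernel and not the cuBLAS/CLBlast API:

#include <cstddef>
#include <vector>

// Reference-only batched GEMM: C[b] = A[b] * B[b] for each batch entry b,
// with A[b] of size M x K, B[b] of size K x N, C[b] of size M x N (row-major).
void batched_gemm(const std::vector<const float*>& A,
                  const std::vector<const float*>& B,
                  const std::vector<float*>& C,
                  size_t M, size_t N, size_t K) {
  for (size_t b = 0; b < A.size(); ++b) {
    for (size_t m = 0; m < M; ++m) {
      for (size_t n = 0; n < N; ++n) {
        float acc = 0.0f;
        for (size_t k = 0; k < K; ++k) {
          acc += A[b][m * K + k] * B[b][k * N + n];
        }
        C[b][m * N + n] = acc;
      }
    }
  }
}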
As we have discussed, it seems that @naibaf7 would be confined to convolution here. CLBlast is already used upstream in the Caffe OpenCL branch, and libdnn seems quite oriented towards autotuning. Does it make sense to use another, external autotuning framework to tune and use CLBlast kernels?
For batched GEMM on the GPU I suggest you read "Performance, Design, and Autotuning of Batched GEMM for GPUs".
Also... can we really match Winograd performance on convolutions with the most common kernel and batch sizes? See the latest benchmarks.
@bhack
For now we can't match assembly-written hand-optimized kernels. Most devices should land on cuDNN V2-V3 performance with the current kernels, relative to their TFLOP peak performance.
That means the performance is well above cuBLAS and clBLAS based im2col/col2im implementations.
But as you know, for example @hughperkins is trying to reimplement winograd style kernels in OpenCL (however not sure where that progress is at the moment).
I'm currently experimenting with tuning options to get higher performance as well as possible FP16 and INT8 implementations.
We already have 8-bit quantization in Tiny with @wangyida; we are currently using ARM NEON gemmlowp kernels. We could try to plug in your kernels as well when they're available.
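For context, gemmlowp-style 8-bit kernels work on an affine quantization of the float tensors; a generic sketch of that mapping (not the tiny-dnn implementation):

#include <algorithm>
#include <cmath>
#include <cstdint>

// Affine 8-bit quantization: real = scale * (q - zero_point).
inline uint8_t quantize(float x, float scale, int32_t zero_point) {
  const int32_t q = static_cast<int32_t>(std::round(x / scale)) + zero_point;
  return static_cast<uint8_t>(std::min<int32_t>(255, std::max<int32_t>(0, q)));
}

inline float dequantize(uint8_t q, float scale, int32_t zero_point) {
  return scale * static_cast<float>(static_cast<int32_t>(q) - zero_point);
}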
Is Winograd less tunable as an algorithm?
@bhack
No, that could also be tuned, but it's much more sensitive to implement; the compilers just don't get it right, it pretty much needs to be written in assembly for good performance.
But as you know, for example @hughperkins is trying to reimplement winograd style kernels in OpenCL (however not sure where that progress is at the moment).
Progress is not really happening. libdnn having cuDNN v2 performance sounds pretty good. I might just throw libdnn into DeepCL and cltorch, and call it a day.
@hughperkins
Oh okay, is it due to time limitations or did any major roadblock show up?
Oh okay, is it due to time limitations or did any major roadblock show up?
Well, it's pretty hard, and involves a lot of preparatory learning and experimentation. Meanwhile, it's not very aligned with my day job right now, and my day job has some quite interesting challenges in it, which are tentatively enticing my attention :-)
I'm unclear how much effort is involved. It's not impossible that someone who knows what they are doing might just swap a few lines around and drop the current execution time by an order of magnitude or two. I think that in my case, I'd first have to spend a loooottttt of time trying to learn to become 'someone who knows what they are doing' :-)
This seems to be interesting https://github.com/andravin/wincnn
I'm really interested now in how these fastest Neon/Nervana Nvidia assembly kernels will fit in at Intel. /cc @scott-gray @garybradski @vpisarev
I've moved on to OpenAI. I'll mainly be supporting TensorFlow now, but perhaps when Nervana/Intel finishes their graph backend I'll support that as well. They have a great team working on it.
@scott-gray More good news, IMHO :) I hope we can reuse some of your work in our new graph backend in Tiny/OpenCV and on the fast kernels here (if your activity will not be so much related to Nvidia assembly :))
@naibaf7 Do we have v2 kernels available? BVLC/caffe@33702c6
@bhack
Yes, working on better kernels for the W9100 and RX480. While the old ones were in the memory interface bottleneck (mem load/write stalls), the new kernels are slightly faster and in a VGPRs / VALU bottleneck at 40% wavefronts. The cache efficiency is 85% on AMD GPUs with these new kernels.
The remaining problem is that the LDS storage on AMD GPUs is too small to support larger local memory GEMM blocks (currently 64x64) and the VGPRs are too few to support larger register memory GEMM blocks (currently 4x4 = 16 MAD FLOPS/local load cycle and 8x4x4 = 128 MAD FLOPS/global load cycle (32 and 256 respectively, counting them as full FLOPS)). That's enough to hide memory latency, but the offset computations for arbitrary convolutions are too expensive to get a good FLOPS:IPS ratio. So while this could reach 4.5 TFLOPS on a W9100 in GEMM applications (that is, between 8 and 16 FLOPS/byte, depending on the size of K (slim vs. wide matrix), exactly the computational-intensity sweet spot), it can only do 1.5 to 2 TFLOPS in convolutions (intense integer operations for offset computation).
But the memory interface is only used up to 20% now, which means I can try out new techniques such as compressed-LUT-based algorithms, which need a bit of additional GPU memory but could potentially lift the performance up to the desired 1000 to 1200 inferences/second on AlexNet (currently we're at around 600 to 700 for the W9100 and RX 480).
But yeah, let me do some more CodeXL ISA inspection and algorithm hacking before I port the new kernels to the standalone LibDNN, because the V2 kernels alone are not worth the effort yet.
Let me also inform you that int_tp (the offset type) should be 64 bits for desktop implementations of OpenCL. clinfo reports 64-bit addresses on the AMDGPU OpenCL implementation, and so does nVidia's. Having 32-bit offsets and 64-bit GPU pointers funnily leads to massively increased branching and register usage in both OpenCL compilers, which degrades performance.
I am still not able to fully get the Intel spatial kernels running on my systems, thus can't support it yet.
Stay tuned for more updates :)
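A minimal sketch of the offset-type point above (the int_tp name is LibDNN's; the macro and selection logic here are illustrative only):

#include <cstdint>

// When the device reports 64-bit addresses (CL_DEVICE_ADDRESS_BITS == 64),
// the kernel offset type should also be 64 bits so offsets match the pointer width;
// mixing 32-bit offsets with 64-bit GPU pointers costs branching and registers.
#ifdef USE_INDEX_64              // illustrative macro name
  typedef int64_t int_tp;        // desktop OpenCL/CUDA devices
#else
  typedef int32_t int_tp;        // 32-bit targets
#endif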
@bhack I would like to integrate the spatial kernels into libdnn first. Then they could be part of libdnn when it is merged into other DNN libraries. And first of all, I will work with @naibaf7 to get the spatial kernels working on his Skylake machine. Currently, both beignet and the closed-source OpenCL SDK work with the spatial kernels on my Skylake machine. We also have some beignet performance patches under review; after those patches are merged, beignet should reach similar performance to the official OpenCL SDK.
Yes, as here I mean libdnn standalone. IMHO one solution is that, given a device, libdnn returns the IR program and the best kernel name to launch.
@bhack Yes that's definitely the goal here.
@gongzg I have a feeling it has to do with the kernel driver or device identifier on my particular system. With your latest fix that makes the kernels beignet compatible, I'm again at the issue where beignet can not determine the optimal workgroup size, and thus runs very slowly. I guess I will try the closed source SDK next then, having absolutely no luck with beignet under Fedora so far.
@naibaf7 Would you try booting the machine from a Debian testing/unstable live image (modern kernel and drm)?
@bhack
Yes I'll put that on a USB stick for testing, since there's quite a lot of stuff on the system now that I need to work with. Pretty sad however that kernel 4.7.3 with beignet GIT master and Fedora 25 drm is still not working :(
Trying the closed source implementation now.
@naibaf7 Are you interested in the provisional Khronos OpenVX Neural Net Extension (slide 21)?
@bhack Nice slides. Thanks! :-)
I will notify you soon. Probably next week for the thread.
It is finally released at https://www.khronos.org/news/press/khronos-launches-dual-neural-network-standard-initiatives
@bhack
I now have an engine for pooling in LibDNN. There is currently only a non-deterministic (atomic) backward pass, but I will also follow up with a deterministic version. The non-deterministic one seems to be up to 5x faster than the cuDNN pooling backward pass in AlexNet; the forward pass is about the same speed.
On OpenCL AMD GPUs, the forward and backward passes are 2x faster in 2D, 3x faster in 3D than the Caffe default engine (using batch size 128 AlexNet).
Let me know how urgently it's needed in tiny-cnn, so I can plan a bit when to update the LibDNN standalone release...
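To illustrate where the non-determinism comes from: the atomic backward pass scatters gradients with atomic adds, so the floating-point accumulation order depends on scheduling. A minimal host-side analogue (not the actual LibDNN kernel):

#include <atomic>

// Atomic float add via a compare-exchange loop. When many threads add into the
// same location, the summation order (and thus the last rounding bits) varies
// from run to run - the same effect as in the atomic pooling backward pass.
inline void atomic_add_float(std::atomic<float>& target, float value) {
  float old = target.load();
  while (!target.compare_exchange_weak(old, old + value)) {
    // 'old' is refreshed on failure; retry with the updated value.
  }
}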
How much does this non-determinism impact the accuracy?
@bhack
Haven't done a training test yet (but it passes unit tests with a 1e-5 kappa value)... I guess the effect is smaller with pooling than with convolution, since no multiplications are involved and it's only summing up atomically over the kernel size (typically), so over around 4 to 12 values.
I'll let you know when I've tested convergence against the Caffe and cuDNN implementation.
I guess I will do a LeNet MNIST training test (don't have time to download ImageNet data now... x)) on OpenCL (AMD and nVidia) now :)
You already know that some kind of noise can help training... ;)
@bhack
Yeah well it depends, numerical and non-deterministic factors can go both ways...
We even noticed cuDNN training differences in the lab between nVidia Pascal and Maxwell because the kernels changed slightly ;)
The most important part is that forward passes stay as consistent as possible. At least when the hardware and software are fixed, the predictions should be the same on every forward pass.
Ideally also across engines and hardware; though that is not always the case.
LeNet MNIST result (Caffe example, 10'000 iterations, Adam solver):
nVidia OpenCL (LibDNN conv+pool):
I1006 00:02:08.143371 8464 solver.cpp:425] Test net output #0: accuracy = 0.9881
I1006 00:02:08.143398 8464 solver.cpp:425] Test net output #1: loss = 0.0713053 (* 1 = 0.0713053 loss)
AMD OpenCL (LibDNN conv+pool):
I1006 00:04:20.490528 8733 solver.cpp:425] Test net output #0: accuracy = 0.9877
I1006 00:04:20.490556 8733 solver.cpp:425] Test net output #1: loss = 0.0612534 (* 1 = 0.0612534 loss)
nVidia CUDA (LibDNN conv+pool):
I1006 00:10:45.200371 10265 solver.cpp:425] Test net output #0: accuracy = 0.9891
I1006 00:10:45.200395 10265 solver.cpp:425] Test net output #1: loss = 0.0563766 (* 1 = 0.0563766 loss)
nVidia CUDA (cuDNN conv+pool):
I1006 00:05:24.104274 9026 solver.cpp:425] Test net output #0: accuracy = 0.985
I1006 00:05:24.104305 9026 solver.cpp:425] Test net output #1: loss = 0.0878605 (* 1 = 0.0878605 loss)
So I guess at least in this simple example it works fine. Results are reproducible within +-0.002 accuracy.
Not so bad as a small scale test...
@bhack
Still funny that there is a small but consistent accuracy difference between the implementations, even after repeating the training many times :)... but yeah, it's a good small-scale test because repeating it only takes a few seconds, up to a minute on slower devices.
Yes, remember also the old BVLC/caffe#3168
@naibaf7 What about removing device and context creation from libdnn standalone and requiring that this kind of responsibility be handled by the third-party application and then passed to libdnn? Could this help code sync between Caffe's libdnn and the standalone version?
@bhack
There already is no context and device creation in LibDNN - in fact you need to wrap an existing device into ViennaCL for LibDNN to be able to probe device capabilities and manage the device internally.
This is the central duplication between LibDNN and Caffe - inside the device class:
https://github.com/naibaf7/libdnn/blob/master/src/device.cpp
It furthermore simplifies alternation of device probing between CUDA and OpenCL.
The issue then is that I can't use a standalone LibDNN in Caffe because it would make compilation of Caffe a multistep process, whereas it is very easy at the moment (when using ViennaCL and LibDNN, it's only one compile step and, apart from OS-provided libraries, just header-only dependencies on Boost and ViennaCL). And then there'd be problems with duplicated functionality, such as the device class. Now I can just use the existing management for both Caffe and LibDNN, otherwise I'd also have to start re-wrapping the OpenCL device from Caffe::device to LibDNN::device.
Sorry, I probably haven't explained well what I mean... Is it possible to pass only some device metadata to libdnn and to compile and execute the program on the app side? Apps could communicate timings to the tuner and retrieve new source to compile and execute.
@bhack
Not sure what you mean - only getting the kernel code and launch the kernel in your own OpenCL context without using LibDNN launch code then?
The problem here is that the tuning parameters and kernel properties need to be known to launch the kernels, as well as the data and parameter integrity checks that happen on the host-side before the OpenCL/CUDA kernels are being launched. So no, code generation/tuning and kernel launching are pretty much inseparable - hence the use of wrappers similar to cuDNN and BLAS frameworks.
What I mean, exactly, is to exchange only metadata. We could exchange metadata back and forth between libdnn and third-party apps, so we don't need to replicate bootstrap, compilation and execution code, because libdnn has only partial coverage of layers and so generally every third-party app will have its own setup, compile and launch code.
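A hypothetical sketch of what such a metadata-only exchange could look like (none of these names exist in libdnn; it is only meant to make the proposal concrete):

#include <string>
#include <vector>

// libdnn would hand back generated source plus the information needed to launch it;
// the host framework compiles and enqueues the kernel itself and reports timings back.
struct kernel_metadata {
  std::string source;                  // generated OpenCL/CUDA source
  std::string kernel_name;             // entry point to launch
  std::vector<size_t> global_work;     // suggested NDRange from the tuner
  std::vector<size_t> local_work;
  std::vector<std::string> arg_order;  // tensors/scalars to bind, in order
};

// e.g. kernel_metadata next = tuner.report_and_get_next(current, elapsed_ms);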
Yeah, but look at this example of how complex the launch code is and how it depends on kernel properties - I might be wrong here, but I think wrapping a device into ViennaCL or any other wrapper is way easier and more in line with how, for example, cuDNN works than keeping up with launch parameter conformity for all kernel variants - a small excerpt here:
int_tp ims = batch_size * channels;
for (int_tp i = 0; i < im_in_shape_.size(); ++i) {
  ims *= im_in_shape_[i];
}
LibDNN<Dtype>::SetMemory(bottom_diff, ims, 0, (Dtype) 0);

int_tp imsi = std::accumulate(im_in_shape_.begin(), im_in_shape_.end(),
                              1, std::multiplies<int_tp>());
int_tp imso = std::accumulate(im_out_shape_.begin(), im_out_shape_.end(),
                              1, std::multiplies<int_tp>());

int_tp imsw = 0;
if (bwalgo_ == LIBDNN_POOLING_BW_ALGO_DIRECT) {
  // Direct kernel iterates over input size
  imsw = imsi;
} else {
  // Atomic kernel iterates over output size
  imsw = imso;
}

int_tp lw0 = bw_tuner_->get_param<int_tp>("LW0");
int_tp lw1 = bw_tuner_->get_param<int_tp>("LW1");

#ifdef USE_GREENTEA
if (LibDNN<Dtype>::dev_ptr_->backend() == BACKEND_OpenCL) {
  viennacl::ocl::kernel &kernel =
      LibDNN<Dtype>::ocl_program_.get_kernel("pool_backward");
  viennacl::ocl::context &ctx =
      viennacl::ocl::get_context(LibDNN<Dtype>::dev_ptr_->id());

  kernel.local_work_size(0, lw0);
  kernel.local_work_size(1, lw1);
  kernel.local_work_size(2, 1);
  kernel.global_work_size(0, ((imsw - 1) / lw0 + 1) * lw0);
  kernel.global_work_size(1, ((channels * batch_size - 1) / lw1 + 1) * lw1);
  kernel.global_work_size(2, 1);

  switch (pool_method_) {
    case LIBDNN_POOLING_METHOD_MAX:
      if (use_top_mask_) {
        viennacl::ocl::enqueue(
            kernel(WrapHandle((cl_mem) top_diff, &ctx),
                   WrapHandle((cl_mem) bottom_diff, &ctx),
                   WrapHandle((cl_mem) top_mask, &ctx),
                   channels,
                   batch_size),
            ctx.get_queue());
      } else {
        viennacl::ocl::enqueue(
            kernel(WrapHandle((cl_mem) top_diff, &ctx),
                   WrapHandle((cl_mem) bottom_diff, &ctx),
                   WrapHandle((cl_mem) mask, &ctx),
                   channels,
                   batch_size),
            ctx.get_queue());
      }
      break;
    case LIBDNN_POOLING_METHOD_AVE:
      viennacl::ocl::enqueue(
          kernel(WrapHandle((cl_mem) top_diff, &ctx),
                 WrapHandle((cl_mem) bottom_diff, &ctx),
                 channels,
                 batch_size),
          ctx.get_queue());
      break;
    case LIBDNN_POOLING_METHOD_STO:
      viennacl::ocl::enqueue(
          kernel(WrapHandle((cl_mem) top_diff, &ctx),
                 WrapHandle((cl_mem) bottom_diff, &ctx),
                 WrapHandle((cl_mem) rand_idx, &ctx),
                 channels,
                 batch_size),
          ctx.get_queue());
      break;
  }
}
You'd have to duplicate about this amount of code times 8 (CUDA, OpenCL, forward, backward, convolution, pooling), and it will grow even more with more kernel variants. Whereas with a wrapper, you need not worry about kernel variants at all - it will just work, with few if any interface changes between versions of LibDNN.
I don't know why we would need to replicate all this code 8 times... E.g. in our tiny-dnn case, where we use CLCudaAPI, we have these two interfaces for kernels:
template <typename T> void SetArgument(const size_t index, const T &value): Method to set a kernel argument (l-value or r-value). The argument index specifies the position in the list of kernel arguments. The argument value can also be a CLCudaAPI::Buffer.
template <typename... Args> void SetArguments(Args&... args): As above, but now sets all arguments in one go, starting at index 0. This overwrites any previous arguments (if any). The parameter pack args takes any number of arguments of different types, including CLCudaAPI::Buffer.
Could this be asked of libdnn?
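For comparison, binding a libdnn-style kernel through those two CLCudaAPI methods could look roughly like this; only SetArgument/SetArguments come from the documentation quoted above, while the Kernel/Buffer types and the "conv_forward" name are assumptions based on the CLCudaAPI samples:

#include "clpp11.h"  // CLCudaAPI single-header (OpenCL flavour)

// Bind the arguments of a hypothetical "conv_forward" kernel.
void bind_arguments(CLCudaAPI::Kernel& kernel,
                    CLCudaAPI::Buffer<float>& in,
                    CLCudaAPI::Buffer<float>& out,
                    int channels, int batch_size) {
  // Either one argument at a time...
  kernel.SetArgument(0, in);
  kernel.SetArgument(1, out);
  kernel.SetArgument(2, channels);
  kernel.SetArgument(3, batch_size);
  // ...or all in one go, starting at index 0:
  kernel.SetArguments(in, out, channels, batch_size);
}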
@bhack
Yeah, you could launch the kernels with that; however, you still need a variant for each of the different kernels. And if I add a new kernel variant or change the interface to a kernel, you'd have to adapt all that again. Building a metadata exchange system that explains in all detail how the kernel arguments need to be set and what needs to be prepared before the kernel can be launched seems far more complex than wrapping the OpenCL device. I quite frankly don't see the point of making this more complex than it is / needs to be. The current binding code from tiny-dnn to libdnn is quite compact compared to what it would be otherwise... the interface has a clear definition of what tensors are expected, what function should be executed and what the effects on the tensors are. Extending the configuration structure with new options will not break compatibility with frameworks using libdnn.
What do others think about this? @CNugteren @edgarriba @hughperkins ?
It is not a particular problem for tiny-dnn, because libdnn is already a shared object for us. It was just an attempt to find a path that lets Caffe and tiny-dnn use libdnn directly as upstream and to collect more variants here (i.e. the Intel kernels).
@bhack
OK I see. I think the more sensible path I will take is to move a device abstraction framework from Caffe into LibDNN and base Caffe on LibDNN like that, but I don't have the time to try this modality out yet.
The metadata exchange would however definitely introduce too much maintenance and work overhead for me.
Adding kernel variants such as the Intel kernels is not a big deal; it can be accommodated in LibDNN, but I think I need to coordinate this with @gongzg first. Since they don't have separate efficient backward kernels in their implementation yet, it would be good to mix & match it with LibDNN backward kernels.
Are you sure that this is a plausible path? You still need all this stuff for launching kernels not covered by libdnn (fully connected, etc.), same as tiny-dnn.
@bhack
Yes, I think it is. Like MKL, cuDNN and clBLAS as well, LibDNN will stay an acceleration library providing fast implementations with a uniform interface where appropriate. It is not a complete DNN framework, so the rest remains Caffe's, torch's, tensor-flow's and tiny-dnn's responsibility.
Focus remains on stuff like performance portability, different compile backends (CUDA, OpenCL, eventually OpenMP), performance critical DNN kernels, easy integration to deep learning frameworks.
Yes, that is what I mean... how could you move the device abstraction from Caffe to libdnn? You still need it in Caffe for other kernels, just like tiny-dnn does.
@bhack
Not sure yet, maybe I'll also do some work on it. I'll have to see it in action and take a look at what operations it can do so far etc. first.
I've spoken with AMD and will continue to get their support for what I'm doing currently.
I definitely made similar observations in performance improvement when I implemented a simpler (only high-level optimizer; target independent) variant to XLA for the pooling operations in LibDNN.
@naibaf7 if you are interested take a look at https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler
@bhack Bumped to latest kernel versions and added pooling kernels. Should be straight-forward to use again, don't hesitate to ask questions.
Hope this is useful for tiny-dnn.
I have compiled and installed libdnn from github and got the following errors with tiny-dnn:
tiny-dnn/tiny_dnn/core/kernels/conv2d_op_libdnn.h:227:5: error: 'LibDNNConfig' is not a member of 'greentea'
greentea::LibDNNConfig config;
/tiny-dnn/tiny_dnn/core/kernels/conv2d_op_libdnn.h:229:5: error: 'config' was not declared in this scope
config.dev_ptr = dev_ptr_.get();
Any idea?
Thanks
Oh... no, it's an incompatibility issue. You need to ask them to change "LibDNNConfig" to "LibDNNConvConfig" in the source code.
Because LibDNN now also supports Pooling and Deconvolution, but they haven't updated their code to use that new version of LibDNN yet.
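Applied to the two lines from the error above, the change is just the type name (sketch only; the exact location in conv2d_op_libdnn.h may differ):

greentea::LibDNNConvConfig config;   // was: greentea::LibDNNConfig
config.dev_ptr = dev_ptr_.get();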
@edgarriba
Yup, and we should probably add tags for compatibility. Do you use LibDNN as a submodule with TinyDNN? If so, make sure that it always checks out the "latest known working" commit in accordance with your testing... The LibDNN interface will likely not stay stable over the next few months, as many updates will be coming.
Right, we are not checking versions right now. It would be worth doing. Opening an issue to fix that.