HuwCampbell/grenade

Gradient descent in C?

cpennington opened this issue · 9 comments

Why did you choose to write the gradient descent code in C, rather than using the library you used for the other matrix computations? Would you get a speedup by doing the descent in hblas?

In a word: fusion; or rather, the lack of it.

I had a version using hmatrix, but profiling showed it was taking up a large proportion of the runtime. I believe that was because it couldn't unroll the loops and work on one value at a time. The C rewrite was a good deal faster, and I have a benchmark for it in the suite (though I can't remember the exact speedup right now).

HBLAS might do it better, but again, it's mostly a fusion issue. One might also do better by using SIMD aggressively.
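To make the fusion point concrete, here is roughly what a momentum step looks like when written against hmatrix. This is a sketch with assumed names and update rule, not grenade's actual source: every `scale` and elementwise `(+)`/`(-)` allocates a fresh matrix and makes its own pass over the data, which is exactly what a hand-written C loop avoids.

```haskell
import Numeric.LinearAlgebra

-- Sketch only: a plain SGD-with-momentum step in hmatrix.
-- Each 'scale', '+', and '-' below builds a whole new matrix,
-- so the data is traversed several times with temporaries in between.
descend :: Double -> Double -> Double    -- rate, momentum, regulariser
        -> Matrix Double                 -- weights
        -> Matrix Double                 -- gradient
        -> Matrix Double                 -- previous update
        -> (Matrix Double, Matrix Double)
descend rate mom reg w g lastUpdate =
  let m' = scale mom lastUpdate - scale rate g
      w' = w + m' - scale (rate * reg) w
  in (w', m')
```

A C rewrite can fuse all of that into a single loop that reads and writes each element once, with no intermediate allocations, which is where the speedup comes from.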

Ah, ok. I'm about this close (holds fingers close together) to trying to make an accelerate backend/branch/fork (but I'm not sure how much work that would take) to get fusion/GPU/SIMD for "free". Is that something you'd be interested in, if I could make it work?

(Unrelatedly, I've also got some outstanding changes to make various things instances of NFData so that you can better control parallelism, and Num so that you can add gradients together. I'm not sure if either one is worth bringing upstream, though).
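For concreteness, here is a minimal sketch of the two instances being described, using a made-up `Grad` record rather than grenade's actual gradient types (which are per-layer):

```haskell
{-# LANGUAGE DeriveAnyClass, DeriveGeneric #-}

import Control.DeepSeq             (NFData)           -- package: deepseq
import Control.Parallel.Strategies (parMap, rdeepseq) -- package: parallel
import GHC.Generics                (Generic)

-- Stand-in gradient type for illustration only.
data Grad = Grad { dW :: [Double], dB :: [Double] }
  deriving (Generic, NFData)

-- Num so gradients can be summed; everything is elementwise.
instance Num Grad where
  Grad w b + Grad w' b' = Grad (zipWith (+) w w') (zipWith (+) b b')
  Grad w b - Grad w' b' = Grad (zipWith (-) w w') (zipWith (-) b b')
  Grad w b * Grad w' b' = Grad (zipWith (*) w w') (zipWith (*) b b')
  abs    (Grad w b)     = Grad (map abs w)    (map abs b)
  signum (Grad w b)     = Grad (map signum w) (map signum b)
  -- An infinite "broadcast" so literals like 0 combine with any shape;
  -- it should only ever be consumed via zipWith, never forced alone.
  fromInteger n         = Grad (repeat (fromInteger n)) (repeat (fromInteger n))

-- With NFData in place, per-shard gradients can be fully forced in
-- parallel before being summed; 'gradOf' is a hypothetical
-- per-shard gradient computation supplied by the caller.
totalGrad :: (shard -> Grad) -> [shard] -> Grad
totalGrad gradOf shards = foldl1 (+) (parMap rdeepseq gradOf shards)
```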

I would be interested (especially if there are benchmarks). In grenade, for most networks, most of the runtime is spent in matrix-matrix multiplications, which is pretty much what you want. I know CUDA/cuDNN would be faster, but I'm not sure how well accelerate handles the tasks we need.

If you're using LSTMs, probably the one thing that would give the biggest easy improvement is proper minibatching. Matrix-matrix multiplications with BLAS are far more efficient than n matrix-vector multiplications: a minibatch of 50 examples runs in about the same time as 5 separate matrix-vector products, for instance.
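As a sketch of that point in hmatrix (the layer and variable names are illustrative): stacking the inputs as columns turns n matrix-vector products into one matrix-matrix product, which BLAS handles far better.

```haskell
import Prelude hiding ((<>))   -- avoid the Semigroup clash on recent GHCs
import Numeric.LinearAlgebra

-- n separate matrix-vector products, one per example:
unbatched :: Matrix Double -> [Vector Double] -> [Vector Double]
unbatched w = map (w #>)

-- one matrix-matrix product over the whole minibatch:
batched :: Matrix Double -> [Vector Double] -> [Vector Double]
batched w xs = toColumns (w <> fromColumns xs)
```

Both compute the same outputs; the batched version just hands BLAS one big multiplication to optimise instead of many small ones.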

As for the Num and NFData instances, that sounds reasonable, and I have also thought about adding them. The main reasons I didn't just make them Num and call it a day were efficiency and API usage; but I'm happy to look at anything you've come up with.

I added the updateGradients function to the UpdateLayer class so one could efficiently add gradients before an update, but it's clunky.
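Roughly the shape in question — this is a hypothetical rendering, not grenade's actual class or signatures — with a fold that combines many gradients before paying for a single update:

```haskell
{-# LANGUAGE TypeFamilies #-}

-- Hypothetical stand-ins; grenade's real definitions differ in detail.
data LearningParameters = LearningParameters
  { learningRate        :: Double
  , learningMomentum    :: Double
  , learningRegulariser :: Double
  }

class UpdateLayer layer where
  type Gradient layer :: *
  runUpdate :: LearningParameters -> layer -> Gradient layer -> layer

-- Given a Num instance for the gradient type, several gradients can be
-- summed first and applied with one runUpdate call (list must be non-empty).
applyMany :: (UpdateLayer l, Num (Gradient l))
          => LearningParameters -> l -> [Gradient l] -> l
applyMany params layer = runUpdate params layer . foldl1 (+)
```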

Thanks for the issue :)

So, I've started poking at an accelerate backend. I think I'm going to have to get a fair way into it before I figure out what the speed change is, though. I'll let you know what I see.

I'm at ICML at the moment, and have spoken with a few people who are interested in helping out in this effort. I might also talk to Trevor (who wrote accelerate) next meetup to see if he has any advice.

Neat. I'm happy to put what I have so far up on a branch... It's a bit fragmented, but as a first stab I'm trying to replicate im2col in order to test it against the benchmarks.
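For reference, a minimal stride-1, single-channel im2col sketch in hmatrix (the accelerate version being discussed would be written against Data.Array.Accelerate instead): each row of the result is one flattened kH × kW patch, so convolution collapses into a single matrix-vector product.

```haskell
import Numeric.LinearAlgebra

-- Sketch: lay out every kH x kW window of the image as a row,
-- scanning top-to-bottom, left-to-right with stride 1.
im2col :: Int -> Int -> Matrix Double -> Matrix Double
im2col kH kW img = fromRows
  [ flatten (subMatrix (r, c) (kH, kW) img)
  | r <- [0 .. rows img - kH]
  , c <- [0 .. cols img - kW]
  ]

-- Convolving with one kernel is then just:
--   im2col kH kW img #> flatten kernel
```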

My main dev laptop isn't CUDA-friendly, so I won't be able to test the upper limits. Also, I suspect much of the improvement will come once you're actually stacking multiple layers together and the fusion starts kicking in. In the project that's motivating all of this work, I've noticed that the garbage collector is quite active in general.

If there's anything I can do to help with an accelerate backend, let me know. I was about to take a look myself.

I chatted with Trevor today, and he is also interested in getting this working.

Just noticed this, figured it should be linked from here since it seems relevant: #38