denizyuret/AutoGrad.jl

static computation

Opened this issue · 7 comments

Hi,
thanks for this nice package (and for Knet as well).
How difficult would it be to support static computation, at least for a limited set of operations? Here is a comparison with ReverseDiff.jl, where AutoGrad lags two orders of magnitude behind:

julia> f(x) = sum(x->x^2,x)
f (generic function with 1 method)

julia> v=rand(100);

julia> @benchmark grad(f)(v)
BenchmarkTools.Trial: 
  memory estimate:  411.38 KiB
  allocs estimate:  9398
  --------------
  minimum time:     1.068 ms (0.00% GC)
  median time:      1.088 ms (0.00% GC)
  mean time:        1.182 ms (6.49% GC)
  maximum time:     5.658 ms (78.79% GC)
  --------------
  samples:          4204
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> df! = ReverseDiff.compile_gradient(f,v)
(::#301) (generic function with 1 method)

julia> y=ones(v);

julia> @benchmark df!(y,v)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.353 μs (0.00% GC)
  median time:      11.426 μs (0.00% GC)
  mean time:        11.636 μs (0.00% GC)
  maximum time:     35.284 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

I encounter the same 100x slowdown if I increase the size to v=rand(1000).

Cheers,
Carlo

2. One major feature of Knet is that it supports dynamic computational
graphs, i.e. the ability to construct the CG at runtime so one can use
arbitrary Julia code and change the operations of the model every iteration.

Note that ReverseDiff also supports this. Tape reuse/compilation is simply an additional feature for when you do, in fact, have a static CG (common in many of the non-ML applications I'm targeting with ReverseDiff).
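
To make the distinction concrete, a minimal sketch of the two usage modes, reusing the f and v from the benchmark above (compile_gradient is the call shown there; ReverseDiff.gradient is the dynamic, re-taping entry point):

using ReverseDiff

f(x) = sum(x->x^2, x)
v = rand(100)

# Dynamic mode: the tape is re-recorded on every call, so the function
# may contain control flow that depends on the input values.
ReverseDiff.gradient(f, v)

# Static mode: build a compiled gradient once and reuse it; this is only
# valid when the computation graph never changes between calls.
df! = ReverseDiff.compile_gradient(f, v)
out = similar(v)
df!(out, v)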

@jrevels did you try running any Knet examples with ReverseDiff?

Nope, that could be fun. Looking at the examples it seems like (in most cases) it'd be as easy as switching out the lossgradient with a ReverseDiff-generated gradient rather than an AutoGrad-generated one?
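
For illustration, such a swap might look roughly like the following (a toy linear loss standing in for a hypothetical Knet lossgradient; the actual training loop is omitted):

using AutoGrad, ReverseDiff

# toy loss standing in for a Knet model's loss(w, x, y)
loss(w, x, y) = sum((x * w .- y) .^ 2)

# AutoGrad-style: grad(loss) returns a function computing the gradient
# with respect to the first argument, w
lossgradient_ag = grad(loss)

# ReverseDiff-style replacement: differentiate w.r.t. w, closing over the data
lossgradient_rd(w, x, y) = ReverseDiff.gradient(w -> loss(w, x, y), w)

x, y, w = rand(10, 3), rand(10), rand(3)
lossgradient_ag(w, x, y)   # gradient w.r.t. w via AutoGrad
lossgradient_rd(w, x, y)   # gradient w.r.t. w via ReverseDiff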

Yup - on the surface, ReverseDiff is standard operator-overloading reverse-mode AD, and supports all the things dynamically re-taping AD libraries generally support. Under the hood, there are a lot of Julia-specific optimizations thrown in, including per-instruction memory caching, mixed-mode AD and indexing elision. It's more in the ADOL-C tradition than the autograd tradition, where tape reuse is encouraged for code with static computation graphs.

I'm curious to see how code with dictionaries will fare. Theoretically, it should be fine, but it's not something I test for (I'm more in the traditional optimization world than the ML world). For example, ReverseDiff's API is currently only written to differentiate functions whose arguments are scalars or arrays (though dictionaries/arbitrary data structures are totally fair game within the function itself).
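
For instance, a made-up toy example of that distinction: only the (array) argument is differentiated, while the Dict lives inside the function body, so something like this should work in principle:

using ReverseDiff

function dictloss(θ)
    # arrange pieces of the flat parameter vector in a Dict inside the body
    p = Dict(:w => θ[1:3], :b => θ[4])
    x = [1.0, 2.0, 3.0]
    return (sum(p[:w] .* x) + p[:b])^2
end

θ = rand(4)
ReverseDiff.gradient(dictloss, θ)  # fine: the differentiated argument itself is a plain array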

See JuliaDiff/ReverseDiff.jl#77 for relevant discussion.