denizyuret/AutoGrad.jl

static computation

Opened this issue · 7 comments

Hi,
thanks for this nice package (and for Knet as well).
How difficult would it be to support static computation, at least for a limited set of operations? Here is a comparison with ReverseDiff.jl, where AutoGrad lags two orders of magnitude behind:

julia> f(x) = sum(x->x^2,x)
f (generic function with 1 method)

julia> v=rand(100);

julia> @benchmark grad(f)(v)
BenchmarkTools.Trial: 
  memory estimate:  411.38 KiB
  allocs estimate:  9398
  --------------
  minimum time:     1.068 ms (0.00% GC)
  median time:      1.088 ms (0.00% GC)
  mean time:        1.182 ms (6.49% GC)
  maximum time:     5.658 ms (78.79% GC)
  --------------
  samples:          4204
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> df! = ReverseDiff.compile_gradient(f,v)
(::#301) (generic function with 1 method)

julia> y=ones(v);

julia> @benchmark df!(y,v)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.353 μs (0.00% GC)
  median time:      11.426 μs (0.00% GC)
  mean time:        11.636 μs (0.00% GC)
  maximum time:     35.284 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

I encounter the same 100x slowdown if I increase the size to v=rand(1000).

Cheers,
Carlo

2. One major feature of Knet is that it supports dynamic computational
graphs, i.e. the ability to construct the CG at runtime so one can use
arbitrary Julia code and change the operations of the model every iteration.

Note that ReverseDiff also supports this. Tape reuse/compilation is simply an additional feature for when you do, in fact, have a static CG (common in many of the non-ML applications I'm targeting with ReverseDiff).
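
To make the distinction concrete, a minimal sketch of the two usage modes, reusing the f and v from the benchmark above (compile_gradient is the call shown there; ReverseDiff.gradient is the dynamic, re-taping entry point):

using ReverseDiff

f(x) = sum(x->x^2, x)
v = rand(100)

# Dynamic mode: the tape is re-recorded on every call, so the function
# may contain control flow that depends on the input values.
ReverseDiff.gradient(f, v)

# Static mode: build a compiled gradient once and reuse it; this is only
# valid when the computation graph never changes between calls.
df! = ReverseDiff.compile_gradient(f, v)
out = similar(v)
df!(out, v)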

@jrevels did you try running any Knet examples with ReverseDiff?

Nope, that could be fun. Looking at the examples it seems like (in most cases) it'd be as easy as switching out the lossgradient with a ReverseDiff-generated gradient rather than an AutoGrad-generated one?
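
For illustration, such a swap might look roughly like the following (a toy linear loss standing in for a hypothetical Knet lossgradient; the actual training loop is omitted):

using AutoGrad, ReverseDiff

# toy loss standing in for a Knet model's loss(w, x, y)
loss(w, x, y) = sum((x * w .- y) .^ 2)

# AutoGrad-style: grad(loss) returns a function computing the gradient
# with respect to the first argument, w
lossgradient_ag = grad(loss)

# ReverseDiff-style replacement: differentiate w.r.t. w, closing over the data
lossgradient_rd(w, x, y) = ReverseDiff.gradient(w -> loss(w, x, y), w)

x, y, w = rand(10, 3), rand(10), rand(3)
lossgradient_ag(w, x, y)   # gradient w.r.t. w via AutoGrad
lossgradient_rd(w, x, y)   # gradient w.r.t. w via ReverseDiff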

Yup - on the surface, ReverseDiff is standard operator-overloading reverse-mode AD, and supports all the things dynamically re-taping AD libraries generally support. Under the hood, there are a lot of Julia-specific optimizations thrown in, including per-instruction memory caching, mixed-mode AD and indexing elision. It's more in the ADOL-C tradition than the autograd tradition, where tape reuse is encouraged for code with static computation graphs.

I'm curious to see how code with dictionaries will fare. Theoretically, it should be fine, but it's not something I test for (I'm more in the traditional optimization world than the ML world). For example, ReverseDiff's API is currently only written to differentiate functions whose arguments are scalars or arrays (though dictionaries/arbitrary data structures are totally fair game within the function itself).
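
For instance, a made-up toy example of that distinction: only the (array) argument is differentiated, while the Dict lives inside the function body, so something like this should work in principle:

using ReverseDiff

function dictloss(θ)
    # arrange pieces of the flat parameter vector in a Dict inside the body
    p = Dict(:w => θ[1:3], :b => θ[4])
    x = [1.0, 2.0, 3.0]
    return (sum(p[:w] .* x) + p[:b])^2
end

θ = rand(4)
ReverseDiff.gradient(dictloss, θ)  # fine: the differentiated argument itself is a plain array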

See JuliaDiff/ReverseDiff.jl#77 for relevant discussion.