renmengye/tensorflow-forward-ad

generating forward-mode graphs by calling tf.gradients twice

mattjj opened this issue · 18 comments

Based on some autograd code from @j-towns, @alextp and I recently had this idea for getting forward-mode JVPs in TensorFlow by calling tf.gradients twice:

import numpy as np
import numpy.random as npr
import tensorflow as tf

def fwd_gradients(ys, xs, d_xs):
  """Forward-mode pushforward analogous to the pullback defined by tf.gradients.
  With tf.gradients, grad_ys is the vector being pulled back, and here d_xs is
  the vector being pushed forward."""
  v = tf.placeholder(ys.dtype, shape=ys.get_shape())  # dummy variable
  g = tf.gradients(ys, xs, grad_ys=v)
  return tf.gradients(g, v, grad_ys=d_xs)

A = tf.constant(npr.randn(5, 3), dtype=tf.float32)
x = tf.placeholder(tf.float32, [1, 5])
y = tf.tanh(tf.matmul(x, A))
u = tf.placeholder(tf.float32, [1, 5])

jvp = fwd_gradients(y, x, u)

x_val = npr.randn(1, 5)
u_val = npr.randn(1, 5)

init_op = tf.global_variables_initializer()
with tf.Session() as sess:
  sess.run(init_op)
  print(sess.run(jvp, feed_dict={x: x_val, u: u_val}))

The idea is that when a is a linear function of b, calling tf.gradients(a, b, u) applies the adjoint of that linear function to u. The same idea lets us get the adjoint of a vector-Jacobian product operator produced by tf.gradients applied to a nonlinear function, yielding a Jacobian-vector product operator for that nonlinear function. Notice that there's extra work being done ahead-of-time (i.e. graph construction time), but not at runtime (i.e. graph evaluation time), since the extra nodes being created by the first call to tf.gradients aren't needed for evaluating the JVP graph that results.
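As a quick sanity check of that adjoint claim (a minimal sketch, not part of the original script): for a linear map a = M b, tf.gradients(a, b, grad_ys=u) should return M^T u.

import numpy as np
import numpy.random as npr
import tensorflow as tf

M_val = npr.randn(4, 3).astype(np.float32)
M = tf.constant(M_val)
b = tf.placeholder(tf.float32, [3, 1])
a = tf.matmul(M, b)  # a is a linear function of b
u = tf.placeholder(tf.float32, [4, 1])

adjoint_u, = tf.gradients(a, b, grad_ys=u)  # should equal M^T u

with tf.Session() as sess:
  b_val = npr.randn(3, 1).astype(np.float32)
  u_val = npr.randn(4, 1).astype(np.float32)
  out = sess.run(adjoint_u, feed_dict={b: b_val, u: u_val})
  print(np.allclose(out, M_val.T.dot(u_val)))  # expect True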

We haven't worked out all the details yet (or even tested it beyond the above script) so there could be some subtleties here that I'm missing, but I think in principle the TF graph that gets built by the fwd_gradients function above should be about as FLOP- and memory-efficient as possible. That statement has a lot of uncertainty on it at the moment, though!

We just wanted to give you a heads-up about this idea, and also to see if you had any thoughts.

Thanks a lot for the post. I like your idea, which is much more elegant in that it reuses the built-in reverse-mode AD in TensorFlow (although from an implementation perspective, forward mode should be a lot easier than reverse mode). I have tested your idea in my code base, and it passes all the unit tests except this one, which I reported as an issue to TF: tensorflow/tensorflow#9368.

I can take a look, but this is one of the core modules of TensorFlow, so it is better changed by one of their core developers. In the meantime, you can use other versions of cross entropy, which should be fine.

This is a really neat trick! I believe this can be as time- and memory-efficient at runtime as an explicit implementation of the forward-mode computation, but it depends on the details of how the gradient of the gradient of each op involved is implemented. Mathematically, a certain portion (grad_v below) of the gradient of the gradient always gives the desired forward-mode computation. However, it is possible to implement the gradient of the gradient in a way that results in extra computation, e.g. if the gradient of the gradient were written as a single monolithic op rather than decomposed into multiple ops.

Consider the simple case of an op f with one input and one output, say y = f(x). Then grad_x = g(grad_y, x), where g would often consist of a single op, and that's reasonable. Let's change variable names and write u = g(v, x). Here g is linear in v. Then (grad_v, grad_x_more) = h(grad_u, v, x). Mathematically the first component grad_v of the output will be the desired forward-mode output when grad_u is the forward-mode input (and will not depend on v). However if h is implemented as a single op which computes both grad_v and grad_x_more, or several ops but where the parent ops of grad_v include calculations not required to compute grad_v, then extra computation will be performed. I have no idea how prevalent one case is versus the other for gradient of gradient ops in current tensorflow.
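To make this concrete, here is a small sketch for the scalar case f = tanh (my own example, not from the comment above), checking that grad_v equals grad_u * f'(x) with no dependence on v, while grad_x_more equals grad_u * v * f''(x):

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [])
v = tf.placeholder(tf.float32, [])
grad_u = tf.placeholder(tf.float32, [])

y = tf.tanh(x)                       # y = f(x)
u, = tf.gradients(y, x, grad_ys=v)   # u = g(v, x) = v * (1 - tanh(x)^2), linear in v
grad_v, grad_x_more = tf.gradients(u, [v, x], grad_ys=grad_u)

with tf.Session() as sess:
    x_val, v_val, gu_val = 0.3, 1.7, 2.0
    gv, gx = sess.run([grad_v, grad_x_more],
                      feed_dict={x: x_val, v: v_val, grad_u: gu_val})
    fp = 1.0 - np.tanh(x_val) ** 2                # f'(x)
    fpp = -2.0 * np.tanh(x_val) * fp              # f''(x)
    print(np.allclose(gv, gu_val * fp))           # grad_v = grad_u * f'(x), independent of v
    print(np.allclose(gx, gu_val * v_val * fpp))  # grad_x_more depends on v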

Using tf.placeholder as @mattjj suggested seems to offer some protection against the case where unnecessary computation is done, since it seems unlikely there are many cases where grad_x_more doesn't depend on v. If grad_x_more or similar is inadvertently being computed then the attempt to access v will result in a "placeholder value must be specified" error at runtime.

Indeed grad_x_more is linear in v, and the only way grad_x_more can be independent of v for all grad_u and x is if the second derivative of the original function f vanishes everywhere, i.e. f is an affine function of x, i.e. f(x) = A x + b for some fixed A and b (not inputs to the op f). This could conceivably be true for some ops (e.g. an FFT op), but these cases seem likely to be few and far between, and even in these cases the gradient of the gradient may still be implemented in an efficient way.

So it seems like using tf.placeholder means it's pretty likely that any unnecessary computation in the forward-mode computation will result in a "placeholder value must be specified" error.

On the line

return tf.gradients(g, v, grad_ys=d_xs)

in @mattjj's code above, you can see that we tell TF to take the grad of g w.r.t. v. I think the grad w.r.t. x won't be computed; if TF always took gradients with respect to every parent node in a graph, that could be very inefficient.

So it seems like using tf.placeholder means it's pretty likely that any unnecessary computation in the forward-mode computation will result in a "placeholder value must be specified" error.

There may be cases where this would happen, although to my mind that seems unlikely: TF should know that g is linear as a function of v and sever the unnecessary connections in the graph. This has also been discussed in Theano/Theano#6035 and HIPS/autograd#175 (comment).

For more detail see my blog post: https://j-towns.github.io/2017/06/12/A-new-trick.html.

Thanks for sharing this great trick!

Regarding @j-towns's comment above, I have come across a case where TF complains about v not being fed, but the value appears not to make a difference when I pass it in. The problem arises when using the Cholesky decomposition (NB tensorflow/tensorflow@473a590 is required for this):

import numpy as np
import numpy.random as npr
import tensorflow as tf
from autograd import jacobian  # assumption: the `jacobian` helper used below comes from autograd

float_type = tf.float64
float_type_np = np.float64

def fwd_gradients_with_v0(ys, xs, d_xs, v0):
    v = tf.placeholder_with_default(v0, shape=ys.get_shape())
    g = tf.gradients(ys, xs, grad_ys=v)
    return tf.gradients(g, v, grad_ys=d_xs)

def test_equivalence(f_tf, f_np, X_val, U_val):
    X = tf.placeholder(float_type, X_val.shape)
    U = tf.placeholder(float_type, U_val.shape)
    Y = f_tf(X)
    
    Y_val = f_np(X_val)  
    v_zeros = np.zeros_like(Y_val, dtype=float_type_np)
    v_rand = np.array(npr.randn(*Y_val.shape), dtype=float_type_np)
        
    a_np = jacobian(f_np)(X_val) @ U_val  # slow way

    with tf.Session() as sess:
        assert np.allclose(sess.run(Y, {X: X_val}), Y_val)  # check np/tf agree 

        a_tf_zeros,  = sess.run(fwd_gradients_with_v0(Y, X, U, v_zeros),  feed_dict={X: X_val, U: U_val})
        a_tf_rand,  = sess.run(fwd_gradients_with_v0(Y, X, U, v_rand),  feed_dict={X: X_val, U: U_val})

        print('zeros: {}'.format(np.max(np.abs(a_np - a_tf_zeros))))
        print('rand: {}'.format(np.max(np.abs(a_np - a_tf_rand))))

        try: 
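            # the 'dummy' method: fwd_gradients from the original post, with a plain tf.placeholder for v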
            a_tf, = sess.run(fwd_gradients(Y, X, U), feed_dict={X: X_val, U: U_val})
            print('dummy: {}'.format(np.max(np.abs(a_np - a_tf))))
        except tf.errors.InvalidArgumentError:
            print('dummy method failed')

With e.g. matrix inverse this all works fine:

N = 3
L = npr.randn(N, N)
X_val = (L @ L.T).flatten()
U_val = npr.randn(N**2)

f_tf = lambda X: tf.reshape(tf.matrix_inverse(tf.reshape(X, [N, N])), [N**2, ])
f_np = lambda X: np.linalg.inv(X.reshape(N, N)).flatten()
test_equivalence(f_tf, f_np, X_val, U_val)

zeros: 8.387360139749944e-09
rand: 8.387360139749944e-09
dummy: 8.387360139749944e-09

But with cholesky the dummy method fails:

f_tf = lambda X: tf.reshape(tf.cholesky(tf.reshape(X, [N, N])), [N**2, ])
f_np = lambda X: np.linalg.cholesky(X.reshape(N, N)).flatten()
test_equivalence(f_tf, f_np, X_val, U_val)

zeros: 1.5365486660812167e-12
rand: 1.5365486660812167e-12
dummy method failed

but the other two methods give the same answer. Is something fishy going on, or is this all fine?

EDIT: reading @MattShannon's comment further up, I realise this is probably to do with the way the Cholesky gradient is implemented, and that there is likely some redundant computation, but it shouldn't affect the result. My mind is at rest. Annoying to have to pass v0, though...
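One possible way around feeding v0 by hand (an untested sketch, not from the thread) is to default the placeholder to zeros of the right shape, accepting that both reverse passes may then actually execute:

def fwd_gradients_default_zeros(ys, xs, d_xs):
    # Same trick, but the dummy defaults to zeros so nothing has to be fed.
    # Mathematically the result does not depend on the dummy's value.
    v = tf.placeholder_with_default(tf.zeros_like(ys), shape=ys.get_shape())
    g = tf.gradients(ys, xs, grad_ys=v)
    return tf.gradients(g, v, grad_ys=d_xs)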

@hughsalimbeni I also ran into this problem, where TF complains about placeholders when I have a list of variables for u; using tf.ones_like(ys) for v seems to work for me.

I believe I have identified a bug where this trick does not work for the tf.nn.elu activation. Computing the Jacobian-vector product at some x < 0 gives an incorrect result only for the ELU operation.

Consider the following minimal example, which computes the Jacobian numerically, with the VJP, and with the JVP. Note that if you replace tf.nn.elu with my_elu below, the results are correct.

If someone knows of a workaround to make this work for the built-in ELU I would greatly appreciate it, since using the my_elu function incurs a 2-3x reduction in performance.

import tensorflow as tf
import numpy as np

def fwd_gradients(ys, xs, d_xs):
    dummy = tf.zeros_like(ys)
    g = tf.gradients(ys, xs, grad_ys=dummy, name="gradients")
    return tf.gradients(g, dummy, grad_ys=d_xs, name="jvp")

def my_elu(x):
    return tf.where(x >= 0.0, x, tf.exp(x) - 1.0)

def main():
    print(tf.__version__)

    sess = tf.InteractiveSession()
    init = tf.global_variables_initializer()
    
    activation = tf.nn.elu # works correctly with tf.nn.relu (or any other non-ELU activation)

    x_size = 1
    y_size = 1

    # Single ELU or RELU op
    X = tf.placeholder(tf.float64, shape=[x_size]) # Input
    Y = activation(X) # Output

    # Define vjp and jvp
    Vx = tf.placeholder(tf.float64, shape=[x_size]) # Jac-vector product input V
    Vy = tf.placeholder(tf.float64, shape=[y_size]) # vector-jac product input V
    jvp = fwd_gradients(Y, X, d_xs=Vx)
    vjp = tf.gradients(Y, X, grad_ys=Vy)

    # Compute jacobians
    x = np.ones(x_size) - 1.5 # Bug only occurs in x < 0 region
    tf_jac, numeric_jac = tf.test.compute_gradient(X, [x_size], Y, [y_size], x_init_value=x)
    vjp_jac = np.array([sess.run(vjp, feed_dict={X: x, Vy: v})[0] for v in np.identity(y_size)])
    jvp_jac = np.array([sess.run(jvp, feed_dict={X: x, Vx: v})[0] for v in np.identity(x_size)])

    # Print results as maximum absolute error
    print("Numeric jac:", numeric_jac)
    print("jvp jac:", jvp_jac)
    print("tf error:", np.max(np.abs(numeric_jac - tf_jac)))   # ~0.0
    print("vjp error:", np.max(np.abs(numeric_jac - vjp_jac))) # ~0.0
    print("jvp error:", np.max(np.abs(numeric_jac - jvp_jac))) # LARGE! for ELU

    sess.close()

if __name__ == '__main__':
    main()

Tested on tensorflow 1.4.0, 1.8.0, and latest nightly 1.9.0-dev20180515

@teaghan I guess if the input is 100-dimensional and the output is 10-dimensional, it may be more efficient to just use back-propagation, since for forward mode each input dimension needs a separate differentiation pass.

@renmengye by back-propagation do you mean just using the normal tf.gradients(y, x)? I have been doing this, but I was hoping for a faster method, as this is quite expensive with a large model and a large number of inputs and outputs (the example shown is a simplification). Looking at @zero-impact's example, would running jvp_jac = np.array([sess.run(jvp, feed_dict={X: x, Vx: v})[0] for v in np.identity(x_size)[x_indices]]) make any sense, where x_indices are my input features of interest?

@teaghan If you know which input feature dimensions you are interested in, then it makes sense to use forward mode. u_val is the gradient seed fed at the inputs. If you know which scalar input you want to differentiate with respect to, then you can leave u_val as None. This is similar to the optional gradients that can be fed at the output when you do back-propagation.

I believe there are many function primitives for which the implementation using a placeholder for v would fail. For example, using exactly the same code as shown by @mattjj with TF 1.13, many functions fail, complaining that the placeholder is not specified. Passing cases include tanh (the original post), exp, sigmoid and pow, while failing cases include tan, sin, cos, cosh, sinh, log and many more, such as nn.elu as mentioned above. Therefore, the placeholder implementation is not very robust unless you can make sure no primitive fails in your codebase.

Of course, one can pass something like v=tf.ones_like(ys) and still get the correct result for the gradients. But I suspect that in such cases the two rounds of reverse AD are actually executed by TensorFlow, so nothing is gained. I guess the reason the placeholder approach fails is that the linearity in v is only guaranteed in theory. In practice, the vector-Jacobian product is a complicated function, and for some function primitives the expected linearity in v is disguised by their backprop implementation, leaving TensorFlow unable to identify that g is linear in v and thus unable to prune the final graph.
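For anyone who wants to check their own set of primitives, here is a small probe along these lines (a rough sketch; which ops pass will of course depend on the TF version):

import numpy.random as npr
import tensorflow as tf

def fwd_gradients(ys, xs, d_xs):
    v = tf.placeholder(ys.dtype, shape=ys.get_shape())  # dummy variable
    g = tf.gradients(ys, xs, grad_ys=v)
    return tf.gradients(g, v, grad_ys=d_xs)

for op in [tf.tanh, tf.exp, tf.sigmoid, tf.tan, tf.sin, tf.cos, tf.log]:
    with tf.Graph().as_default(), tf.Session() as sess:
        x = tf.placeholder(tf.float32, [3])
        u = tf.placeholder(tf.float32, [3])
        jvp = fwd_gradients(op(x), x, u)
        try:
            sess.run(jvp, feed_dict={x: npr.rand(3) + 0.1, u: npr.randn(3)})
            print(op.__name__, 'ok')
        except tf.errors.InvalidArgumentError:
            print(op.__name__, 'needs the placeholder to be fed')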

Why did @mattjj pass npr.randn(1, 5) for grad_ys in this function? As per the TensorFlow documentation:
"grad_ys is a list of tensors of the same length as ys that holds the initial gradients for each y in ys. When grad_ys is None, we fill in a tensor of '1's of the shape of y for each y in ys. A user can provide their own initial grad_ys to compute the derivatives using a different initial gradient for each y (e.g., if one wanted to weight the gradient differently for each value in each y)."

Correct me if I am wrong, but I believe this means that if we pass a random vector instead of all ones for grad_ys, it acts as a weighting function. I checked on a small example and that does indeed seem to be the case:

tf.reset_default_graph()
A = tf.constant(npr.randn(5, 3), dtype=tf.float32)
B = tf.constant(npr.randn(3, 3), dtype=tf.float32)
x = tf.placeholder(tf.float32, [1, 5])
y = tf.nn.relu(tf.matmul(tf.tanh(tf.matmul(x, A)),B))
u = tf.placeholder(tf.float32, [1, 5])

jvp = fwd_gradients(y, x, u)
x_val = npr.randn(1, 5)
u_val = np.ones((1, 5))
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init_op)
    jac_man = (sess.run(jvp, feed_dict={x: x_val, u: u_val}))
    jac_jvp = (sess.run(tf.reduce_sum(jacobian(y, x),axis=2), feed_dict={x: x_val}))
    print('------------------------------------')
    print('forward jacobian calculated manually is: {}'.format(jac_man))
    print('forward jacobian calculated using power iteration method is: {}'.format(jac_jvp.T))

Here jacobian(y, x) is a function that evaluates the Jacobian matrix of a given function.
In the code above, if I pass npr.randn(1, 5) instead of np.ones((1, 5)) for u, I get incorrect values for the forward gradients.

conal commented

@renmengye wrote:

(although from an implementation perspective, forward mode should be a lot easier than reverse mode)

Only because people have been thinking about AD in an awkward way (in terms of graphs and “tensors”), including seeing reverse mode as involving a forward pass, reverse pass, mutation, “tapes”, etc. “The simple essence of automatic differentiation” shows that reverse-mode AD is not only as simple as forward mode, but that they’re both special cases of the same simple, general algorithm.

@jaweriaamjad it depends on what you want to compute. tf.gradients computes vector-Jacobian products u^T (dy/dx), and this trick computes Jacobian-vector products (dy/dx) u. The former is "how does a change in x affect y in the direction of u", the latter is "how does a change in x in the direction of u affect y?". You can set u to all ones, but there is nothing correct about that particular value. It just gets you the sum of the columns of the Jacobian.
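A tiny numpy illustration of the two products (my own, with made-up shapes):

import numpy.random as npr

J = npr.randn(3, 5)   # Jacobian dy/dx for y in R^3, x in R^5
u_out = npr.randn(3)  # vector in the output space
u_in = npr.randn(5)   # direction in the input space

vjp = u_out.dot(J)    # u^T (dy/dx): what tf.gradients(y, x, grad_ys=u_out) computes; shape (5,)
jvp = J.dot(u_in)     # (dy/dx) u:   what fwd_gradients(y, x, u_in) computes; shape (3,)
print(vjp.shape, jvp.shape)  # (5,) (3,)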

The JVP/VJP parlance wasn't clear to me initially, but I think it's because we're so used to the typical use case of reverse-differentiating a scalar output. For example, if y is a batch of losses, we tend to average it and then differentiate: tf.gradients(tf.reduce_mean(y), x). But you could equally well differentiate the batched losses directly and pass a uniform weighting vector over the losses, e.g. tf.gradients(y, x, grad_ys=tf.ones_like(y) / tf.cast(tf.size(y), y.dtype)). This is a more general view because it also applies the other way around (i.e. fwd_gradients above with the equivalent of a grad_xs argument).

@cooijmanstim so if I understand correctly, we would indeed pass all ones for u if we wish to calculate the sum of the columns of the Jacobian of the loss functions during training (since usually we don't want a weighted sum). The purpose of grad_ys is to give us the freedom to calculate a weighted sum.

@jaweriaamjad right, but to be clear, summing the columns means summing across x, not across y. It will not correspond to summing losses but to summing parameters. The shape of the output will be the same as that of y. To sum across y you want to use reverse-mode.

Also, a weighted sum is a particular case. More generally, grad_ys is used to implement the chain rule. Suppose the ith layer of a feedforward neural net computes output h[i] from h[i-1], and further down the line this results in some loss L. Then if you have dL/dh[i] (call it dL_dh[i] to get a valid Python identifier), you can recursively go back, computing the gradient dL/dh[i-1] via dL_dh[i-1] = tf.gradients(h[i], h[i-1], grad_ys=dL_dh[i]). This is backprop.
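Here is a minimal sketch of that recursion (my own example, with made-up shapes), checking that chaining grad_ys layer by layer matches a direct tf.gradients call:

import numpy as np
import numpy.random as npr
import tensorflow as tf

x = tf.placeholder(tf.float32, [1, 4])
W1 = tf.constant(npr.randn(4, 3), dtype=tf.float32)
W2 = tf.constant(npr.randn(3, 2), dtype=tf.float32)

h1 = tf.tanh(tf.matmul(x, W1))    # h[1]
h2 = tf.tanh(tf.matmul(h1, W2))   # h[2]
L = tf.reduce_sum(h2 ** 2)        # some scalar loss

dL_dh2, = tf.gradients(L, h2)
dL_dh1, = tf.gradients(h2, h1, grad_ys=dL_dh2)       # one chain-rule step, as described above
dL_dx_manual, = tf.gradients(h1, x, grad_ys=dL_dh1)  # another step, back to the input
dL_dx_direct, = tf.gradients(L, x)

with tf.Session() as sess:
    x_val = npr.randn(1, 4)
    manual, direct = sess.run([dL_dx_manual, dL_dx_direct], feed_dict={x: x_val})
    print(np.allclose(manual, direct))  # expect True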