liamb315/CharacterGAN

Backproping Through Argmax

Opened this issue · 7 comments

Hey Liamb, really appreciate the work you're doing here.

For a long time, I've wanted to apply Adversarial networks to NLP. The main problem is that in a seq2seq generator, you have to use the argmax to predict the next word.

The problem with this is that you can't backprop through the argmax function -- it is non-differentiable. I thought of feeding the discriminator a direct embedding and doing nearest neighbors to predict best char/word.
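To make the problem concrete, here's a tiny numpy sketch (made-up logits over a 5-token toy vocabulary): the argmax output is piecewise constant in the logits, so its gradient is zero almost everywhere and nothing useful flows back.

```python
import numpy as np

# Toy logits over a 5-token vocabulary (made-up numbers).
logits = np.array([0.1, 2.3, 0.7, 1.9, 0.2])
token = int(np.argmax(logits))  # discrete choice: index 1

# Nudge every logit a little: the argmax is unchanged, so the chosen
# index is piecewise constant in the logits and its gradient is zero
# almost everywhere -- there is nothing to backpropagate through.
eps = 1e-4
nudge = eps * np.array([1.0, -1.0, 1.0, -1.0, 1.0])
perturbed = int(np.argmax(logits + nudge))
print(token, perturbed)  # 1 1
```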

I noticed from your commits that you have had some instability. If you want to discuss this further, I would be happy to chat with you on skype. My username is 'leavesbreathe'. This is an exciting idea you have. I'm convinced this is the way to go after studying this paper:

http://arxiv.org/abs/1511.05101

Thanks!

Hey there and thanks for the message!

Yes, this non-differentiability in the model presents considerable
challenges. Whether one chooses to argmax or to sample from the
probability distributions, you've immediately complicated the training.
However, one way I'm considering for skirting this challenge is as
follows:

At each step of the seq2seq generator, sample or argmax to produce the next
token to feed back into the generator. However, instead of feeding that
token to the discriminator, pass the entire softmax distribution to the
discriminator at each step. With this architecture, you can then calculate
gradients w.r.t. each token in the vocabulary. It should be noted that
this is an odd thing to do, though: the discriminator will be evaluating
not a sequence of tokens but a sequence of probability distributions.
Furthermore, the discriminator has no mechanism to see which token the
generator chose from each distribution at each time step in order to
produce this sequence of distributions.
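A quick numpy sketch of the idea (toy sequence length and vocab size, and a simple linear scorer standing in for the discriminator, all made up for illustration): because the score is a smooth function of the distributions, gradients w.r.t. every vocabulary entry at every step exist.

```python
import numpy as np

rng = np.random.default_rng(0)
T, V = 4, 6  # sequence length, vocabulary size (toy numbers)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# The generator emits logits at each step; we keep the full softmax
# distribution rather than collapsing it to the argmax token.
logits = rng.normal(size=(T, V))
dists = softmax(logits)  # shape (T, V), each row sums to 1

# A toy linear "discriminator" scoring the sequence of distributions.
W = rng.normal(size=(T, V))
score = float((W * dists).sum())

# The score is differentiable in the distributions; here the gradient
# w.r.t. each vocabulary entry at each step is simply W, so it can be
# chained back through the softmax into the generator.
grad_wrt_dists = W
print(grad_wrt_dists.shape)  # one gradient per token per step
```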

This seems like a potentially related idea to your passing of the direct
embedding.

Would certainly like to Skype and discuss further; my username is the same
as my GitHub: liamb315. When would be good for you? I have remarkably
open availability at the moment since I'm recovering from ACL surgery, so
I can make most reasonable PST times work.

Talk to you soon,
Liam

William Fedus

I agree with you that the best step is to use the argmax in the generator to produce the next word/char.

Passing the entire softmax distribution would be very tricky, though. Suppose your vocabulary was 40k words (which is a small vocab size).

The discriminator then has to consider 40k inputs per timestep, which is pretty considerable. You could think of it as a 40,000-dimensional word embedding, in a way. With chars it would be different, as you would only have a 100- or 150-dimensional vectorization.

But the problem with chars is simply that humans don't think by letter, they think by phrase or word. There are plenty of people who speak fluent English who can't spell a single word. The point is that when you generate content char by char, you're making the task incredibly difficult.

Added you on Skype -- would be happy to talk! Really interesting stuff for sure 👍

Definitely, character-level modeling certainly makes the task considerably more difficult, and I agree that it almost certainly doesn't match any sort of cognitive behavior (for instance, setnences are stlil otfen redabale even if letetrs are jubmled). Additionally, operating at character level also lengthens the time scale over which the RNN architectures need to store information and back-propagate gradients; in English, this is a lengthening factor of roughly 5.

However, one nice gain is that we don't require a dedicated GPU just to handle a 100k-word softmax layer. Also, I think it's quite interesting, while operating on inputs near the base level of any hierarchical structure (characters, pixels, sound pressures, etc.), to see the extent to which useful hierarchical information may be implicitly learned via training.

I'll be on Skype most of the day, feel free to ping me whenever! Look forward to hearing your thoughts on this.

Hey Liam, I tried resending you a contact request, so I think you should be added now? I've sent you a few messages. Just message me on Skype and we can go from there. Perhaps I'm doing something wrong. Again, it's "leavesbreathe".

I agree that hierarchy is nice, but at the same time, practicality is important. Subword neural nets may be the best way to go as a compromise.

You don't need a dedicated GPU for a regular softmax of 40k words. You can use tricks like a sampled softmax or hierarchical softmax, which I believe resemble human intuition more closely. Makes it much less expensive 👍
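For concreteness, here's a rough numpy sketch of the sampled-softmax idea (all dimensions and the target id are made up; note that a real implementation, e.g. TensorFlow's `sampled_softmax_loss`, also applies a log-proposal correction that this sketch omits): instead of normalizing over all 40k logits, you normalize over the target plus a small set of sampled negatives.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 40_000, 32, 64  # vocab size, hidden dim, sampled negatives

W = rng.normal(size=(V, d)) * 0.01  # output embedding matrix
h = rng.normal(size=d)              # hidden state for one timestep
target = 123                        # true next-word id (hypothetical)

# Full softmax cross-entropy: normalizes over all 40k logits -- this
# log-sum-exp over the whole vocabulary is the expensive part.
full_logits = W @ h
full_loss = -full_logits[target] + np.log(np.exp(full_logits).sum())

# Sampled softmax: normalize over the target plus k random negatives.
neg = rng.choice(V - 1, size=k, replace=False)
neg[neg >= target] += 1  # shift so the target is never drawn as a negative
cand = np.concatenate(([target], neg))
cand_logits = W[cand] @ h  # only k+1 dot products instead of 40k
sampled_loss = -cand_logits[0] + np.log(np.exp(cand_logits).sum())
```

Since the sampled normalizer sums over a subset of the full one, the sampled loss lower-bounds the full loss while touching only k+1 rows of W.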

Odd, I've already confirmed the request but don't see any messages from
you. I suppose you don't see my message back, either? Let me try a voice
call, perhaps that might work.

Ah, interesting -- I need to look further into sampled softmax and
hierarchical softmax!


Reinforcement learning, or specifically, REINFORCE, may be a compelling route forward to deal with non-differentiable operations in the graph. I'll keep you posted as I develop experiments.
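To sketch what I mean (a minimal numpy toy, with a single categorical "generator" over 5 tokens and made-up per-token rewards standing in for discriminator scores): the REINFORCE estimator weights grad-log-prob of the sampled token by its reward, and matches the exact gradient in expectation even though sampling itself is non-differentiable.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy setup: one categorical "generator" over 5 tokens and a fixed
# reward per token (hypothetical stand-ins for discriminator scores).
logits = np.array([0.5, -0.2, 0.1, 0.0, 0.3])
reward = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
p = softmax(logits)

# Exact gradient of E[reward] w.r.t. the logits (for comparison only):
# dE/dlogit_j = p_j * (r_j - E[r]).
exact = p * (reward - p @ reward)

# REINFORCE: sample tokens, then weight grad log p(token) = onehot - p
# by the sampled token's reward, and average.
n = 20_000
samples = rng.choice(5, size=n, p=p)
onehot = np.eye(5)[samples]
estimate = (reward[samples][:, None] * (onehot - p)).mean(axis=0)

print(np.abs(estimate - exact).max())  # small: the estimator is unbiased
```

A baseline subtracted from the reward (as in most practical REINFORCE setups) would shrink the variance of this estimate without changing its expectation.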

Hey Will, I think you're right that REINFORCE could potentially work really well. However, the biggest problem might be the size of the action space, as others have noted in text generation with REINFORCE.