lizekang/ITDD

argmax between first encoder and second decoder


You are using argmax between the first encoder and the second decoder. How does backpropagation happen, since argmax is non-differentiable?

Unlike the original Deliberation Network (Xia et al., 2017), which proposes a joint learning framework based on a Monte Carlo method, we do not backpropagate from the second decoder to the first decoder, following the approach of "Modeling coherence for discourse neural machine translation".
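
A minimal sketch of the idea (hypothetical tensor names and toy dimensions, not the repository's code): because the first-pass logits are converted to hard token ids with `argmax`, the computation graph is cut there, and the second decoder's loss cannot reach the first decoder.

    import torch

    # toy first-pass decoder output: (batch, tgt_len, vocab)
    first_logits = torch.randn(2, 5, 100, requires_grad=True)

    # argmax returns integer token ids and has no gradient,
    # so the computation graph is cut at this point
    first_tokens = first_logits.argmax(dim=-1)

    # the second pass re-embeds the hard tokens and decodes them again
    embedding = torch.nn.Embedding(100, 32)
    second_decoder = torch.nn.Linear(32, 100)
    loss2 = second_decoder(embedding(first_tokens)).sum()
    loss2.backward()

    print(first_logits.grad)  # None: no gradient flowed back through argmax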

So you are updating the weights of the first decoder and the second decoder using their individual losses separately?

Then why are you detaching the first decoder at the end of every decoder time step?

Yes, we update the weights using their individual losses separately.
For the second question, could you provide some details?
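
As a rough illustration of that training scheme (a toy sketch with stand-in modules, not ITDD's sharded loss code): each loss is backpropagated on its own, and a single optimizer step then applies whatever gradients each backward pass produced.

    import torch
    import torch.nn.functional as F

    vocab, hidden = 100, 32
    first_decoder = torch.nn.Linear(hidden, vocab)    # stand-in for the first-pass decoder
    second_decoder = torch.nn.Linear(hidden, vocab)   # stand-in for the second-pass decoder
    optimizer = torch.optim.Adam(
        list(first_decoder.parameters()) + list(second_decoder.parameters()))

    states = torch.randn(8, hidden)
    target = torch.randint(0, vocab, (8,))

    # first-pass loss, backpropagated on its own
    loss1 = F.cross_entropy(first_decoder(states), target)
    loss1.backward()

    # second-pass loss, backpropagated on its own; in ITDD its gradients
    # stop at the argmax and never reach the first decoder
    loss2 = F.cross_entropy(second_decoder(states), target)
    loss2.backward()

    optimizer.step()       # one step applies both sets of accumulated gradients
    optimizer.zero_grad()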

When computing the gradients with loss.backward() for the second decoder, shouldn't the gradients of the encoder also change?

I tried printing the gradients of some of the encoder parameters, but they did not change:

    # first-pass decoder: compute its loss and backpropagate (sharded)
    print('first_decoder')
    batch_stats1 = self.train_loss.sharded_compute_loss(
        batch, first_outputs, first_attns, j,
        trunc_size, self.shard_size, normalization)

    # inspect the gradient of one context-encoder parameter
    for name, param in self.model.named_parameters():
        if 'encoder.htransformer.layer_norm.weight' in name:
            print(param.grad)

    # second-pass decoder: compute its loss and backpropagate (sharded)
    print('second_decoder')
    batch_stats2 = self.train_loss.sharded_compute_loss(
        batch, second_outputs, second_attns, j,
        trunc_size, self.shard_size, normalization)

    # the same parameter's gradient is unchanged after this second backward pass
    for name, param in self.model.named_parameters():
        if 'encoder.htransformer.layer_norm.weight' in name:
            print(param.grad)

In such cases, how do you ensure proper backpropagation?

The context encoder's gradients don't change because the second decoder only uses the knowledge representation and the first-pass output representation, so no gradient propagates to the context encoder (encoder.htransformer).
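
One way to verify this (a hypothetical helper, assuming a plain PyTorch model): zero all gradients, call `backward()` only on the second decoder's loss, and list the parameters that received no gradient. The context encoder's parameters should show up in that list.

    def report_params_without_grad(model):
        """Print parameters whose gradient is missing or all zero."""
        for name, param in model.named_parameters():
            if param.grad is None or param.grad.abs().sum().item() == 0.0:
                print('no gradient:', name)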

Thanks for your response, it cleared up a lot of things. By the way, have you ever used Gumbel-Softmax?

No, I haven't used Gumbel-Softmax, but I think it would work. You can give it a try.
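
For anyone trying that, a minimal sketch of a Gumbel-Softmax relaxation (toy shapes, not part of ITDD): with `hard=True` the forward pass still produces one-hot vectors, but the straight-through estimator lets gradients flow through the soft probabilities, so the second pass's loss could in principle reach the first-pass logits.

    import torch
    import torch.nn.functional as F

    # toy first-pass decoder logits: (batch, tgt_len, vocab)
    first_logits = torch.randn(2, 5, 100, requires_grad=True)

    # one-hot samples in the forward pass, soft gradients in the backward pass
    one_hot = F.gumbel_softmax(first_logits, tau=1.0, hard=True, dim=-1)

    # multiply the one-hot vectors with the embedding matrix instead of
    # doing an index lookup on argmax ids, so the path stays differentiable
    embedding = torch.nn.Embedding(100, 32)
    second_input = one_hot @ embedding.weight   # (batch, tgt_len, 32)

    loss2 = second_input.sum()
    loss2.backward()
    print(first_logits.grad is not None)        # True: gradients reach the first pass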