Justin-Tan/generative-compression

Question about gradient calculation in the quantizer

Opened this issue · 5 comments

  1. I found that you use "tf.stop_gradient()" to deal with the non-differentiable operations "tf.argmin()" and "tf.round()". However, "tf.stop_gradient()" removes the gradient contribution of the node it is applied to, which means your encoder network will not update its parameters, since all the nodes before the quantizer (specifically, the encoder network) are excluded from the gradient calculation.

  2. Are you trying to make the quantizer have a fixed gradient value (such as "1") at all times? If so, I think you have to re-define the gradient of the quantizer rather than use "tf.stop_gradient()" (see the sketch below).
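
To make point 2 concrete, here is a minimal sketch of what I mean by re-defining the gradient. The function name is just for illustration, and it assumes a TensorFlow version that provides "tf.custom_gradient" (older versions would need "tf.RegisterGradient" together with "gradient_override_map" instead):

```python
import tensorflow as tf

@tf.custom_gradient
def quantize_identity_grad(x):
    """Round in the forward pass, but back-propagate a gradient of 1."""
    y = tf.round(x)

    def grad(dy):
        # Pretend the quantizer is the identity: pass the upstream gradient
        # through unchanged instead of the true gradient, which is zero
        # almost everywhere.
        return dy

    return y, grad
```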

I will look into the first one; can you give more details about the second point?

I wonder how to deal with the gradient when I apply operations such as "tf.round". Will setting the gradient of these operations to 1 help? Could you offer some references? Thank you very much!
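
For concreteness, what I have in mind by "a gradient of 1" is the identity trick sketched below. This is not code from this repository, and the function name is mine:

```python
import tensorflow as tf

def round_with_unit_grad(x):
    """Behaves like tf.round in the forward pass but has a gradient of 1.

    Forward:  x + (round(x) - x) == round(x)
    Backward: the term inside tf.stop_gradient contributes nothing, so
    d(output)/d(x) == 1 and gradients still reach the network that produced x.
    """
    return x + tf.stop_gradient(tf.round(x) - x)
```

Note that "tf.stop_gradient" is only applied to the residual here, so unlike wrapping the whole quantizer, the encoder still receives gradients.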

Thanks a lot for your answer. I have another question: I wonder how the batch size affects the results. Have you ever tried a bigger batch size? Thanks again for your reply!

The batch size is limited by GPU memory. In recent papers on learned image compression the batch size is usually set to something low, like 8 or 16, so I imagine that other hyperparameters are more important to tune.