R-N/ml-utility-loss

Gradient Penalty

Opened this issue · 0 comments

R-N commented

This method uses one main estimator and n adapters, one for each model to be estimated. The adapters are needed to account for the models' different preprocessing methods; their objective is to transform the preprocessed inputs into a single shared latent space. However, the gradient penalty calculation requires an input/loss pair for each adapter, which means n forward passes are required, multiplying the training time. Furthermore, n backward passes through different adapters may make it harder for the model to converge. Thus, one adapter is chosen as the role model (a hyperparameter), and the objective of the remaining non-role-model adapters is simply to produce the same intermediate latent representation, which makes sense because the underlying inputs are the same. This loss is called the embed loss.

However, the gradient penalty calculation still requires an input/loss pair for each adapter, while the non-role-model adapters never contribute to the loss. Thus, the following are proposed (code sketches follow the list):

  • Only compute one gradient penalty, namely for the role model.
  • Estimate the gradient penalty for non-role-model adapters manually by estimating the gradient at the intermediate tensor, using the role model's intermediate-tensor gradient plus the non-role-model's embed loss gradient. The logic is that a gradient indicates where the model should move, and a non-role-model adapter's representation should move toward the role model's representation (the embed loss gradient) plus wherever the role model's representation itself should move (its intermediate gradient). (See the sketch after this list.)
  • Calculate the gradient penalty for non-role-model adapters by averaging the intermediate tensors and doing a forward pass on the average. This means two forward passes through the estimator, which is still better than n passes. The gradient at each individual intermediate tensor is then obtained via autograd. Lastly, we may multiply it by the non-role-model count (to roughly compensate for the scaling introduced by the averaging) or leave it as is. (Also sketched after this list.)
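
For concreteness, here is a minimal sketch of the setup described above, assuming PyTorch; all names (`Adapter`, `Body`, `gradient_penalty`, `training_losses`) are hypothetical and not taken from the actual code. It shows the n cheap adapter passes, the single pass through the shared body, the role-model-only gradient penalty of the first proposal, and the embed loss that ties the other adapters to the role model's latent representation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Adapter(nn.Module):
    """Maps one model's preprocessed input into the shared latent space."""
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU(),
                                 nn.Linear(latent_dim, latent_dim))

    def forward(self, x):
        return self.net(x)


class Body(nn.Module):
    """Shared estimator operating on the latent representation."""
    def __init__(self, latent_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.ReLU(),
                                 nn.Linear(latent_dim, 1))

    def forward(self, z):
        return self.net(z)


def gradient_penalty(y, x):
    """Penalize the norm of the gradient of y w.r.t. the input x (x must require grad)."""
    g, = torch.autograd.grad(y.sum(), x, create_graph=True)
    return (g.flatten(1).norm(2, dim=1) ** 2).mean()


def training_losses(adapters, body, xs, target, role=0):
    """xs[i]: model i's preprocessed input, with requires_grad_(True) for the penalty.
    role: index of the role-model adapter (the hyperparameter mentioned above)."""
    zs = [adapter(x) for adapter, x in zip(adapters, xs)]  # n cheap adapter passes
    y_role = body(zs[role])                                # one pass through the body
    main_loss = F.mse_loss(y_role, target)                 # utility estimation loss
    gp = gradient_penalty(y_role, xs[role])                # proposal 1: role model only
    # Embed loss: each non-role-model adapter matches the role model's latent tensor.
    embed_losses = {i: F.mse_loss(zs[i], zs[role].detach())
                    for i in range(len(zs)) if i != role}
    return main_loss, gp, embed_losses, zs
```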
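
The second and third proposals could then look roughly like the sketch below, under the same assumptions (PyTorch, hypothetical names, identically shaped intermediate tensors). `estimated_gradient_penalty` builds the estimated gradient at each non-role-model intermediate tensor (role model intermediate gradient + embed loss gradient) and pushes it back to the input with a vector-Jacobian product, so the body is never run again; `averaged_gradient_penalty` runs one extra forward pass on the averaged intermediate tensor and lets autograd deliver the gradient to each individual input, optionally rescaled by the non-role-model count.

```python
import torch


def estimated_gradient_penalty(xs, zs, main_loss, embed_losses, role=0):
    """Proposal 2: approximate the input gradient for non-role-model adapters
    without any extra pass through the body. embed_losses maps each
    non-role-model index to its embed loss."""
    # Gradient the body actually sends back to the role model's intermediate tensor.
    g_role, = torch.autograd.grad(main_loss, zs[role],
                                  create_graph=True, retain_graph=True)
    penalty = 0.0
    for i, e_loss in embed_losses.items():
        # Direction toward the role model's representation ...
        g_embed, = torch.autograd.grad(e_loss, zs[i],
                                       create_graph=True, retain_graph=True)
        # ... plus where the role model's representation itself should move.
        g_est = g_role + g_embed
        # Vector-Jacobian product: push the estimated gradient back to the input.
        g_x, = torch.autograd.grad(zs[i], xs[i], grad_outputs=g_est,
                                   create_graph=True, retain_graph=True)
        penalty = penalty + (g_x.flatten(1).norm(2, dim=1) ** 2).mean()
    return penalty


def averaged_gradient_penalty(body, xs, zs, role=0, scale_by_count=True):
    """Proposal 3: one extra forward pass on the averaged intermediate tensor;
    autograd then distributes the gradient to every individual input."""
    z_mean = torch.stack(zs).mean(dim=0)  # assumption: the role model's tensor is included
    y_mean = body(z_mean)                 # the second (and last) pass through the body
    non_role = [i for i in range(len(xs)) if i != role]
    grads = torch.autograd.grad(y_mean.sum(), [xs[i] for i in non_role],
                                create_graph=True, retain_graph=True)
    scale = float(len(non_role)) if scale_by_count else 1.0
    return sum(((scale * g).flatten(1).norm(2, dim=1) ** 2).mean() for g in grads)
```

Because everything is built with `create_graph=True`, both penalties stay differentiable with respect to the adapter and body parameters and can simply be added to the total loss before the single backward pass.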