arxyzan/data2vec-pytorch

In data2vec.py

Closed this issue · 4 comments

In data2vec.py, at line 90,

y = self.ema.model(trg, ~mask, **kwargs)['encoder_states']

shouldn't it have been,

y = self.ema.model(trg, None, **kwargs)['encoder_states']

(going by the training strategy in the paper)?

Hello @HarshavardhanaTG,
As far as I remember, the EMA model must take `~mask`. You can also verify this in the original fairseq implementation (V1 only).

Hey @arxyzan, thank you so much for replying so quickly. Your repository has been of huge help!
    with torch.no_grad():
        self.ema.model.eval()

        if self.cfg.ema_transformer_only:
            y, layer_results = self.ema.model.extract_features(
                pre_encoder_features,
                padding_mask=padding_mask,
                min_layer=self.cfg.encoder_layers - self.average_top_k_layers,
            )
            y = {
                "x": y,
                "padding_mask": padding_mask,
                "layer_results": layer_results,
            }
        else:
            y = self.ema.model.extract_features(
                source=source,
                padding_mask=orig_padding_mask,
                mask=False,
            )

        target_layer_results = [l[2] for l in y["layer_results"]]

I think they did fix that issue upstream. It's entirely possible that I am mistaken; please let me know if I am wrong. I am a bit confused about this part, but the rest of your repo seemed absolutely fine. Thanks again!

@HarshavardhanaTG Sorry for the late response. The original implementation extracted the mask inside the forward method, whereas I decided to build it in the dataset and pass it as a parameter to the forward method. Either way is correct. The main thing to note here is that the original mask fed to the student model must be inverted and fed to the EMA (teacher) model.
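To illustrate the invariant described above, here is a minimal sketch, assuming a boolean mask where `True` marks positions hidden from the student (the tensor shapes and variable names are illustrative, not taken from the repo):

```python
import torch

batch, seq_len = 2, 8

# Mask built in the dataset: True = position masked out for the student.
student_mask = torch.zeros(batch, seq_len, dtype=torch.bool)
student_mask[:, 2:5] = True

# The same mask is inverted with `~` before being fed to the EMA teacher,
# as in `self.ema.model(trg, ~mask, **kwargs)`.
teacher_mask = ~student_mask

# The two views are exact complements: no position is in both,
# and together they cover the whole sequence.
assert not (student_mask & teacher_mask).any()
assert (student_mask | teacher_mask).all()
```

Since `~` on a boolean tensor is an element-wise logical NOT, inverting the mask (rather than passing `None`) is just an explicit way of handing the teacher the complementary view of the input.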

Thank you so much! That helps!