arxyzan/data2vec-pytorch

Mask value overflowed in audio pre-training

LuJunru opened this issue · 2 comments

Hi @arxyzan,

I ran into a rather strange bug: the mask values overflow during audio pre-training.

Here: https://github.com/arxyzan/data2vec-pytorch/blob/main/audio/encoder.py#L35. The mask is passed as mask_time_indices when computing the output hidden states.

The input mask is fine, a binary matrix of shape (B, L):
1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

However, after the computation the mask values have overflowed, like this:
1421638320, 1421638576, 1421638832, 1421639088, 1421639344, 1421639600, 1421639856, 1421640112,
1421640368, 1421640624, 1421640880, 1421641136, 1421641392, 1421641648, 1421641904, 1421642160...

Have you ever met such issues? By the way, this only happens when running train.py; debugging audio/encoder.py on its own does not trigger the bug.
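
For reference, a quick sanity check on the mask right before the encoder call looks like this (a minimal sketch with placeholder shapes; the variable names are illustrative, not the actual repo code):

# hypothetical sanity check for the (B, L) time mask before it reaches the encoder
import torch

mask = torch.zeros(2, 24, dtype=torch.long)  # placeholder (B, L) binary mask
mask[0, :4] = 1

assert mask.dtype in (torch.long, torch.bool)
assert set(mask.unique().tolist()) <= {0, 1}, "mask should only contain 0/1"
print(mask.dtype, mask.unique())  # torch.int64, tensor([0, 1])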

My env is:
torch 1.13.1
torchaudio 0.13.1
transformers 4.26.0
python 3.8.16

thanks,
Junru

Hello @LuJunru, thanks for your feedback.
This is indeed a strange bug. I remember facing this problem once in another project, and I fixed it by changing the mask tensor from torch.int64 to torch.bool in the encoder forward, like so:

# model forward in audio/encoder.py
def forward(self, inputs, mask=None, **kwargs):
    """
    Forward inputs through the encoder and extract transformer/attention layers outputs
    Args:
        inputs: raw audio array
        mask: bool masked indices
        **kwargs: keyword args specific to the encoder's forward method
    Returns:
        A dictionary of the encoder outputs including transformer layers outputs and attentions outputs
    """
    mask = mask.bool()  #<< CHANGE DTYPE LIKE THIS>>
    outputs = self.encoder(inputs, mask_time_indices=mask, output_hidden_states=True,
                           output_attentions=True, **kwargs)
    encoder_states = outputs['hidden_states'][:-1]  # encoder layers outputs separately
    encoder_out = outputs['hidden_states'][-1]  # last encoder output (accumulated)
    attentions = outputs['attentions']
    return {
        'encoder_states': encoder_states,
        'encoder_out': encoder_out,
        'attentions': attentions
    }

Please try this and let me know if it's resolved.
Best,
Aryan

Hi @arxyzan,

Thank you for the quick response. I followed your advice and added .bool() directly in audio/dataset.py, and it works.
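
For anyone running into the same issue, the change on the dataset side amounts to casting the mask to bool where it is constructed (a minimal sketch; the actual function and variable names in audio/dataset.py may differ):

# illustrative sketch of the dataset-side fix: build the (B, L) time mask
# and cast it to bool before it is handed to the encoder
import torch

def build_time_mask(batch_size: int, seq_len: int, masked_len: int) -> torch.Tensor:
    mask = torch.zeros(batch_size, seq_len, dtype=torch.long)
    mask[:, :masked_len] = 1
    return mask.bool()  # the .bool() cast that avoids the int64 overflow downstream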

best,
Junru