arxyzan/data2vec-pytorch

Trouble with audio....?

drscotthawley opened this issue · 2 comments

Aryan, thank you very much for sharing your code with the world. I wonder if you could advise:

I am trying to train by following the instructions for audio, but I haven't been able to get TIMIT or LibriSpeech to work.

TIMIT

For TIMIT, I get a message from HuggingFace that the dataset must be downloaded manually. The URL provided in the message leads to UPenn, who apparently want $250 for the dataset?? ...So, OK, I obtained a copy from a friend and also one from Kaggle. But in both cases the HF dataloader fails: it looks for lowercase filename patterns like '**test*' (all the filenames in both of my copies are uppercase), and it only accepts certain file extensions, which exclude the .DOC files that TIMIT provides:

Error message

  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/datasets/data_files.py", line 201, in resolve_patterns_locally_or_by_urls
    raise FileNotFoundError(error_msg)
FileNotFoundError: Unable to resolve any data file that matches '['**test*', '**eval*']' at /home/ubuntu/datasets/timit with any supported extension ['csv', 'tsv', 'json', 'jsonl', 'parquet', 'txt', 'blp', 'bmp', 'dib', 'bufr', 'cur', 'pcx', 'dcx', 'dds', 'ps', 'eps', 'fit', 'fits', 'fli', 'flc', 'ftc', 'ftu', 'gbr', 'gif', 'grib', 'h5', 'hdf', 'png', 'apng', 'jp2', 'j2k', 'jpc', 'jpf', 'jpx', 'j2c', 'icns', 'ico', 'im', 'iim', 'tif', 'tiff', 'jfif', 'jpe', 'jpg', 'jpeg', 'mpg', 'mpeg', 'msp', 'pcd', 'pxr', 'pbm', 'pgm', 'ppm', 'pnm', 'psd', 'bw', 'rgb', 'rgba', 'sgi', 'ras', 'tga', 'icb', 'vda', 'vst', 'webp', 'wmf', 'emf', 'xbm', 'xpm', 'zip']
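For reference, the failing call is essentially the repo's loader pointed at my local copy (the path is mine; 'clean' is the config hard-coded in audio/dataset.py):

from datasets import load_dataset

data = load_dataset("/home/ubuntu/datasets/timit", "clean")["train"]

(I assume the officially intended route for a manually downloaded copy would be something like load_dataset("timit_asr", data_dir="/home/ubuntu/datasets/timit"), but that's my guess from the HuggingFace message.)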

The files look like

│       PHONCODE.DOC
│       PROMPTS.TXT
│       SPKRINFO.TXT
│       SPKRSENT.TXT
│       TESTSET.DOC

If I take away the 'clean' directive in the load_dataset call, roughly like so (my edit to audio/dataset.py):
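# self.data = load_dataset(path, 'clean')[split]   # original
self.data = load_dataset(path)[split]              # my change: drop the config name

With that change the dataset loads, but training then fails with a key error: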

Epoch: 1/1000     0%|                                                                      | 0/31678 [00:00<?, ?batch/s]
Traceback (most recent call last):
  File "/home/ubuntu/shawley/data2vec-pytorch/train.py", line 25, in <module>
    trainer.train()
  File "/home/ubuntu/shawley/data2vec-pytorch/audio/trainer.py", line 142, in train
    train_loss = self.train_epoch(epoch)
  File "/home/ubuntu/shawley/data2vec-pytorch/audio/trainer.py", line 106, in train_epoch
    for batch in iterator:
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 570, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/shawley/data2vec-pytorch/audio/dataset.py", line 21, in __getitem__
    x = self.data[index]['audio']
KeyError: 'audio'

If I print out self.data just after it's loaded in your TIMIT class, there is no 'audio' key in the records:

print("self.data[0] = ", self.data[0])
self.data[0] =  {'index': 1, 'test_or_train': 'TRAIN', 'dialect_region': 'DR4', 'speaker_id': 'MMDM0', 'filename': 'SI681.WAV.wav', 'path_from_data_dir': 'TRAIN/DR4/MMDM0/SI681.WAV.wav', 'path_from_data_dir_windows': 'TRAIN\\\\DR4\\\\MMDM0\\\\SI681.WAV.wav', 'is_converted_audio': True, 'is_audio': True, 'is_word_file': False, 'is_phonetic_file': False, 'is_sentence_file': False}
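For what it's worth, the records do include file paths, so I imagine a workaround along these lines might be possible (an untested sketch; the column names come from the record above, the data_dir path and everything else are my guesses):

import os
from datasets import Audio

data_dir = "/home/ubuntu/datasets/timit"  # my local copy

# Keep only the rows that point at converted audio files, then build an
# 'audio' column from the relative path and let the Audio feature decode
# the WAVs on access.
data = data.filter(lambda rec: rec["is_converted_audio"])
data = data.map(lambda rec: {"audio": os.path.join(data_dir, rec["path_from_data_dir"])})
data = data.cast_column("audio", Audio(sampling_rate=16_000))

But that's a guess at working around the loader rather than fixing it.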

Are you able to comment or advise about getting TIMIT to work?

LibriSpeech

For LibriSpeech, I copied your TIMIT class in dataset.py and just hard-coded the name of the dataset:

class LibriSpeech(Dataset):
    def __init__(self, cfg, split, **kwargs):
        super(LibriSpeech, self).__init__()
        path = cfg.dataset.path
        # self.data = load_dataset(path, 'clean')[split]  # original TIMIT line
        self.data = load_dataset("librispeech_asr", 'clean')[split]  # hard-coded dataset name
        self.feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(cfg.model.encoder_checkpoint)
        self.__dict__.update(kwargs)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Decode the audio, then run the wav2vec2 feature extractor on the raw array
        x = self.data[index]['audio']
        x = self.feature_extractor(x['array'], sampling_rate=x['sampling_rate'],
                                   padding=True, return_tensors='pt')['input_values']
        return {'input_values': x[0]}
       

And then in trainer.py I just wrote:

        #self.train_dataset = TIMIT(cfg, 'train')
        #self.test_dataset = TIMIT(cfg, 'test')
        self.train_dataset = LibriSpeech(cfg, 'train.100')
        self.test_dataset = LibriSpeech(cfg, 'test')

In that case the data is loaded without errors, and the training begins but aborts with a series of CUDA errors:

Epoch: 1/1000     0%|                                                                      | 0/28539 [00:00<?, ?batch/s]../aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [5550,0,0], thread: [64,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [5550,0,0], thread: [65,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [5550,0,0], thread: [66,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

...hundreds more lines like this, and then...

Epoch: 1/1000     0%|                                                                      | 0/28539 [00:04<?, ?batch/s]
Traceback (most recent call last):
  File "/home/ubuntu/shawley/data2vec-pytorch/train.py", line 25, in <module>
    trainer.train()
  File "/home/ubuntu/shawley/data2vec-pytorch/audio/trainer.py", line 142, in train
    train_loss = self.train_epoch(epoch)
  File "/home/ubuntu/shawley/data2vec-pytorch/audio/trainer.py", line 107, in train_epoch
    loss = self.train_step(batch)
  File "/home/ubuntu/shawley/data2vec-pytorch/audio/trainer.py", line 65, in train_step
    x, y = self.model(src, src, mask)
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/shawley/data2vec-pytorch/data2vec/data2vec.py", line 83, in forward
    x = self.encoder(src, mask, **kwargs)['encoder_out']  # fetch the last layer outputs
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/shawley/data2vec-pytorch/audio/encoder.py", line 35, in forward
    outputs = self.encoder(inputs, mask_time_indices=mask, output_hidden_states=True,
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1357, in forward
    hidden_states = self._mask_hidden_states(
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1297, in _mask_hidden_states
    hidden_states[mask_time_indices] = self.masked_spec_embed.to(hidden_states.dtype)
RuntimeError: CUDA error: device-side assert triggered

Do you have a suggestion about getting LibriSpeech working?

Thanks again,
Scott

Hello Scott, thanks for reaching out, and sorry for the delayed response.
Alright, it seems that TIMIT can no longer be freely downloaded from the HuggingFace Hub and has to be downloaded manually (with payment!).
Unfortunately I couldn't test this code with LibriSpeech, as it was too large for my machine and network bandwidth! But your problem looks like an incompatibility between the number of features in the output of your model and the number of features passed to the loss function. Can you put a breakpoint in trainer.py's train_step(), right before the loss is calculated, and check the shapes of x and y at line 60?
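Something like this (just a sketch; the exact line numbers may differ in your copy):

# audio/trainer.py, inside train_step(), right before the loss:
x, y = self.model(src, src, mask)
print("x:", tuple(x.shape), "y:", tuple(y.shape))  # these two should agree

Running with the environment variable CUDA_LAUNCH_BLOCKING=1 (or on CPU) should also turn the device-side assert into a regular Python traceback pointing at the exact failing op. Since the assert fires in _mask_hidden_states, one thing to check is whether the mask was built for the raw waveform length rather than for the downsampled feature length the encoder actually produces.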

@drscotthawley
Hello Scott, I hope you're doing well.
Just reaching out to ask whether your issue has been resolved.