r9y9/deepvoice3_pytorch

What is the latest known PyTorch version that can train for speaker adaptation on Windows 10?

RaghothamRao opened this issue · 1 comment

Hi,
Just wanted to give some background before raising this issue.
Background:

  1. Windows 10 machine with GeForce GTX 1060 (6GB) GPU
  2. Each time, I created a fresh Python 3.6 conda environment and tried a different combination of PyTorch and cudatoolkit installations to see whether the code on a particular git commit or on master worked.
  3. I initially failed to train with PyTorch 1.4 (with cudatoolkit 9) when adapting a speaker from the trained LJSpeech model. [Code used was from git commit "abf0a21f83aeb451b918f867bc23378f1e2e608b".]
  4. Later, I learned from issue #173 that PyTorch 1.1 with CUDA 10 works. I tried it with CUDA 9 and was able to run training for the first time on a few of my custom voice samples. On subsequent runs, however, I kept getting "RuntimeError: CUDA error: unknown error", and rebooting several times did not help. I finally fixed that by adding "torch.cuda.current_device()" right after "import torch" in train.py, as per pytorch/pytorch#21114. That error went away, but I then hit one error or another (highlighted below) and have had no luck since with any of the following PyTorch/cudatoolkit combinations.
  5. I could not try PyTorch 1.3, as it does not seem to be available on https://pytorch.org/ under either the current or the previous versions.
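
For reference, the workaround from point 4 is just a two-line change at the very top of train.py. Sketched here with a guard so the snippet also runs on machines without torch installed (the guard is my addition, not part of the actual fix):

```python
# Sketch of the workaround from pytorch/pytorch#21114: eagerly initialize
# the CUDA context right after importing torch, at the top of train.py.
import importlib.util

cuda_context_initialized = False
if importlib.util.find_spec("torch") is not None:  # guard: torch may be absent here
    import torch
    if torch.cuda.is_available():
        torch.cuda.current_device()  # forces CUDA context creation up front
        cuda_context_initialized = True
```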

Few pytorch version-cudatoolkit combinations and errors:

  1. pytorch 1.4 & cuda 9.2 (using code on git commit)
    File "train.py", line 983, in <module>
    train_seq2seq=train_seq2seq, train_postnet=train_postnet)
    File "train.py", line 589, in train
    in tqdm(enumerate(data_loader)):
    File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\tqdm\std.py", line 1107, in __iter__
    for obj in iterable:
    File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\dataloader.py", line 345, in __next__
    data = self._next_data()
    File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\dataloader.py", line 856, in _next_data
    return self._process_data(data)
    File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\dataloader.py", line 881, in _process_data
    data.reraise()
    File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\_utils.py", line 394, in reraise
    raise self.exc_type(msg)
    RuntimeError: Caught RuntimeError in pin memory thread for device 0.
    Original Traceback (most recent call last):
    File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line 31, in _pin_memory_loop
    data = pin_memory(data)
    File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line 55, in pin_memory
    return [pin_memory(sample) for sample in data]
    File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line 55, in <listcomp>
    return [pin_memory(sample) for sample in data]
    File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line 55, in pin_memory
    return [pin_memory(sample) for sample in data]
    File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line 55, in <listcomp>
    return [pin_memory(sample) for sample in data]
    File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line 47, in pin_memory
    return data.pin_memory()
    RuntimeError: unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.

  2. pytorch 1.4 & cuda 9.2 (using code on master branch)
    File "train.py", line 1017, in <module>
    train_seq2seq=train_seq2seq, train_postnet=train_postnet)
    File "train.py", line 723, in train
    priority_w=hparams.priority_freq_weight)
    File "train.py", line 557, in spec_loss
    l1_loss = w * masked_l1(y_hat, y, mask=mask) + (1 - w) * l1(y_hat, y)
    File "C:\ProgramData\Anaconda3\envs\DV3_pip\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
    File "train.py", line 290, in forward
    loss = self.criterion(input * mask_, target * mask_)
    RuntimeError: The size of tensor a (513) must match the size of tensor b (1025) at non-singleton dimension 2

  3. With pytorch 1.1.0 & torchvision 0.3.0 and cudatoolkit 9 (on master as well as the particular git commit)
    RuntimeError: The size of tensor a (513) must match the size of tensor b (1025) at non-singleton dimension 2

  4. With pytorch==1.2.0, torchvision==0.4.0 cudatoolkit=10.0 (with code on git commit)
    RuntimeError: reduce failed to synchronize: device-side assert triggered

  5. With pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch (with master)
    RuntimeError: Expected object of scalar type Float but got scalar type Long for argument #2 'other'

  6. conda install pytorch==1.0.0 torchvision==0.2.1 cuda80 -c pytorch (on git commit)
    RuntimeError: The size of tensor a (513) must match the size of tensor b (1025) at non-singleton dimension 2

Could someone kindly advise which PyTorch/cudatoolkit combination this code works with, using the LJSpeech pre-trained model?

An update:
I tried on Ubuntu 16.04.6 LTS (GNU/Linux 4.4.0-1095-aws x86_64) as well, with pytorch 1.1.0, torchvision 0.3.0 and cudatoolkit 10. Still no luck training the LJSpeech pretrained model for speaker adaptation.

Traceback (most recent call last):
File "train.py", line 984, in <module>
train_seq2seq=train_seq2seq, train_postnet=train_postnet)
File "train.py", line 689, in train
priority_w=hparams.priority_freq_weight)
File "train.py", line 523, in spec_loss
l1_loss = w * masked_l1(y_hat, y, mask=mask) + (1 - w) * l1(y_hat, y)
File "/home/ubuntu/anaconda3/envs/pytorch1_1_cuda10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "train.py", line 292, in forward
loss = self.criterion(input * mask_, target * mask_)
RuntimeError: The size of tensor a (513) must match the size of tensor b (1025) at non-singleton dimension 2
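
A side note on the recurring error (my guess, not confirmed against the repo): a linear spectrogram has fft_size // 2 + 1 frequency bins, so 513 vs. 1025 is exactly what you would see if the local hparams use fft_size=1024 while the LJSpeech checkpoint was trained with fft_size=2048. A minimal sketch of that arithmetic:

```python
# Sketch (assumption, not from the repo): relate the mismatched tensor
# sizes 513 and 1025 to the FFT size used for the linear spectrogram.
def num_freq_bins(fft_size):
    # A real-valued FFT of length N yields N // 2 + 1 frequency bins.
    return fft_size // 2 + 1

print(num_freq_bins(1024))  # 513  -> size of "tensor a" (local hparams?)
print(num_freq_bins(2048))  # 1025 -> size of "tensor b" (checkpoint?)
```

If that is the cause, aligning the fft_size hyperparameter (and anything derived from it) with the values the checkpoint was trained with should resolve the size mismatch.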