What is the latest known PyTorch version for training speaker adaptation on Windows 10?
RaghothamRao opened this issue · 1 comment
Hi,
Just wanted to give some background before raising this issue.
Background:
- Windows 10 machine with GeForce GTX 1060 (6GB) GPU
- Each time, I created a fresh Python 3.6 conda environment and tried a different combination of PyTorch and cudatoolkit installations to see whether the code at a particular git commit or on master worked.
- I initially failed to train with PyTorch 1.4 (with cudatoolkit 9) when adapting to a speaker from the trained LJSpeech model. [The code used was from git commit "abf0a21f83aeb451b918f867bc23378f1e2e608b".]
- Later, I learned from issue #173 that PyTorch 1.1 with CUDA 10 works. However, I tried it with cudatoolkit 9 and was able to run training for the first time on a few of my custom voice samples. On subsequent runs I kept getting "RuntimeError: CUDA error: unknown error" and tried rebooting several times. To fix it, I finally added "torch.cuda.current_device()" right after "import torch" in train.py, as suggested in pytorch/pytorch#21114 (a minimal sketch of that change is shown just after this list). That error went away, but I then hit one error or another (highlighted below) and have had no luck since with any of the PyTorch/cudatoolkit combinations listed below.
- I could not try PyTorch 1.3, as it does not appear to be available on https://pytorch.org/ under either the current or previous versions.
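For reference, the workaround from pytorch/pytorch#21114 mentioned above is just one extra line near the top of train.py (a minimal sketch; the call only forces early CUDA context initialization and changes nothing else):

import torch

# Workaround for "RuntimeError: CUDA error: unknown error" on Windows
# (pytorch/pytorch#21114): touching the device right after the import
# forces the CUDA context to initialize before anything else runs.
torch.cuda.current_device()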
A few PyTorch/cudatoolkit combinations and the errors they produced:
- pytorch 1.4 & cuda 9.2 (using code on the git commit):
File "train.py", line 983, in
train_seq2seq=train_seq2seq, train_postnet=train_postnet)
File "train.py", line 589, in train
in tqdm(enumerate(data_loader)):
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\tqdm\std.py", line 1107, in iter
for obj in iterable:
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\dataloader.py", line 345, in next
data = self._next_data()
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\dataloader.py", line 856, in _next_data
return self._process_data(data)
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\dataloader.py", line 881, in _process_data
data.reraise()
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch_utils.py", line 394, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data_utils\pin_memory.py", line 31, in _pin_memory_loop
data = pin_memory(data)
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data_utils\pin_memory.py", line 55, in pin_memory
return [pin_memory(sample) for sample in data]
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data_utils\pin_memory.py", line 55, in
return [pin_memory(sample) for sample in data]
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data_utils\pin_memory.py", line 55, in pin_memory
return [pin_memory(sample) for sample in data]
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data_utils\pin_memory.py", line 55, in
return [pin_memory(sample) for sample in data]
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data_utils\pin_memory.py", line 47, in pin_memory
return data.pin_memory()
RuntimeError: unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.
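For what it's worth, this error comes from the pin-memory thread; as far as I understand it, it means some tensor in the batch has overlapping memory (e.g. produced via expand()). A generic workaround sketch, not specific to this repo's actual collate function, is to clone tensors before they reach the loader, or to disable pinning altogether:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical collate function: cloning gives every element its own storage,
# so the pin-memory thread never sees a tensor with overlapping memory.
def safe_collate(batch):
    columns = list(zip(*batch))
    return [torch.stack([t.clone() for t in col]) for col in columns]

dataset = TensorDataset(torch.randn(8, 4), torch.randn(8, 4))
# Alternatively, pin_memory=False skips the pinning step entirely.
loader = DataLoader(dataset, batch_size=2, collate_fn=safe_collate, pin_memory=False)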
- pytorch 1.4 & cuda 9.2 (using code on the master branch):
File "train.py", line 1017, in
train_seq2seq=train_seq2seq, train_postnet=train_postnet)
File "train.py", line 723, in train
priority_w=hparams.priority_freq_weight)
File "train.py", line 557, in spec_loss
l1_loss = w * masked_l1(y_hat, y, mask=mask) + (1 - w) * l1(y_hat, y)
File "C:\ProgramData\Anaconda3\envs\DV3_pip\lib\site-packages\torch\nn\modules\module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "train.py", line 290, in forward
loss = self.criterion(input * mask_, target * mask_)
RuntimeError: The size of tensor a (513) must match the size of tensor b (1025) at non-singleton dimension 2
- With pytorch 1.1.0 & torchvision 0.3.0 and cudatoolkit 9 (with master as well as the particular git commit):
RuntimeError: The size of tensor a (513) must match the size of tensor b (1025) at non-singleton dimension 2
- With pytorch==1.2.0, torchvision==0.4.0, cudatoolkit=10.0 (with code on the git commit; see the debugging note after this list):
RuntimeError: reduce failed to synchronize: device-side assert triggered
- With pytorch==1.2.0, torchvision==0.4.0, cudatoolkit=10.0 -c pytorch (with master):
RuntimeError: Expected object of scalar type Float but got scalar type Long for argument #2 'other'
- conda install pytorch==1.0.0 torchvision==0.2.1 cuda80 -c pytorch (on the git commit):
RuntimeError: The size of tensor a (513) must match the size of tensor b (1025) at non-singleton dimension 2
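Regarding the "device-side assert triggered" error above: a generic CUDA debugging step (not a fix, and not specific to this repo) is to force synchronous kernel launches so the traceback points at the actual failing operation rather than a later synchronization point:

import os
# Must be set before CUDA is initialized; with synchronous launches the
# device-side assert is reported at the offending call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
import torch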
Could someone kindly advise on the PyTorch/cudatoolkit combination that this code, together with the LJSpeech pre-trained model, is known to work with?
An update:
Tried on Ubuntu 16.04.6 LTS (GNU/Linux 4.4.0-1095-aws x86_64) as well, with pytorch 1.1.0 & torchvision 0.3.0 and cudatoolkit 10. No luck yet training the LJSpeech pre-trained model for speaker adaptation.
Traceback (most recent call last):
File "train.py", line 984, in
train_seq2seq=train_seq2seq, train_postnet=train_postnet)
File "train.py", line 689, in train
priority_w=hparams.priority_freq_weight)
File "train.py", line 523, in spec_loss
l1_loss = w * masked_l1(y_hat, y, mask=mask) + (1 - w) * l1(y_hat, y)
File "/home/ubuntu/anaconda3/envs/pytorch1_1_cuda10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "train.py", line 292, in forward
loss = self.criterion(input * mask_, target * mask_)
RuntimeError: The size of tensor a (513) must match the size of tensor b (1025) at non-singleton dimension 2
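One observation on the recurring size mismatch, in case it helps: 513 and 1025 are exactly fft_size // 2 + 1 for FFT sizes of 1024 and 2048, so my guess is that the hparams used for preprocessing/training do not match the ones the LJSpeech checkpoint was trained with (fft_size is my assumption for the relevant hparams field; please correct me if it is something else):

# The two bin counts in the error are just fft_size // 2 + 1:
for fft_size in (1024, 2048):
    print(fft_size, "->", fft_size // 2 + 1)  # 1024 -> 513, 2048 -> 1025
# If that is the cause, aligning the preprocessing/training hparams with the
# checkpoint's (assumed field: fft_size) should make the spectrogram dims agree.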