What is the latest known PyTorch version for training speaker adaptation on Windows 10?
RaghothamRao opened this issue · 1 comment
Hi,
Just wanted to give some background before raising this issue.
Background:
- Windows 10 machine with GeForce GTX 1060 (6GB) GPU
- Each time, I created a fresh Python 3.6 conda environment and tried a different combination of PyTorch and cudatoolkit installations to see whether the code at a particular git commit or on master worked.
- I initially failed to train with PyTorch 1.4 (with cudatoolkit 9) when adapting to a speaker from the trained LJSpeech model. [The code used was from git commit "abf0a21f83aeb451b918f867bc23378f1e2e608b".]
- Later, I learned from issue #173 that PyTorch 1.1 with CUDA 10 works. However, I tried it with cudatoolkit 9 and was able to run training for the first time on a few of my custom voice samples. On subsequent runs I kept getting "RuntimeError: CUDA error: unknown error" and tried rebooting several times. To fix it, I finally added "torch.cuda.current_device()" right after "import torch" in train.py, as suggested in pytorch/pytorch#21114 (a minimal sketch of that change is shown just after this list). That error went away, but I then hit one error or another (highlighted below) and have had no luck since with any of the PyTorch/cudatoolkit combinations listed below.
- I could not try PyTorch 1.3, as it does not appear to be available on https://pytorch.org/ under either the current or previous versions.
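For reference, the workaround from pytorch/pytorch#21114 mentioned above is just one extra line near the top of train.py (a minimal sketch; the call only forces early CUDA context initialization and changes nothing else):

import torch

# Workaround for "RuntimeError: CUDA error: unknown error" on Windows
# (pytorch/pytorch#21114): touching the device right after the import
# forces the CUDA context to initialize before anything else runs.
torch.cuda.current_device()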
A few PyTorch/cudatoolkit combinations and the errors they produced:
- pytorch 1.4 & cuda 9.2 (using code on the git commit):
File "train.py", line 983, in
train_seq2seq=train_seq2seq, train_postnet=train_postnet)
File "train.py", line 589, in train
in tqdm(enumerate(data_loader)):
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\tqdm\std.py", line 1107, in iter
for obj in iterable:
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\dataloader.py", line 345, in next
data = self._next_data()
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\dataloader.py", line 856, in _next_data
return self._process_data(data)
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\dataloader.py", line 881, in _process_data
data.reraise()
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch_utils.py", line 394, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data_utils\pin_memory.py", line 31, in _pin_memory_loop
data = pin_memory(data)
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data_utils\pin_memory.py", line 55, in pin_memory
return [pin_memory(sample) for sample in data]
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data_utils\pin_memory.py", line 55, in
return [pin_memory(sample) for sample in data]
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data_utils\pin_memory.py", line 55, in pin_memory
return [pin_memory(sample) for sample in data]
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data_utils\pin_memory.py", line 55, in
return [pin_memory(sample) for sample in data]
File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data_utils\pin_memory.py", line 47, in pin_memory
return data.pin_memory()
RuntimeError: unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.
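For what it's worth, this error comes from the pin-memory thread; as far as I understand it, it means some tensor in the batch has overlapping memory (e.g. produced via expand()). A generic workaround sketch, not specific to this repo's actual collate function, is to clone tensors before they reach the loader, or to disable pinning altogether:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical collate function: cloning gives every element its own storage,
# so the pin-memory thread never sees a tensor with overlapping memory.
def safe_collate(batch):
    columns = list(zip(*batch))
    return [torch.stack([t.clone() for t in col]) for col in columns]

dataset = TensorDataset(torch.randn(8, 4), torch.randn(8, 4))
# Alternatively, pin_memory=False skips the pinning step entirely.
loader = DataLoader(dataset, batch_size=2, collate_fn=safe_collate, pin_memory=False)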
- pytorch 1.4 & cuda 9.2 (using code on the master branch):
File "train.py", line 1017, in
train_seq2seq=train_seq2seq, train_postnet=train_postnet)
File "train.py", line 723, in train
priority_w=hparams.priority_freq_weight)
File "train.py", line 557, in spec_loss
l1_loss = w * masked_l1(y_hat, y, mask=mask) + (1 - w) * l1(y_hat, y)
File "C:\ProgramData\Anaconda3\envs\DV3_pip\lib\site-packages\torch\nn\modules\module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "train.py", line 290, in forward
loss = self.criterion(input * mask_, target * mask_)
RuntimeError: The size of tensor a (513) must match the size of tensor b (1025) at non-singleton dimension 2
- With pytorch 1.1.0 & torchvision 0.3.0 and cudatoolkit 9 (with master as well as the particular git commit):
RuntimeError: The size of tensor a (513) must match the size of tensor b (1025) at non-singleton dimension 2
- With pytorch==1.2.0, torchvision==0.4.0, cudatoolkit=10.0 (with code on the git commit; see the debugging note after this list):
RuntimeError: reduce failed to synchronize: device-side assert triggered
- With pytorch==1.2.0, torchvision==0.4.0, cudatoolkit=10.0 -c pytorch (with master):
RuntimeError: Expected object of scalar type Float but got scalar type Long for argument #2 'other'
- conda install pytorch==1.0.0 torchvision==0.2.1 cuda80 -c pytorch (on the git commit):
RuntimeError: The size of tensor a (513) must match the size of tensor b (1025) at non-singleton dimension 2
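Regarding the "device-side assert triggered" error above: a generic CUDA debugging step (not a fix, and not specific to this repo) is to force synchronous kernel launches so the traceback points at the actual failing operation rather than a later synchronization point:

import os
# Must be set before CUDA is initialized; with synchronous launches the
# device-side assert is reported at the offending call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
import torch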
Could someone kindly advise on the PyTorch/cudatoolkit combination that this code, together with the LJSpeech pre-trained model, is known to work with?
An update:
Tried on Ubuntu 16.04.6 LTS (GNU/Linux 4.4.0-1095-aws x86_64) as well, with pytorch 1.1.0 & torchvision 0.3.0 and cudatoolkit 10. No luck yet training the LJSpeech pre-trained model for speaker adaptation.
Traceback (most recent call last):
File "train.py", line 984, in
train_seq2seq=train_seq2seq, train_postnet=train_postnet)
File "train.py", line 689, in train
priority_w=hparams.priority_freq_weight)
File "train.py", line 523, in spec_loss
l1_loss = w * masked_l1(y_hat, y, mask=mask) + (1 - w) * l1(y_hat, y)
File "/home/ubuntu/anaconda3/envs/pytorch1_1_cuda10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "train.py", line 292, in forward
loss = self.criterion(input * mask_, target * mask_)
RuntimeError: The size of tensor a (513) must match the size of tensor b (1025) at non-singleton dimension 2
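One observation on the recurring size mismatch, in case it helps: 513 and 1025 are exactly fft_size // 2 + 1 for FFT sizes of 1024 and 2048, so my guess is that the hparams used for preprocessing/training do not match the ones the LJSpeech checkpoint was trained with (fft_size is my assumption for the relevant hparams field; please correct me if it is something else):

# The two bin counts in the error are just fft_size // 2 + 1:
for fft_size in (1024, 2048):
    print(fft_size, "->", fft_size // 2 + 1)  # 1024 -> 513, 2048 -> 1025
# If that is the cause, aligning the preprocessing/training hparams with the
# checkpoint's (assumed field: fft_size) should make the spectrogram dims agree.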