ratishsp/data2text-plan-py

Run into TypeError while trying to train the model

AidenKitamura opened this issue · 3 comments

Hello! I was following all the steps in the README file, and while running the model training step I ran into a TypeError. Here are the details:

python version: 2.7.16
torch version: 10.1
torchtext version: 0.2.3
CUDA version: 10.1

When running this:
python train.py -data $BASE/preprocess/roto -save_model $BASE/gen_model/$IDENTIFIER/roto -encoder_type1 mean -decoder_type1 pointer -enc_layers1 1 -dec_layers1 1 -encoder_type2 brnn -decoder_type2 rnn -enc_layers2 2 -dec_layers2 2 -batch_size 5 -feat_merge mlp -feat_vec_size 600 -word_vec_size 600 -rnn_size 600 -seed 1234 -start_checkpoint_at 4 -epochs 25 -optim adagrad -learning_rate 0.15 -adagrad_accumulator_init 0.1 -report_every 100 -copy_attn -truncated_decoder 100 -gpuid $GPUID -attn_hidden 64 -reuse_copy_attn -start_decay_at 4 -learning_rate_decay 0.97 -valid_batch_size 5

I got the following message:
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
/home/aiden/anaconda/anaconda2/lib/python2.7/site-packages/torch/nn/modules/rnn.py:54: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.3 and num_layers=1
"num_layers={}".format(dropout, num_layers))
/home/aiden/anaconda/anaconda2/lib/python2.7/site-packages/torch/nn/_reduction.py:46: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
Experiment 22-4.4 using attn_dim of 64
Loading train dataset from ../boxscore-data//preprocess/roto.train.1.pt, number of examples: 3371

  • vocabulary size. source1 = 1164; target1 = 391, source2 = 956; target2 = 9902
  • src feature 0 size = 702
  • src feature 1 size = 39
  • src feature 2 size = 4
    Building model...
    Intializing model parameters.
    Intializing model parameters.
    NMTModel(
    (encoder): MeanEncoder(
    (embeddings): Embeddings(
    (make_embedding): Sequential(
    (emb_luts): Elementwise(
    (0): Embedding(1164, 600, padding_idx=1)
    (1): Embedding(702, 600, padding_idx=1)
    (2): Embedding(39, 600, padding_idx=1)
    (3): Embedding(4, 600, padding_idx=1)
    )
    (mlp): Sequential(
    (0): Linear(in_features=2400, out_features=600, bias=True)
    (1): ReLU()
    )
    )
    )
    (dropout): Dropout(p=0.3)
    (attn): GlobalSelfAttention(
    (transform_in): Sequential(
    (0): Linear(in_features=600, out_features=64, bias=True)
    (1): ELU(alpha=0.1)
    )
    (linear_in): Linear(in_features=64, out_features=64, bias=False)
    (linear_out): Linear(in_features=1200, out_features=600, bias=False)
    (sm): Softmax()
    (tanh): Tanh()
    (dropout): Dropout(p=0.3)
    )
    )
    (decoder): PointerRNNDecoder(
    (embeddings): Embeddings(
    (make_embedding): Sequential(
    (emb_luts): Elementwise(
    (0): Embedding(391, 600, padding_idx=1)
    )
    )
    )
    (dropout): Dropout(p=0.3)
    (rnn): LSTM(600, 600, dropout=0.3)
    (attn): PointerAttention(
    (linear_in): Linear(in_features=600, out_features=600, bias=False)
    (sm): LogSoftmax()
    )
    )
    (generator): Sequential(
    (0): Linear(in_features=600, out_features=391, bias=True)
    (1): LogSoftmax()
    )
    )
    NMTModel(
    (encoder): RNNEncoder(
    (embeddings): Embeddings(
    (make_embedding): Sequential(
    (emb_luts): Elementwise(
    (0): Embedding(956, 600, padding_idx=1)
    (1): Embedding(547, 600, padding_idx=1)
    (2): Embedding(33, 600, padding_idx=1)
    (3): Embedding(4, 600, padding_idx=1)
    )
    (mlp): Sequential(
    (0): Linear(in_features=2400, out_features=600, bias=True)
    (1): ReLU()
    )
    )
    )
    (rnn): LSTM(600, 300, num_layers=2, dropout=0.3, bidirectional=True)
    )
    (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
    (make_embedding): Sequential(
    (emb_luts): Elementwise(
    (0): Embedding(9902, 600, padding_idx=1)
    )
    )
    )
    (dropout): Dropout(p=0.3)
    (rnn): StackedLSTM(
    (dropout): Dropout(p=0.3)
    (layers): ModuleList(
    (0): LSTMCell(1200, 600)
    (1): LSTMCell(600, 600)
    )
    )
    (attn): GlobalAttention(
    (linear_in): Linear(in_features=600, out_features=600, bias=False)
    (linear_out): Linear(in_features=1200, out_features=600, bias=False)
    (sm): Softmax()
    (tanh): Tanh()
    )
    )
    (generator): CopyGenerator(
    (linear): Linear(in_features=600, out_features=9902, bias=True)
    (linear_copy): Linear(in_features=600, out_features=1, bias=True)
    )
    )
  • number of parameters: 7062951
    ('encoder: ', 3348560)
    ('decoder: ', 3714391)
  • number of parameters: 26876703
    ('encoder: ', 6694200)
    ('decoder: ', 20182503)
    Making optimizer for training.
    Making optimizer for training.

Start training...

  • number of epochs: 25, starting from Epoch 1
  • batch size: 5

Loading train dataset from ../boxscore-data//preprocess/roto.train.1.pt, number of examples: 3371
/home/aiden/anaconda/anaconda2/lib/python2.7/site-packages/torch/nn/functional.py:1386: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
Traceback (most recent call last):
File "train.py", line 454, in
main()
File "train.py", line 446, in main
train_model(model1, model2, fields, optim1, optim2, data_type, model_opt)
File "train.py", line 256, in train_model
train_stats, train_stats2 = trainer.train(train_iter, epoch, report_func)
File "/home/aiden/files/harvardnlp/data2text-plan-py/onmt/Trainer.py", line 181, in train
report_stats, total_stats2, report_stats2, normalization)
File "/home/aiden/files/harvardnlp/data2text-plan-py/onmt/Trainer.py", line 345, in _gradient_accumulation
self.model(src, tgt, src_lengths, dec_state)
File "/home/aiden/anaconda/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/home/aiden/files/harvardnlp/data2text-plan-py/onmt/Models.py", line 684, in forward
memory_lengths=lengths)
File "/home/aiden/anaconda/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/home/aiden/files/harvardnlp/data2text-plan-py/onmt/Models.py", line 339, in forward
attns[k] = torch.stack(attns[k])
TypeError: stack(): argument 'tensors' (position 1) must be tuple of Tensors, not Tensor

When I checked Models.py, this is what I see at line 339 onwards:
for k in attns:
    attns[k] = torch.stack(attns[k])

And this from _run_forward_pass():
# Calculate the attention.
p_attn = self.attn(rnn_output.transpose(0, 1).contiguous(), memory_bank.transpose(0, 1), memory_lengths=memory_lengths)
attns["std"] = p_attn

# decoder_outputs = self.dropout(decoder_outputs)
return decoder_final, None, attns
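
For reference, recent PyTorch releases only accept a tuple or list of tensors in torch.stack(), and attns["std"] above already holds a single tensor, which is exactly what the TypeError complains about. A minimal standalone sketch (hypothetical shapes, not the project's code):

import torch

# A single attention tensor standing in for attns["std"] above
# (hypothetical tgt_len x batch x src_len shape).
p_attn = torch.zeros(10, 5, 602)

# Fine: torch.stack() over a list/tuple of tensors.
stacked = torch.stack([p_attn, p_attn])

# On recent PyTorch this raises the reported error, because a bare tensor
# is not a tuple/list of tensors:
try:
    torch.stack(p_attn)
except TypeError as e:
    print(e)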

Is there anything wrong with _run_forward_pass()? Thanks.

I think the issue is with the version of PyTorch. The required PyTorch version is 0.3.1: https://github.com/ratishsp/data2text-plan-py/blob/master/requirements.txt
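
After downgrading, a quick sanity check that the intended build is actually the one being imported (a sketch; only the PyTorch 0.3.1 pin is taken from the linked requirements.txt):

import torch
import torchtext

# The repository requires PyTorch 0.3.1 (see the linked requirements.txt).
print("torch: " + torch.__version__)
# torchtext's version attribute may be absent in some old releases, hence getattr.
print("torchtext: " + getattr(torchtext, "__version__", "unknown"))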

Hi! I just downgraded PyTorch and its dependencies, and now I'm getting these errors:

THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=88 error=30 : unknown error
Traceback (most recent call last):
File "train.py", line 54, in
cuda.set_device(opt.gpuid[0])
File "/home/aiden/anaconda/anaconda2/lib/python2.7/site-packages/torch/cuda/init.py", line 244, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (30) : unknown error at torch/csrc/cuda/Module.cpp:88

while nvidia-smi gives this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.74 Driver Version: 418.74 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:01:00.0 On | N/A |
| 23% 29C P8 17W / 250W | 160MiB / 12188MiB | 9% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 602 G /usr/lib/Xorg 154MiB |
| 0 1285 G compton 3MiB |
+-----------------------------------------------------------------------------+
What could be wrong? Thanks.
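
A standalone sanity check with plain torch.cuda calls, independent of train.py, would look roughly like this sketch (train.py only gets as far as cuda.set_device(opt.gpuid[0]) before failing):

import torch

print("CUDA available: " + str(torch.cuda.is_available()))
print("Device count: " + str(torch.cuda.device_count()))

if torch.cuda.is_available():
    torch.cuda.set_device(0)  # the same call that fails in train.py with gpuid[0]
    print("Device 0: " + torch.cuda.get_device_name(0))
    x = torch.randn(2, 2).cuda()  # forces CUDA context creation on the device
    print(x + x)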

I assume that you have set the gpuid correctly.
Otherwise, it seems to be a CUDA-specific issue. In my setup I have CUDA 8.0; I haven't tried with CUDA 10.1.
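
If it helps, the CUDA runtime a given PyTorch wheel was built against can be read from the package itself and compared with what nvidia-smi reports for the driver; a small sketch (the attribute may be missing on very old builds):

import torch

# nvidia-smi's "CUDA Version: 10.1" describes the driver; the installed wheel may have
# been built against an older runtime (e.g. CUDA 8.0, as in the setup above).
print("torch: " + torch.__version__)
print("built against CUDA: " + str(getattr(torch.version, "cuda", "unknown")))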