About _epoch_train and _epoch_val
fireholder opened this issue · 21 comments
When I was training, I hit a problem: progress came to a standstill. I found that the functions _epoch_train and _epoch_val were stopping it by raising NotImplementedError. I wonder why, and how to fix it.
Hi, I am trying to run trainer.py, but I don't know what to pass for the argument "--load_model_path"; there is nothing in the current folder. I'm not sure what kind of pretrained model needs to be loaded here. Any advice?
I think '--load_model_path' is only used when 'pretrained' is set, but log.txt shows errors even when no model files are being loaded.
Exactly, I got something like this in the logs.txt file:
Vocab Size:1173
[Load Model Failed] [Errno 2] No such file or directory: ''
[Load Model Failed] [Errno 21] Is a directory: '.'
[Load MLC Failed [Errno 21] Is a directory: '.'!]
[Load Co-attention Failed [Errno 21] Is a directory: '.'!]
[Load Sentence model Failed [Errno 21] Is a directory: '.'!]
[Load Word model Failed [Errno 21] Is a directory: '.'!]
Namespace(attention_version='v4', batch_size=16, caption_json='./data/new_data/.......
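Those "[Load Model Failed]" lines come from passing an empty string or a directory where a checkpoint file is expected. A guard like this hypothetical helper (names assumed, not the repo's actual code) would skip loading instead of logging those OSErrors:

```python
import os

def maybe_load_checkpoint(path):
    # An empty string raises Errno 2 and a directory raises Errno 21,
    # exactly as in the log above; only load when path is a real file.
    if not path or not os.path.isfile(path):
        return None
    import torch  # torch is only needed when a checkpoint is actually loaded
    return torch.load(path, map_location="cpu")
```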
I thought the program just stopped there because of the error message.
So can I just ignore the message and keep training?
Are there other places that need to be modified?
I find that it hasn't stopped; it's just not printing.
Yeah, I left it running all night, but I found val_loss is always 0 in logs.txt. There must be something wrong that needs to be fixed.
That's because in '_epoch_val' all the val losses are set to 0; you can try uncommenting the code in '_epoch_val'. But I find my train loss is very large, is it the same for you? By the way, have you tried the tester?
Yes, extremely large train loss. Haven't tried the tester yet
I have tried tester.py; it's not working. Somewhere the tensors need to be converted with tensor.cpu(). Have you run tester.py completely?
Yes, just convert to tensor.cpu() as the error suggests.
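For anyone hitting the same thing, the usual fix is a tiny helper like this (a sketch, not the repo's actual code) that detaches a tensor and moves it to the CPU before calling .numpy():

```python
import torch

def to_numpy(t: torch.Tensor):
    # .numpy() raises on CUDA tensors and on tensors that require grad,
    # so detach and move to CPU first.
    return t.detach().cpu().numpy()
```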
not yet
When I run `python tester.py` I get:
FileNotFoundError: [Errno 2] No such file or directory: './data/new_data/debug_vocab.pkl'
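Errors like this one are easiest to catch up front; a small pre-flight check (a sketch, with the paths taken from the logs in this thread as an example) reports every missing input file at once instead of failing on the first:

```python
import os

def check_required_files(paths):
    # Return the subset of paths that do not exist as files,
    # so all missing inputs are reported before training/testing starts.
    return [p for p in paths if not os.path.isfile(p)]
```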
Did you guys meet a problem like this:
WARNING:tensorflow:From /content/drive/Shared drives/shared drive-zma/ACL18/utils/logger.py:15: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.
Traceback (most recent call last):
File "/content/drive/Shared drives/shared drive-zma/ACL18/trainer.py", line 662, in
debugger.train()
File "/content/drive/Shared drives/shared drive-zma/ACL18/trainer.py", line 60, in train
train_tag_loss, train_stop_loss, train_word_loss, train_loss = self._epoch_train() #???
File "/content/drive/Shared drives/shared drive-zma/ACL18/trainer.py", line 402, in _epoch_train
batch_tag_loss = self.mse_criterion(tags, self._to_var(label, requires_grad=False)).sum() # ???
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in call
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/loss.py", line 431, in forward
return F.mse_loss(input, target, reduction=self.reduction)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 2203, in mse_loss
expanded_input, expanded_target = torch.broadcast_tensors(input, target)
File "/usr/local/lib/python3.6/dist-packages/torch/functional.py", line 52, in broadcast_tensors
return torch._C._VariableFunctions.broadcast_tensors(tensors)
RuntimeError: The size of tensor a (210) must match the size of tensor b (0) at non-singleton dimension 1
"
it's really make me confused, anyone could do me a favor? Thx!
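For what it's worth, the RuntimeError in that traceback says one of the two tensors passed to MSELoss is empty along dimension 1 (210 vs 0), which usually means the label batch was built empty. A minimal reproduction with assumed shapes:

```python
import torch
import torch.nn.functional as F

# Assumed shapes: a batch of 16 samples with 210 tag scores,
# versus a label tensor that is empty along dim 1 (tensor b (0) in the error).
pred = torch.zeros(16, 210)
label = torch.zeros(16, 0)
try:
    F.mse_loss(pred, label)
except RuntimeError as e:
    print("broadcast failed:", e)
```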
However, my test results are all the same. All my predicted captions are identical.
Hi @fireholder! Did you eventually give up trying to solve this issue? Were all the predicted captions always identical?
My train loss is also very large, and all my predicted captions are the same: "No acute cardiopulmonary abnormality". Could anyone do me a favor? Thanks! Is it because of Python 2 vs Python 3? I used Python 3.
Hi, were you able to decrease the loss? I am also facing the same issue.
I have the same caption too. Did you find the reason?
I am also facing the same issue. Were you able to solve it?
I guess the train loss is large because the author uses MSELoss for predicting tags. With 156 different tags, a single squared error can reach roughly (156-0)^2 = 24336, which is why the loss is so big.
You can change it to L1Loss or decrease the lambda argument for the tag loss (if you find that reasonable).
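To illustrate the quadratic-vs-linear growth (numbers here are assumed, not the project's actual tag values): squared error blows up with the target magnitude, while L1 grows only linearly, so MSELoss over tag scores can easily dominate the total loss.

```python
import torch
import torch.nn.functional as F

pred = torch.zeros(10)
target = torch.full((10,), 12.0)
mse = F.mse_loss(pred, target, reduction='sum')  # sum of (0 - 12)^2 = 1440
l1 = F.l1_loss(pred, target, reduction='sum')    # sum of |0 - 12|   = 120
print(float(mse), float(l1))
```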
In the debugger.py and tester.py files of this project, I'm facing an error at the third-to-last line of the following section of code.
```python
tag_loss += self.args.lambda_tag * batch_tag_loss.data
stop_loss += self.args.lambda_stop * batch_stop_loss.data
word_loss += self.args.lambda_word * batch_word_loss.data
loss += batch_loss.data
return tag_loss, stop_loss, word_loss, loss
```
Error is:
```
File "D:/Hareem/Auto_report/debugger.py", line 61, in train
    train_tag_loss, train_stop_loss, train_word_loss, train_loss = self._epoch_train()
File "D:/Hareem/Auto_report/debugger.py", line 424, in _epoch_train
    word_loss += self.args.lambda_word * batch_word_loss.data
AttributeError: 'int' object has no attribute 'data'
```
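That AttributeError means batch_word_loss is a plain Python int (0) rather than a tensor, likely because no word loss was accumulated for that batch, so .data does not exist. A defensive accumulation helper (a sketch with assumed names, not the repo's code) sidesteps it:

```python
def accumulate(total, batch_loss, weight):
    # batch_loss may be a torch tensor (with a .data attribute) or a plain
    # number (0) when nothing was accumulated this batch; handle both.
    value = batch_loss.data if hasattr(batch_loss, "data") else batch_loss
    return total + weight * value
```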
Is there anybody who has solved the problem of all predicted captions being the same?