About _epoch_train and _epoch_val
fireholder opened this issue · 21 comments
When I was training, I hit a problem: progress came to a standstill. I found that the functions _epoch_train and _epoch_val were stopping it by raising NotImplementedError. I wonder why, and how to fix it.
Hi, I am trying to run trainer.py, but I don't know what to pass for the argument "--load_model_path"; there is nothing in the current folder. I'm not sure what kind of pretrained model needs to be loaded here. Any advice?
I think '--load_model_path' is only used when 'pretrained' is set, but log.txt shows errors even when no model files are being loaded.
Exactly, I got something like this in the logs.txt file:
Vocab Size:1173
[Load Model Failed] [Errno 2] No such file or directory: ''
[Load Model Failed] [Errno 21] Is a directory: '.'
[Load MLC Failed [Errno 21] Is a directory: '.'!]
[Load Co-attention Failed [Errno 21] Is a directory: '.'!]
[Load Sentence model Failed [Errno 21] Is a directory: '.'!]
[Load Word model Failed [Errno 21] Is a directory: '.'!]
Namespace(attention_version='v4', batch_size=16, caption_json='./data/new_data/.......
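Those "[Load Model Failed]" lines come from passing an empty string or a directory where a checkpoint file is expected. A guard like this hypothetical helper (names assumed, not the repo's actual code) would skip loading instead of logging those OSErrors:

```python
import os

def maybe_load_checkpoint(path):
    # An empty string raises Errno 2 and a directory raises Errno 21,
    # exactly as in the log above; only load when path is a real file.
    if not path or not os.path.isfile(path):
        return None
    import torch  # torch is only needed when a checkpoint is actually loaded
    return torch.load(path, map_location="cpu")
```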
I thought the program just stopped there because of the error message.
So can I just ignore the message and keep training?
Are there other places that need to be modified?
I find that it hasn't stopped; it's just not printing.
Yeah, I left it running all night, but I found val_loss is always 0 in logs.txt. There must be something wrong that needs to be fixed.
That's because in '_epoch_val' all the val losses are set to 0; you can try uncommenting the code in '_epoch_val'. But I find my train loss is very large, is it the same for you? By the way, have you tried the tester?
Yes, extremely large train loss. Haven't tried the tester yet
I have tried tester.py; it's not working. Somewhere the tensors need to be converted with tensor.cpu(). Have you run tester.py completely?
Yes, just convert to tensor.cpu() as the error suggests.
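For anyone hitting the same thing, the usual fix is a tiny helper like this (a sketch, not the repo's actual code) that detaches a tensor and moves it to the CPU before calling .numpy():

```python
import torch

def to_numpy(t: torch.Tensor):
    # .numpy() raises on CUDA tensors and on tensors that require grad,
    # so detach and move to CPU first.
    return t.detach().cpu().numpy()
```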
not yet
When I run `python tester.py` I get:
FileNotFoundError: [Errno 2] No such file or directory: './data/new_data/debug_vocab.pkl'
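Errors like this one are easiest to catch up front; a small pre-flight check (a sketch, with the paths taken from the logs in this thread as an example) reports every missing input file at once instead of failing on the first:

```python
import os

def check_required_files(paths):
    # Return the subset of paths that do not exist as files,
    # so all missing inputs are reported before training/testing starts.
    return [p for p in paths if not os.path.isfile(p)]
```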
Did you guys meet a problem like this:
WARNING:tensorflow:From /content/drive/Shared drives/shared drive-zma/ACL18/utils/logger.py:15: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.
Traceback (most recent call last):
File "/content/drive/Shared drives/shared drive-zma/ACL18/trainer.py", line 662, in
debugger.train()
File "/content/drive/Shared drives/shared drive-zma/ACL18/trainer.py", line 60, in train
train_tag_loss, train_stop_loss, train_word_loss, train_loss = self._epoch_train() #???
File "/content/drive/Shared drives/shared drive-zma/ACL18/trainer.py", line 402, in _epoch_train
batch_tag_loss = self.mse_criterion(tags, self._to_var(label, requires_grad=False)).sum() # ???
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in call
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/loss.py", line 431, in forward
return F.mse_loss(input, target, reduction=self.reduction)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 2203, in mse_loss
expanded_input, expanded_target = torch.broadcast_tensors(input, target)
File "/usr/local/lib/python3.6/dist-packages/torch/functional.py", line 52, in broadcast_tensors
return torch._C._VariableFunctions.broadcast_tensors(tensors)
RuntimeError: The size of tensor a (210) must match the size of tensor b (0) at non-singleton dimension 1
"
it's really make me confused, anyone could do me a favor? Thx!
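For what it's worth, the RuntimeError in that traceback says one of the two tensors passed to MSELoss is empty along dimension 1 (210 vs 0), which usually means the label batch was built empty. A minimal reproduction with assumed shapes:

```python
import torch
import torch.nn.functional as F

# Assumed shapes: a batch of 16 samples with 210 tag scores,
# versus a label tensor that is empty along dim 1 (tensor b (0) in the error).
pred = torch.zeros(16, 210)
label = torch.zeros(16, 0)
try:
    F.mse_loss(pred, label)
except RuntimeError as e:
    print("broadcast failed:", e)
```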
However, my test results are all the same. All my predicted captions are identical.
Hi @fireholder! Did you eventually give up trying to solve this issue? Were all the predicted captions always identical?
My train loss is also very large, and all my predicted captions are the same: "No acute cardiopulmonary abnormality". Could anyone do me a favor? Thanks! Is it because of Python 2 vs Python 3? I used Python 3.
Hi, were you able to decrease the loss? I am also facing the same issue.
I have the same caption too. Did you find the reason?
I am also facing the same issue. Were you able to solve it?
I guess the train loss is large because the author uses MSELoss for predicting tags. With 156 different tags, a single squared error can reach roughly (156-0)^2 = 24336, which is why the loss is so big.
You can change it to L1Loss or decrease the lambda argument for the tag loss (if you find that reasonable).
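To illustrate the quadratic-vs-linear growth (numbers here are assumed, not the project's actual tag values): squared error blows up with the target magnitude, while L1 grows only linearly, so MSELoss over tag scores can easily dominate the total loss.

```python
import torch
import torch.nn.functional as F

pred = torch.zeros(10)
target = torch.full((10,), 12.0)
mse = F.mse_loss(pred, target, reduction='sum')  # sum of (0 - 12)^2 = 1440
l1 = F.l1_loss(pred, target, reduction='sum')    # sum of |0 - 12|   = 120
print(float(mse), float(l1))
```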
In the debugger.py and tester.py files of this project, I'm facing an error at the third-to-last line of the following section of code.
```python
tag_loss += self.args.lambda_tag * batch_tag_loss.data
stop_loss += self.args.lambda_stop * batch_stop_loss.data
word_loss += self.args.lambda_word * batch_word_loss.data
loss += batch_loss.data
return tag_loss, stop_loss, word_loss, loss
```
Error is:
```
File "D:/Hareem/Auto_report/debugger.py", line 61, in train
    train_tag_loss, train_stop_loss, train_word_loss, train_loss = self._epoch_train()
File "D:/Hareem/Auto_report/debugger.py", line 424, in _epoch_train
    word_loss += self.args.lambda_word * batch_word_loss.data
AttributeError: 'int' object has no attribute 'data'
```
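That AttributeError means batch_word_loss is a plain Python int (0) rather than a tensor, likely because no word loss was accumulated for that batch, so .data does not exist. A defensive accumulation helper (a sketch with assumed names, not the repo's code) sidesteps it:

```python
def accumulate(total, batch_loss, weight):
    # batch_loss may be a torch tensor (with a .data attribute) or a plain
    # number (0) when nothing was accumulated this batch; handle both.
    value = batch_loss.data if hasattr(batch_loss, "data") else batch_loss
    return total + weight * value
```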
Is there anybody who has solved the problem of all predicted captions being the same?