YoungXiyuan/DCA

Error after 55th epoch while saving the model

SravyaMadupu opened this issue · 10 comments

The model runs fine for 55 epochs, but when it is about to be saved after the 55th epoch it throws the error below. Any idea about this error?

epoch 54 total loss 0.2056220420028012 0.00021576289821909886
aida-A micro F1: 0.9303830497860348
aida-B micro F1: 0.9400222965440357
msnbc micro F1: 0.9426166794185157
aquaint micro F1: 0.8755244755244754
ace2004 micro F1: 0.8933601609657947
clueweb micro F1: 0.742094861660079
wikipedia micro F1: 0.7821906663708305
change learning rate to 0.0001
att_mat_diag
tok_score_mat_diag
entity2entity_mat_diag
entity2entity_score_mat_diag
knowledge2entity_mat_diag
knowledge2entity_score_mat_diag
type_emb
cnn.weight
cnn.bias
score_combine.0.weight
score_combine.0.bias
score_combine.3.weight
score_combine.3.bias
save model to model
Traceback (most recent call last):
  File "main.py", line 226, in <module>
    ranker.train(conll.train, dev_datasets, config)
  File "/content/drive/My Drive/data.tar.gz (Unzipped Files)/DCA/ed_ranker.py", line 1032, in train
    self.model.save(self.args.model_path)
  File "/content/drive/My Drive/data.tar.gz (Unzipped Files)/DCA/abstract_word_entity.py", line 78, in save
    json.dump(config, f)
  File "/usr/lib/python3.6/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/usr/lib/python3.6/json/encoder.py", line 430, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.6/json/encoder.py", line 404, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.6/json/encoder.py", line 437, in _iterencode
    o = _default(o)
  File "/usr/lib/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'set' is not JSON serializable

This is the first time I have encountered this kind of problem; it has never been reported by other users.

Did that error occur the first time the model was saved? Maybe it occurred due to an unstable platform environment.

According to the error log "File '/content/drive/My Drive/data.tar.gz (Unzipped Files)/DCA/abstract_word_entity.py', line 78, in save json.dump(config, f)", I guess you may need to place the DCA project in a normal directory instead of a .tar.gz file. Give it a try. (-:

That is just the folder name; it is a normal directory. I unzipped the archive to Google Drive and the name was auto-generated. :-| Maybe the error is due to the changes I made, then.

I have made some changes to the code to run it on CPU. Using the GPU, each epoch takes a lot of time. Strangely, each epoch runs in about 4 minutes on CPU, while on GPU each epoch takes more than 45 minutes. Is this behavior normal?

I apologize; I have never encountered that strange problem before.

First, I would like to check which training method you chose: Supervised Learning or Reinforcement Learning?

Second, could you provide some hardware information about your platform? We trained our framework on one GeForce GTX 1080 card with 8 GB memory and two Intel(R) Xeon(R) CPUs (E5-2683 v3 @ 2.00GHz), with a large amount of RAM (about 384 GB) and SSD storage (about 3.1 TB).

My impression is that the DCA framework should run faster on GPU than on CPU; under the Supervised Learning setting, each epoch should take significantly less than 45 minutes on GPU and noticeably more than 4 minutes on CPU.

Maybe you could keep the revised code running and then check whether the final results are normal or not.

I am trying to run supervised learning using the arguments: --mode train --order offset --model_path model --method SL

I am running the code in Google Colab; the configuration is 12 GB of RAM and 64 GB of disk space.

Supervised learning on GPU has been running for more than an hour now, and the first epoch is still not completed.

I am using the revised code and am still not able to run it successfully. Also, could you please give me a rough estimate of how long the code would take to run on the above configuration?

I think your basic hardware environment should be sufficient for DCA training, since the DCA framework is not resource-intensive.

And do you know the memory size of your GPU card?

Honestly speaking, I am a little confused about the current situation you are facing.

On the one hand, the title of this issue is "Error after 55th epoch while saving the model", which means that you have trained the DCA framework for at least 54 epochs. On the other hand, you mentioned that "Supervised learning on GPU has been running for more than an hour now, and the first epoch is still not completed".

So it seems that you have been running the code for more than two days? But you opened the issue "Unable to run on google colab" 11 hours ago.

So I would like to ask: how do you know that the running code is still in the first epoch? From a log file, printed text on the screen, or the output csv file?

Thanks.

Sorry for all the confusion.

Let me explain my situation in more detail. I am using Google Colab to run the code on both CPU and GPU. I opened the previous issue when I was trying to use the code as-is, without any changes; I was using the GPU initially and got a CUDA out-of-memory error.

Then I changed the code so that all computations are done on CPU instead of GPU. When running the code on CPU, I encountered the error at epoch 55 while saving the model; it ran for 6 hours and then stopped with the error above.
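For reference, my CPU change boiled down to forcing every tensor onto the CPU device, roughly like this (an illustrative sketch with a stand-in module, not my exact edit to the DCA code):

```python
import torch
import torch.nn as nn

# Force everything onto the CPU; the real code uses the DCA ranker model, not this stand-in.
device = torch.device("cpu")
model = nn.Linear(10, 2).to(device)
batch = torch.randn(4, 10, device=device)   # inputs must live on the same device as the model
out = model(batch)
```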

Then I tried to run the original code again on GPU and, magically, it started running without any memory issue. However, each epoch takes a lot of time when running on the GPU.

I started the current session 2 hours ago, and the log is as follows:

load conll at ../data/generated/test_train_data
load csv
370United News of India
process coref
load conll
reorder mentions within the dataset
create model
tcmalloc: large alloc 1181786112 bytes == 0xc900000 @ 0x7f30d5b021e7 0x7f30cfec45e1 0x7f30cff2d90d 0x7f30cff2e522 0x7f30cffc5bce 0x50a7f5 0x50cfd6 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 0x507f24 0x50b053 0x634dd2 0x634e87 0x63863f 0x6391e1 0x4b0dc0 0x7f30d56ffb97 0x5b26fa
--- create EDRanker model ---
prerank model
--- create NTEE model ---
--- create AbstractWordEntity model ---
main model
create new model
--- create MulRelRanker model ---
--- create LocalCtxAttRanker model ---
--- create AbstractWordEntity model ---
training...
extracting training data
108
222
288
112
182
211
105
105
recall 1.0
#train docs 953
277
recall 0.9772490085577124
aida-A #dev docs 218
108
114
recall 0.9866220735785953
aida-B #dev docs 232
recall 0.9847560975609756
msnbc #dev docs 20
recall 0.9408528198074277
aquaint #dev docs 50
recall 0.914396887159533
ace2004 #dev docs 35
recall 0.9190424959655729
clueweb #dev docs 320
recall 0.93214074512123
wikipedia #dev docs 318
creating optimizer
att_mat_diag
tok_score_mat_diag
entity2entity_mat_diag
entity2entity_score_mat_diag
knowledge2entity_mat_diag
knowledge2entity_score_mat_diag
type_emb
cnn.weight
cnn.bias
score_combine.0.weight
score_combine.0.bias
score_combine.3.weight
score_combine.3.bias
tensor([274474], device='cuda:0')

After this, the cell has been executing for the past 2 hours and memory usage keeps changing frequently. I still don't see the epoch 0 results.

I am very much confused about what is going on in this case. :-(

Thank you for your quick reply.

First, judging from the log you provided, I am sure the DCA framework is not running successfully on the GPU card, because loss information should be printed after each mini-batch is processed. I think you will find that printed loss information when you run the DCA framework on CPU.

Then my guess is that the phenomenon you observed, memory usage changing frequently, could be due to an unsuitable Python environment. Please check that you have installed the PyTorch GPU build rather than the CPU-only build.
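A quick sanity check in the Colab notebook could be something like the following (the printed values are only examples):

```python
import torch

print(torch.__version__)          # a "+cpu" suffix here would indicate a CPU-only build
print(torch.cuda.is_available())  # should print True on a GPU runtime
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```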

As for the above error when running on CPU, I have to say it looks very strange. The error message says "TypeError: Object of type 'set' is not JSON serializable", but after checking the relevant code segment in "abstract_word_entity.py" (lines 69 ~ 78), I find that the "config" variable is indeed a dict rather than a set.
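Note that json can serialize the dict itself but not a set stored as one of its values, so the error could come from a set hiding inside "config". If that is the case, one possible workaround is to pass a default handler to json.dump; a minimal sketch with a hypothetical config dict:

```python
import json

config = {"hid_dims": 100, "stopwords": {"a", "the"}}  # hypothetical config with a set value

def jsonable(obj):
    # Fall back to a sorted list for any set the encoder cannot handle natively.
    if isinstance(obj, set):
        return sorted(obj)
    raise TypeError("Object of type '%s' is not JSON serializable" % type(obj).__name__)

with open("config.json", "w") as f:
    json.dump(config, f, default=jsonable)
```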

I guess that error occurred due to an unstable system environment, and I suggest you try again and observe what happens.

Thank you so much for all the help. I will try again on another system and get back to you. Just curious: how much time does it take to run the code on your configuration?

Sorry for my late reply.

As I recall, we spent about half a day training the DCA framework (Supervised Learning) for about 150 epochs, using the hardware configuration described above and the default framework parameters.

I was not able to replicate the results even after all the changes. I saved the state_dict file after the 55th epoch, loaded it back into the model with a changed learning rate, and made some changes to the json.dump call. Now it is working. Thank you so much for all the help. I have also read the previous issues; you are really helpful and very quick to reply. I appreciate your effort. :-D
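Roughly, the resume step looked something like the following (the module and file names are placeholders standing in for the actual DCA ranker and checkpoint path, not my exact code):

```python
import torch
import torch.nn as nn

# Stand-in module; in practice this is the DCA ranker and the path points at the epoch-55 checkpoint.
model = nn.Linear(10, 2)
torch.save(model.state_dict(), "epoch55.state_dict")          # simulate the saved checkpoint

model.load_state_dict(torch.load("epoch55.state_dict", map_location="cpu"))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)     # continue with the lowered learning rate
```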