lavis-nlp/spert

the training stopped when loading weights file http://

victorbai2 opened this issue · 8 comments

Hi @markus-eberts ,

I really appreciate your brilliant work and dedicated effort on information extraction; it is truly inspiring.
Recently, when I tried SpERT in my environment, training stopped at the last log line below without throwing any error:
loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin from cache at /root/.cache/torch/transformers/35d8b9d36faaf46728a0192d82bf7d00137490cd6074e8500778afed552a67e5.3fadbea36527ae472139fe84cddaa65454d7429f12d543d80bfc3ad70de55ac2

Do you think this is because the model could not be downloaded from AWS? As a possible workaround I downloaded the model file bert-base-cased-pytorch_model.bin and replaced self.args.model_path in:

    model = model_class.from_pretrained(
        self.args.model_path,
        ...)

But it did not work out. Do you know what the cause is?
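
For reference, from_pretrained generally expects either a model identifier or the path to a directory containing config.json and pytorch_model.bin (plus the tokenizer's vocab file), not the path to the .bin file alone, which may be why this workaround did not help. A minimal sketch of loading from a local directory (the path below is hypothetical):

    # Hypothetical sketch: load bert-base-cased from a local directory instead
    # of downloading it. The directory must contain config.json, vocab.txt and
    # pytorch_model.bin (e.g. downloaded manually from the model hub).
    from transformers import BertModel, BertTokenizer

    local_dir = "./pretrained/bert-base-cased"  # illustrative path

    tokenizer = BertTokenizer.from_pretrained(local_dir)
    model = BertModel.from_pretrained(local_dir)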

Hi,

this is often caused by the program being killed due to insufficient memory while loading BERT's weights. Are you training on CPU or GPU? And how much (unoccupied) CPU/GPU memory is available on your system?

@markus-eberts
I am training on CPU with 6 GB of available memory.
But why was there no out-of-memory message? The process just stopped.
And how can I check whether a memory shortage is the cause?
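
As a side note, when the Linux out-of-memory killer terminates a process there is usually no Python traceback, just a silent stop or a bare "Killed" message; the kernel log (dmesg) typically records the kill. A hedged sketch for watching memory while the weights load, using the third-party psutil package:

    # Illustrative check (requires `pip install psutil`): print available RAM
    # before loading BERT and the process's resident memory afterwards. If the
    # process dies in between without a traceback, an OOM kill is likely.
    import os
    import psutil
    from transformers import BertModel

    print(f"available RAM: {psutil.virtual_memory().available / 1e9:.2f} GB")
    model = BertModel.from_pretrained("bert-base-cased")
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(f"resident memory after loading: {rss / 1e9:.2f} GB")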

@markus-eberts
I also tried changing "model_path = data/models/conll04/" in example_train.conf, but it still stopped, even on a different machine with a GPU.

Hi, @markus-eberts
I think I solved the problem by upgrading transformers to the latest version and also installing sentencepiece via "pip install sentencepiece==0.1.91".
Training now works.

Maybe you can update the source to reflect this, since other users have probably encountered the same problem.

Hi, I ran into the same problem and followed your advice to upgrade transformers to the latest version and run "pip install sentencepiece==0.1.91", but now there is another problem:

...
2021-01-03 20:53:15,885 [MainThread ] [INFO ] loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin from cache at /home/yuhai/.cache/torch/transformers/35d8b9d36faaf46728a0192d82bf7d00137490cd6074e8500778afed552a67e5.3fadbea36527ae472139fe84cddaa65454d7429f12d543d80bfc3ad70de55ac2
2021-01-03 20:54:00,475 [MainThread ] [INFO ] Weights of SpERT not initialized from pretrained model: ['rel_classifier.weight', 'rel_classifier.bias', 'entity_classifier.weight', 'entity_classifier.bias', 'size_embeddings.weight']
2021-01-03 20:54:00,475 [MainThread ] [INFO ] Weights from pretrained model not used in SpERT: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']

The process then keeps running, but training does not seem to progress. May I ask why? I would appreciate any help in finding a solution.

Maybe you can try reverting transformers to the older version, manually downloading the pretrained model, and specifying its path in spert_trainer.py.

Hi @YuHaiA,

the logging output you posted is to be expected and nothing to worry about.

"then the process is just running but it seems that training is not going"

It's strange that the process is running without any output (you should see a progress bar during training). Could you check where exactly the training stops?
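
For context on the two "Weights ..." log messages above: SpERT adds task-specific layers on top of BERT, so their parameters cannot come from the pretrained checkpoint and are freshly initialized, while BERT's pretraining heads go unused. A rough sketch of the pattern (not the actual SpERT code; the layer sizes here are made up):

    # Rough illustration of why the two log messages are expected. SpERT-style
    # models subclass BertPreTrainedModel and add new layers; from_pretrained
    # can only fill in the BERT encoder weights.
    import torch.nn as nn
    from transformers import BertConfig, BertModel, BertPreTrainedModel

    class SpanModel(BertPreTrainedModel):  # hypothetical stand-in for SpERT
        def __init__(self, config: BertConfig):
            super().__init__(config)
            self.bert = BertModel(config)  # filled from the checkpoint
            # These layers have no counterpart in the checkpoint, hence
            # "Weights of ... not initialized from pretrained model":
            self.entity_classifier = nn.Linear(config.hidden_size, 5)
            self.rel_classifier = nn.Linear(config.hidden_size, 5)
            self.size_embeddings = nn.Embedding(100, 25)
            self.init_weights()

    model = SpanModel.from_pretrained("bert-base-cased")
    # The checkpoint's MLM/NSP heads ("cls.predictions...", "cls.seq_relationship...")
    # have no counterpart here, hence "Weights from pretrained model not used".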

@victorbai2

Regarding libraries: it seems the bert-base-cased model stopped working with the 'transformers' version specified for SpERT. I just upgraded to the newest 'transformers' version via 'pip install transformers[sentencepiece]', which also installs the correct sentencepiece library. Everything seems to work as expected. Hopefully nothing is broken by these changes; I currently do not have the time for an in-depth check. Thanks for your help, I updated the repository.
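
As a quick sanity check after upgrading (an illustrative snippet, not part of the repository), one can verify that the model loads without the process dying:

    # Smoke test: if this completes, the transformers/sentencepiece setup can
    # download and load bert-base-cased correctly.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModel.from_pretrained("bert-base-cased")
    print(model.config.hidden_size)  # 768 for bert-base-cased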