tensorflow/text

Cannot load SentencePiece model

dcferreira opened this issue · 1 comments

I'm struggling to load a SentencePiece model, and the error message is cryptic enough that I'm not sure where to go next.

The error I get is the following:

2020-01-31 12:07:45.420864: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at sentencepiece_kernels.cc:211 : Internal: external/com_google_sentencepiece/src/sentencepiece_processor.cc(73) [model_proto->ParseFromArray(serialized.data(), serialized.size())] 
Traceback (most recent call last):
  File "load.py", line 4, in <module>
    tokenizer = tensorflow_text.SentencepieceTokenizer('model.model')
  File "/home/dferreira/projects/porn_classifier_tf2/venv/lib/python3.7/site-packages/tensorflow_text/python/ops/sentencepiece_tokenizer.py", line 79, in __init__
    model=model)
  File "<string>", line 51, in sentencepiece_op
  File "<string>", line 125, in sentencepiece_op_eager_fallback
  File "/home/dferreira/projects/porn_classifier_tf2/venv/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: external/com_google_sentencepiece/src/sentencepiece_processor.cc(73) [model_proto->ParseFromArray(serialized.data(), serialized.size())]  [Op:SentencepieceOp]

I'm using Python 3.7.6 with:

tensorflow==2.1.0
tensorflow-text==2.1.0rc0
sentencepiece==0.1.85

The following is a minimal reproducible example:

  • Create a file raw_text with the content:

    This is a raw text file.
    With 2 lines.

  • Create train.py with the content:

    import sentencepiece

    sentencepiece.SentencePieceTrainer.Train('--input=raw_text --vocab_size=20 --model_prefix=model')

  • Run python train.py. You will get a model.model and model.vocab.
  • Create load.py with the content:

    import tensorflow_text

    tokenizer = tensorflow_text.SentencepieceTokenizer('model.model')

  • Run python load.py and you will get the error above.

It should be noted that loading the same model via sentencepiece.SentencePieceProcessor.Load works.

Like I said, I wasn't really able to interpret the error message.
How can I make this work?

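The input to SentencepieceTokenizer is a serialized string containing the model itself, i.e. the contents of the model file, not the model file path. Passing the path makes the op try to parse the path string as a model proto, which is why ParseFromArray fails. See [1] for an example of how to load the model file.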

[1]

self.model = gfile.GFile(sentencepiece_model_file, 'rb').read()
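
Applied to the load.py from the question, the fix could look roughly like the sketch below. It uses Python's built-in open instead of gfile, which should be enough for a local file. The helper names read_model_bytes and build_tokenizer are made up for illustration and are not part of tensorflow_text.

```python
def read_model_bytes(model_path):
    """Return the raw contents of a SentencePiece model file.

    The file is a binary serialized proto, so it is opened in binary
    mode to avoid any text decoding.
    """
    with open(model_path, "rb") as f:
        return f.read()


def build_tokenizer(model_path):
    """Build a SentencepieceTokenizer from a model file on disk.

    Hypothetical helper: it passes the file *contents* as `model`,
    not the path itself.
    """
    # Imported here so read_model_bytes stays usable without TensorFlow.
    import tensorflow_text

    return tensorflow_text.SentencepieceTokenizer(
        model=read_model_bytes(model_path))
```

With the model.model produced by train.py, build_tokenizer('model.model') should then construct the tokenizer without the ParseFromArray error, and calling .tokenize(['This is a raw text file.']) on the result should return token ids.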