How to export a custom SentencePiece Model/Tokenizer with pnp.SequentialProcessingModule
pyyush opened this issue · 9 comments
The README.md shows how to export a HuggingFace BERT tokenizer with pnp.SequentialProcessingModule, as shown below:
import onnx
from transformers import AutoTokenizer
from onnxruntime_extensions import pnp

tokenizer = AutoTokenizer.from_pretrained(model_name)
bert_tokenizer = pnp.PreHuggingFaceBert(hf_tok=tokenizer)
bert_model = onnx.load_model(str(model_path))
augmented_model = pnp.SequentialProcessingModule(bert_tokenizer, map_token_output, bert_model, post_process)
How can I do the same for a custom SentencePiece model/tokenizer?
import sentencepiece as spm
tokenizer = spm.SentencePieceProcessor(model_file=model)
It could be supported, but it needs some work.
I will put it in the backlog.
Thanks. Also, I tried exporting the XLM-RoBERTa tokenizer instead of the BERT tokenizer and it failed as shown below.
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
Error
AttributeError: 'XLMRobertaTokenizerFast' object has no attribute 'do_lower_case'
My guess is that do_lower_case should default to 0 if not present?
Yes, here is a working PR for the Roberta tokenizer: #365.
@sayanshaw24 did your test already include the pyyush code above?
I think the pyyush code above is for XLMRobertaTokenizer, not regular RobertaTokenizer (as implemented in the PR). However, I believe we looked into this issue in #311.
@wenbingl and @sayanshaw24 thanks for working on this issue. A few things -
- Looks like support for the SentencePiece tokenizer is still missing. Any updates on that?
- I tried out the RobertaTokenizer as follows
from transformers import AutoTokenizer
from onnxruntime_extensions import pnp
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
bert_tokenizer = pnp.PreHuggingFaceBert(hf_tok=tokenizer)
bert_tokenizer(["Hello world!"])
and got the following error -
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: [BertTokenizerVocab]: can not find tokens: [UNK]
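That error is consistent with the vocab formats involved: BERT's WordPiece vocab defines [UNK]/[CLS]/[SEP], while RoBERTa's BPE vocab spells its special tokens <unk>/<s>/</s>, so a BERT pre-processor cannot find [UNK]. A toy check illustrating the mismatch (the vocabs and helper are illustrative, not the extension's actual code):

```python
# Toy vocabs illustrating why a BERT pre-processor fails on RoBERTa:
# WordPiece (BERT) and BPE (RoBERTa) use different special-token spellings.
bert_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3}
roberta_vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3}

def has_bert_specials(vocab):
    """Mimics the lookup behind '[BertTokenizerVocab]: can not find tokens'."""
    return all(tok in vocab for tok in ("[UNK]", "[CLS]", "[SEP]"))
```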
- https://github.com/microsoft/onnxruntime-extensions/blob/main/operators/tokenizer/sentencepiece_tokenizer.hpp, is this what you are looking for?
- @sayanshaw24 , can you add a cvt function here to build the RobertaTokenizer? https://github.com/microsoft/onnxruntime-extensions/blob/main/onnxruntime_extensions/cvt.py
@wenbingl Yes. Adding support for augmenting an onnx model with a sentencepiece tokenizer will make things easier on the inference side and also ensure that users are not tied to HuggingFace.
We have added a cvt function to build RobertaTokenizer and tested it, so it should be fine to use as such: https://github.com/microsoft/onnxruntime-extensions/blob/main/test/test_robertatok.py#L94