microsoft/onnxruntime-extensions

How to export a custom SentencePiece Model/Tokenizer with pnp.SequentialProcessingModule

pyyush opened this issue · 9 comments

As shown below, the README.md demonstrates how to export a HuggingFace BERT tokenizer with pnp.SequentialProcessingModule

tokenizer = AutoTokenizer.from_pretrained(model_name)
bert_tokenizer = pnp.PreHuggingFaceBert(hf_tok=tokenizer)
bert_model = onnx.load_model(str(model_path))
augmented_model = pnp.SequentialProcessingModule(bert_tokenizer, map_token_output, bert_model, post_process)
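For context, here is a hedged sketch of how such an augmented model can be saved and run end to end. The pnp.export signature, the opset version, and the "input_text" input name are assumptions based on the repo's README examples; only get_library_path and SessionOptions.register_custom_ops_library are documented ONNX Runtime / extensions APIs.

```python
# Sketch only: save the augmented pipeline to one ONNX file and run it with
# the onnxruntime-extensions custom-op library registered. Imports are kept
# inside the function so the sketch reads standalone.
def save_and_run(augmented_model, output_path, text):
    import onnx
    import onnxruntime as ort
    from onnxruntime_extensions import pnp, get_library_path

    # Trace the pipeline into a single ONNX model (opset 12 is an assumption).
    full_model = pnp.export(augmented_model, text, opset_version=12)
    onnx.save_model(full_model, output_path)

    # The tokenizer ops live in the extensions' shared library, so it must be
    # registered on the session before the model can load.
    so = ort.SessionOptions()
    so.register_custom_ops_library(get_library_path())
    sess = ort.InferenceSession(output_path, so)

    # "input_text" is a hypothetical input name; check the exported graph.
    return sess.run(None, {"input_text": text})
```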

How can I do the same for a custom SentencePiece model/tokenizer?

import sentencepiece as spm
tokenizer = spm.SentencePieceProcessor(model_file=model)
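Until pnp grows first-class support, one possible workaround is the SentencepieceTokenizer custom op (domain "ai.onnx.contrib") that onnxruntime-extensions already ships; the repo's own tests drive it through PyOrtFunction. The input order and dtypes below follow those tests and should be treated as assumptions, not a documented stable API.

```python
# Hedged sketch: run a trained SentencePiece .model file through the
# SentencepieceTokenizer custom op via PyOrtFunction. Imports are inside the
# helper so the sketch reads standalone.
def tokenize_with_spm(spm_model_path, sentences):
    import numpy as np
    from onnxruntime_extensions import PyOrtFunction

    # The op consumes the serialized .model file as a uint8 tensor.
    with open(spm_model_path, "rb") as f:
        spm_model = np.frombuffer(f.read(), dtype=np.uint8)

    tokenize = PyOrtFunction.from_customop("SentencepieceTokenizer")
    tokens, indices = tokenize(
        spm_model,
        np.array(sentences),
        np.array([0], dtype=np.int64),      # nbest_size (0 = deterministic)
        np.array([0.0], dtype=np.float32),  # alpha (sampling smoothing)
        np.array([False]),                  # add_bos
        np.array([False]),                  # add_eos
        np.array([False]),                  # reverse
    )
    return tokens, indices
```

Because the op is a regular ONNX node, the same graph could later be composed in front of the downstream model, mirroring what pnp.SequentialProcessingModule does for BERT.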

It could be supported, but it needs some work.
I will put it in the backlog.

Thanks! Also, I tried exporting the XLM-RoBERTa tokenizer instead of the BERT tokenizer and it failed as shown below.

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')

Error

AttributeError: 'XLMRobertaTokenizerFast' object has no attribute 'do_lower_case'

My guess is that do_lower_case should default to 0 when it is not present?
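That default can also be emulated on the caller's side with getattr. A minimal sketch, where FastTokStub is a hypothetical stand-in for XLMRobertaTokenizerFast (which lacks the attribute):

```python
# Workaround sketch: fall back to 0 (no lowercasing) when a fast tokenizer
# class does not define do_lower_case, instead of raising AttributeError.
class FastTokStub:
    pass  # deliberately lacks do_lower_case, like XLMRobertaTokenizerFast

tok = FastTokStub()
do_lower_case = int(getattr(tok, "do_lower_case", 0))
print(do_lower_case)  # → 0
```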

@wenbingl any update on this?

Yes, there is a working PR for the Roberta tokenizer:
#365.

@sayanshaw24 did your test already include the pyyush code above?

I think the pyyush code above uses XLMRobertaTokenizer, not the regular RobertaTokenizer (which is what the PR implements). However, I believe we looked into this issue in #311.

@wenbingl and @sayanshaw24 thanks for working on this issue. A few things -

  1. Looks like support for sentencepiece tokenizer is still missing. Any updates on that?

  2. I tried out the RobertaTokenizer as follows:

from transformers import AutoTokenizer
from onnxruntime_extensions import pnp
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
bert_tokenizer = pnp.PreHuggingFaceBert(hf_tok=tokenizer)
bert_tokenizer(["Hello world!"])

and got the following error -

onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: [BertTokenizerVocab]: can not find tokens: [UNK]
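A likely explanation for this error: PreHuggingFaceBert builds a WordPiece-style BertTokenizer graph, which looks up BERT's special token [UNK] in the vocabulary, while RoBERTa's byte-level BPE vocab uses the <s>/<unk> convention instead. The tiny vocabs below are illustrative excerpts, not the real files:

```python
# Why the [UNK] lookup fails: BERT and RoBERTa vocabularies use different
# special-token conventions, so a BERT-style tokenizer graph cannot resolve
# its unknown-token id against a RoBERTa vocab.
bert_vocab = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102}
roberta_vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}

assert "[UNK]" in bert_vocab          # what BertTokenizerVocab looks for
assert "[UNK]" not in roberta_vocab   # hence "can not find tokens: [UNK]"
```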

@wenbingl Yes. Adding support for augmenting an onnx model with a sentencepiece tokenizer will make things easier on the inference side and also ensure that users are not tied to HuggingFace.

We have added a cvt function to build the RobertaTokenizer and tested it, so it should work as shown here: https://github.com/microsoft/onnxruntime-extensions/blob/main/test/test_robertatok.py#L94