microsoft/onnxruntime-extensions

How to export a custom SentencePiece Model/Tokenizer with pnp.SequentialProcessingModule

pyyush opened this issue · 9 comments

As shown below, the README.md demonstrates how to export a HuggingFace BERT tokenizer with pnp.SequentialProcessingModule

tokenizer = AutoTokenizer.from_pretrained(model_name)
bert_tokenizer = pnp.PreHuggingFaceBert(hf_tok=tokenizer)
bert_model = onnx.load_model(str(model_path))
augmented_model = pnp.SequentialProcessingModule(bert_tokenizer, map_token_output, bert_model, post_process)
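For context, here is a hedged sketch of how such an augmented model can be saved and run end to end. The pnp.export signature, the opset version, and the "input_text" input name are assumptions based on the repo's README examples; only get_library_path and SessionOptions.register_custom_ops_library are documented ONNX Runtime / extensions APIs.

```python
# Sketch only: save the augmented pipeline to one ONNX file and run it with
# the onnxruntime-extensions custom-op library registered. Imports are kept
# inside the function so the sketch reads standalone.
def save_and_run(augmented_model, output_path, text):
    import onnx
    import onnxruntime as ort
    from onnxruntime_extensions import pnp, get_library_path

    # Trace the pipeline into a single ONNX model (opset 12 is an assumption).
    full_model = pnp.export(augmented_model, text, opset_version=12)
    onnx.save_model(full_model, output_path)

    # The tokenizer ops live in the extensions' shared library, so it must be
    # registered on the session before the model can load.
    so = ort.SessionOptions()
    so.register_custom_ops_library(get_library_path())
    sess = ort.InferenceSession(output_path, so)

    # "input_text" is a hypothetical input name; check the exported graph.
    return sess.run(None, {"input_text": text})
```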

How can I do the same for a custom SentencePiece model/tokenizer?

import sentencepiece as spm
tokenizer = spm.SentencePieceProcessor(model_file=model)
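Until pnp grows first-class support, one possible workaround is the SentencepieceTokenizer custom op (domain "ai.onnx.contrib") that onnxruntime-extensions already ships; the repo's own tests drive it through PyOrtFunction. The input order and dtypes below follow those tests and should be treated as assumptions, not a documented stable API.

```python
# Hedged sketch: run a trained SentencePiece .model file through the
# SentencepieceTokenizer custom op via PyOrtFunction. Imports are inside the
# helper so the sketch reads standalone.
def tokenize_with_spm(spm_model_path, sentences):
    import numpy as np
    from onnxruntime_extensions import PyOrtFunction

    # The op consumes the serialized .model file as a uint8 tensor.
    with open(spm_model_path, "rb") as f:
        spm_model = np.frombuffer(f.read(), dtype=np.uint8)

    tokenize = PyOrtFunction.from_customop("SentencepieceTokenizer")
    tokens, indices = tokenize(
        spm_model,
        np.array(sentences),
        np.array([0], dtype=np.int64),      # nbest_size (0 = deterministic)
        np.array([0.0], dtype=np.float32),  # alpha (sampling smoothing)
        np.array([False]),                  # add_bos
        np.array([False]),                  # add_eos
        np.array([False]),                  # reverse
    )
    return tokens, indices
```

Because the op is a regular ONNX node, the same graph could later be composed in front of the downstream model, mirroring what pnp.SequentialProcessingModule does for BERT.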

It could be supported, but it needs some work.
I will put it in the backlog.

Thanks! Also, I tried exporting the XLM-RoBERTa tokenizer instead of the BERT tokenizer and it failed as shown below.

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')

Error

AttributeError: 'XLMRobertaTokenizerFast' object has no attribute 'do_lower_case'

My guess is that do_lower_case should default to 0 when it is not present?
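That default can also be emulated on the caller's side with getattr. A minimal sketch, where FastTokStub is a hypothetical stand-in for XLMRobertaTokenizerFast (which lacks the attribute):

```python
# Workaround sketch: fall back to 0 (no lowercasing) when a fast tokenizer
# class does not define do_lower_case, instead of raising AttributeError.
class FastTokStub:
    pass  # deliberately lacks do_lower_case, like XLMRobertaTokenizerFast

tok = FastTokStub()
do_lower_case = int(getattr(tok, "do_lower_case", 0))
print(do_lower_case)  # → 0
```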

@wenbingl any update on this?

Yes, there is a working PR for the Roberta tokenizer:
#365.

@sayanshaw24 did your test already include the pyyush code above?

I think the pyyush code above uses XLMRobertaTokenizer, not the regular RobertaTokenizer (which is what the PR implements). However, I believe we looked into this issue in #311.

@wenbingl and @sayanshaw24 thanks for working on this issue. A few things -

  1. Looks like support for sentencepiece tokenizer is still missing. Any updates on that?

  2. I tried out the RobertaTokenizer as follows:

from transformers import AutoTokenizer
from onnxruntime_extensions import pnp
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
bert_tokenizer = pnp.PreHuggingFaceBert(hf_tok=tokenizer)
bert_tokenizer(["Hello world!"])

and got the following error -

onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: [BertTokenizerVocab]: can not find tokens: [UNK]
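A likely explanation for this error: PreHuggingFaceBert builds a WordPiece-style BertTokenizer graph, which looks up BERT's special token [UNK] in the vocabulary, while RoBERTa's byte-level BPE vocab uses the <s>/<unk> convention instead. The tiny vocabs below are illustrative excerpts, not the real files:

```python
# Why the [UNK] lookup fails: BERT and RoBERTa vocabularies use different
# special-token conventions, so a BERT-style tokenizer graph cannot resolve
# its unknown-token id against a RoBERTa vocab.
bert_vocab = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102}
roberta_vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}

assert "[UNK]" in bert_vocab          # what BertTokenizerVocab looks for
assert "[UNK]" not in roberta_vocab   # hence "can not find tokens: [UNK]"
```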

@wenbingl Yes. Adding support for augmenting an onnx model with a sentencepiece tokenizer will make things easier on the inference side and also ensure that users are not tied to HuggingFace.

We have added a cvt function to build the RobertaTokenizer and tested it, so it should work as shown here: https://github.com/microsoft/onnxruntime-extensions/blob/main/test/test_robertatok.py#L94