microsoft/onnxruntime-extensions

Can I use SentencepieceTokenizer in C#?

tylike opened this issue · 5 comments

tylike commented

Hi!

I have an LLM in ONNX format and a sentencepiece.model file, and in Python I used Hugging Face and SentencePiece together. Now I plan to run inference in C# with ONNX Runtime, but I haven't found a suitable C# version of the SentencePiece library. I saw that there is a SentencepieceTokenizer here. Can I use SentencepieceTokenizer in C#?

My files were downloaded from here: https://huggingface.co/K024/ChatGLM-6b-onnx-u8s8/tree/main/chatglm-6b-int8-onnx-merged. Thank you.

tylike commented

I only found the C# demo code for registering the extension.
In Python, before running inference with the LLM, I first need to use the tokenizer to get the IDs of the user's input text, and then do the subsequent processing.
But I didn't find a class in this extension library that wraps this tokenizer for C#. Did I misunderstand something?
Can you give some examples?

For example, this is how I defined it in Python:
from sentencepiece import SentencePieceProcessor
sp_model = SentencePieceProcessor(model_file=model_path)
ids = sp_model.encode(s)
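
To make the target behavior concrete, here is a slightly fuller sketch of the same Python flow. The helper names load_sentencepiece and build_input_ids are hypothetical, and the batch-of-one layout is an assumption about what the ONNX model expects as input; any special tokens that the Hugging Face tokenizer adds on top of SentencePiece are not handled here.

```python
from typing import List, Sequence


def load_sentencepiece(model_path: str):
    """Load a SentencePiece model from disk (hypothetical helper).

    The import is done lazily so that build_input_ids below has no hard
    dependency on the sentencepiece package.
    """
    from sentencepiece import SentencePieceProcessor

    return SentencePieceProcessor(model_file=model_path)


def build_input_ids(tokenizer, text: str) -> List[List[int]]:
    """Encode text and wrap the IDs as a batch of one.

    `tokenizer` is any object with an encode(text) method that returns a
    sequence of integer token IDs, such as a SentencePieceProcessor.
    The [1, sequence_length] layout is an assumption, not something the
    model files in this thread confirm.
    """
    ids: Sequence[int] = tokenizer.encode(text)
    return [[int(i) for i in ids]]
```

With a real model file this would be called as build_input_ids(load_sentencepiece(model_path), s). A C# equivalent would need to produce the same integer IDs before they are fed to the inference session.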

@sayanshaw24, can you add the SPM tokenizer to our C# example?