Can I use SentencepieceTokenizer in C#?
tylike opened this issue · 5 comments
hi!
I have an LLM in ONNX format and a sentencepiece.model file, and in Python I used HuggingFace and SentencePiece together. Now I plan to run inference in C# with OnnxRuntime, but I haven't found a suitable C# version of the SentencePiece library. I saw that there is a SentencepieceTokenizer here. Can I use SentencepieceTokenizer in C#?
My files were downloaded from here: https://huggingface.co/K024/ChatGLM-6b-onnx-u8s8/tree/main/chatglm-6b-int8-onnx-merged. Thank you.
Yes, the NuGet package can be found here: https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime.Extensions/0.8.0
I only saw the C# demo code for registering the extension.
In Python, when I run inference with the LLM, I first use the tokenizer to get the IDs of the user's input text, and then do the subsequent processing.
But I didn't find a class in this extension library that wraps this tokenizer for C#. Did I misunderstand something?
Can you give some examples?
For example: I defined it like this in python:
from sentencepiece import SentencePieceProcessor
sp_model = SentencePieceProcessor(model_file=model_path)
ids = sp_model.encode(s)
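To make the step I am asking about concrete, here is a minimal sketch of the "text to IDs" part in Python. The helper names `load_tokenizer` and `text_to_input_ids` are made up for illustration, and the `[1, sequence_length]` batch layout is an assumption about the model input, not something taken from the model files.

```python
from typing import List


def load_tokenizer(model_path: str):
    """Load a SentencePiece model (requires the `sentencepiece` package)."""
    # Imported lazily so text_to_input_ids below also works with any object
    # that exposes an `encode(text) -> list of int` method.
    from sentencepiece import SentencePieceProcessor

    return SentencePieceProcessor(model_file=model_path)


def text_to_input_ids(tokenizer, text: str) -> List[List[int]]:
    """Encode `text` and wrap the IDs in a batch dimension of size 1.

    Hypothetical helper: the [1, sequence_length] layout is an assumption;
    check the real input names and shapes of the ONNX model before use.
    """
    ids = tokenizer.encode(text)
    return [list(ids)]
```

With the real model this would be called as `text_to_input_ids(load_tokenizer(model_path), s)`. This is the part I would like to be able to do from C#.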
@sayanshaw24, can you add the SPM tokenizer to our C# example?