huggingface/tokenizers

How to write custom Wordpiece class?

xinyinan9527 opened this issue · 0 comments

My aim is to get the rwkv5 model's "tokenizer.json", but the model is implemented through a slow tokenizer (class PreTrainedTokenizer).
I want to convert the slow tokenizer to a fast tokenizer, which requires "tokenizer = Tokenizer(WordPiece())", but rwkv5 has its own WordPiece vocabulary file.
So I want to create a custom WordPiece model.

Here is the code:

from tokenizers.models import Model

class MyWordpiece(Model):
    def __init__(self, vocab, unk_token):
        self.vocab = vocab
        self.unk_token = unk_token



test = MyWordpiece('./vocab.txt', "<s>")
Traceback (most recent call last):
  File "test.py", line 78, in <module>
    test = MyWordpiece('./vocab.txt',"<s>")
TypeError: Model.__new__() takes 0 positional arguments but 2 were given
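For reference, a minimal sketch of the conversion step that avoids subclassing Model entirely (the Model class is backed by Rust and does not accept a Python __init__): construct the built-in WordPiece from a vocab instead. The inline vocab and the "<s>" unknown token below are hypothetical stand-ins for the real rwkv5 files; with an actual vocab file, WordPiece.from_file("./vocab.txt", unk_token="<s>") would be the equivalent call.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece

# Hypothetical tiny vocab standing in for the real ./vocab.txt contents.
vocab = {"<s>": 0, "hello": 1, "##world": 2}

# Build the fast-tokenizer model directly instead of subclassing Model.
tokenizer = Tokenizer(WordPiece(vocab, unk_token="<s>"))

# With no pre-tokenizer set, the whole input is matched as one word.
enc = tokenizer.encode("hello")
print(enc.tokens)
```

This sidesteps the TypeError above, since no Python subclass of Model is ever instantiated.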