huggingface/tokenizers

BPE trainer ignoring special tokens.

henrycharlesworth opened this issue · 3 comments

I am trying to train a custom tokenizer. My use case is assembly code, so I want merges to be possible across full instructions (potentially spanning multiple "words"). To do this, I replace all spaces with a dummy token (e.g. "<space>") and use a pre-tokenizer that splits on "\n". This basically works, but I run into a problem when I try to add special tokens. The following is a simple example that reproduces the issue:

from tokenizers import Tokenizer, Regex
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Sequence as PretokenizerSequence, Split
from tokenizers.normalizers import Sequence as NormalizerSequence, Replace, BertNormalizer, Strip


corpus_file = "corpus.txt"
special_tokens = [
    "<s>",
    "<pad>",
    "</s>",
    "<unk>"
]
for i in range(20):
    special_tokens.append(f"<disasm_function_{i}>")
    special_tokens.append(f"<disasm_string_{i}>")

tokenizer = Tokenizer(BPE())
tokenizer.add_special_tokens(special_tokens)

tokenizer.normalizer = NormalizerSequence([
    Strip(),
    BertNormalizer(clean_text=True, strip_accents=True, lowercase=True),
    Replace(Regex(r"\s{2,}"), " "),
    Replace(" ", "<space>")
])
tokenizer.pre_tokenizer = PretokenizerSequence([
    Split("\n", behavior="removed")
])

trainer = BpeTrainer(
    special_tokens=special_tokens, vocab_size=10000, min_frequency=2,
)
tokenizer.train(files=[corpus_file], trainer=trainer)

tokenizer.save("example_tokenizer.json")
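
After training, I check the test-time behaviour roughly like this (a minimal sketch; the example instruction is made up in the style of the corpus shown below):

from tokenizers import Tokenizer

loaded = Tokenizer.from_file("example_tokenizer.json")
enc = loaded.encode("call <external>::<function_name><disasm_function_1></function_name> <eoi>")
print(enc.tokens)  # <disasm_function_1> comes out as a single token here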

An example segment of the corpus I am training on looks something like this:

lea rsi,<code_addr_1> <string_literal><disasm_string_0></string_literal> <eoi>
mov edi, eax <eoi>
call <external>::<function_name><disasm_function_1></function_name> <eoi>
mov rax, qword ptr <local_var_0> <eoi>
mov rdi, rax <eoi>
call <external>::<function_name><disasm_function_2></function_name> <eoi>
mov rax, qword ptr <local_var_0> <eoi>
mov rax, qword ptr [rax]<unk_0> <eoi>
mov rdi, rax <eoi>
call <external>::<function_name><disasm_function_3></function_name> <eoi>
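
For reference, this is how I have been sanity-checking the normalizer and pre-tokenizer on lines like these (a sketch; the commented outputs are what I expect rather than captured output):

print(tokenizer.normalizer.normalize_str("mov edi, eax <eoi>"))
# expect: "mov<space>edi,<space>eax<space><eoi>"
print(tokenizer.pre_tokenizer.pre_tokenize_str("mov edi, eax <eoi>\nmov rdi, rax <eoi>"))
# expect one pre-token per instruction, since the only split is on "\n"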

So the aim is to ensure that e.g. <disasm_function_1> is always a single token. This works at test time (i.e. these special tokens are always tokenized as single tokens), but it is clearly not happening during BPE training. If I examine the tokens/merges that come out, many of them contain special tokens inside them, e.g. from the resulting JSON file:

"</return_val><space><calling_conv>stdcall</calling_conv><func_name><disasm_function_0></func_name><parameters>(": 370,
      "pop<space>r1": 371,
      "call<space><external>::<function_name><disasm_function_2></function_name><space><eoi>": 372,

You can see the special tokens embedded inside these learned tokens.
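
A rough way to see how widespread this is (a sketch that just reads the vocab back out of the JSON file produced above):

import json

with open("example_tokenizer.json") as f:
    vocab = json.load(f)["model"]["vocab"]

contaminated = [tok for tok in vocab
                if tok not in special_tokens and any(s in tok for s in special_tokens)]
print(f"{len(contaminated)} of {len(vocab)} vocab entries contain a special token")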

Is this expected behaviour? My assumption was that the BPE trainer would prevent this from happening (I do pass it the list of special tokens, and why else would it need that argument?). It is also not desirable to fill up the vocab with lots of merges that are never going to be valid.

Is there any way to stop this from happening (or is there anything I haven't set up properly)?

EDIT:

My current horrible workaround is to do:

tokenizer.pre_tokenizer = PretokenizerSequence([
    Split("\n", behavior="removed")
] + [Split(tok, behavior="isolated") for tok in special_tokens])

which seems to work, but can't be the best way.
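
A slightly tidier variant of the same idea (an untested sketch, assuming all the special tokens are plain literals) would be a single Split over a regex alternation of the special tokens:

import re

from tokenizers import Regex
from tokenizers.pre_tokenizers import Sequence as PretokenizerSequence, Split

# isolate any special token in one pass instead of adding one Split per token
special_pattern = Regex("|".join(re.escape(tok) for tok in special_tokens))
tokenizer.pre_tokenizer = PretokenizerSequence([
    Split("\n", behavior="removed"),
    Split(special_pattern, behavior="isolated"),
])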

Hey! You are adding the special tokens before setting the normalizer; this worked for me:

from tokenizers import Tokenizer, Regex
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Sequence as PretokenizerSequence, Split
from tokenizers.normalizers import Sequence as NormalizerSequence, Replace, BertNormalizer, Strip


corpus_file = "corpus.txt"
special_tokens = [
    "<s>",
    "<pad>",
    "</s>",
    "<unk>"
]
for i in range(20):
    special_tokens.append(f"<disasm_function_{i}>")
    special_tokens.append(f"<disasm_string_{i}>")

tokenizer = Tokenizer(BPE())
- tokenizer.add_special_tokens(special_tokens)

tokenizer.normalizer = NormalizerSequence([
    Strip(),
    BertNormalizer(clean_text=True, strip_accents=True, lowercase=True),
    Replace(Regex(r"\s{2,}"), " "),
    Replace(" ", "<space>")
])
tokenizer.pre_tokenizer = PretokenizerSequence([
    Split("\n", behavior="removed")
])
+ tokenizer.add_special_tokens(special_tokens)
trainer = BpeTrainer(
    special_tokens=special_tokens, vocab_size=10000, min_frequency=2,
)
tokenizer.train(files=[corpus_file], trainer=trainer)

tokenizer.save("example_tokenizer.json")

So I tried this, and for me it still gives exactly the same result. It works at test time (as the previous version did), but during training it is still merging across the special tokens.

You are right, sorry. Here is a PR with a fix; not sure why we never had that.