huggingface/tokenizers

Train tokenizer on integer lists, not strings

Closed this issue · 4 comments

Hi,

I was hoping to train a BPE tokenizer, but in my case I have lists of integers rather than strings. I'd essentially like to apply the merging rules to adjacent integers in these lists, rather than to subword characters. Is there a straightforward way to do this? The current setup seems to require strings.
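To make the request concrete, here is a minimal pure-Python sketch of BPE merging applied directly to integer sequences instead of subword characters. All names here (`train_int_bpe`, `merge_pair`, etc.) are illustrative, not part of the tokenizers API:

```python
from collections import Counter

def get_pair_counts(seqs):
    """Count occurrences of each adjacent integer pair across all sequences."""
    counts = Counter()
    for seq in seqs:
        for pair in zip(seq, seq[1:]):
            counts[pair] += 1
    return counts

def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_int_bpe(seqs, num_merges, next_id):
    """Learn up to `num_merges` merge rules; new tokens start at `next_id`."""
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(seqs)
        if not counts:
            break
        pair, _ = counts.most_common(1)[0]
        merges.append((pair, next_id))
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        next_id += 1
    return merges, seqs
```

For example, with `[[1, 2, 3, 1, 2], [1, 2, 1, 2]]` the most frequent pair `(1, 2)` is merged first, producing `[[100, 3, 100], [100, 100]]` when new IDs start at 100. This is exactly the string-BPE training loop with characters swapped for integers, which is why it feels like the library should support it natively.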

Bumping this, as it would make the library easier and more straightforward to use for modalities other than text, e.g. molecules, DNA, and music.

In MidiTok we basically map each integer to a byte to "bypass" this limitation, but this is not straightforward and adds overhead.

Edit: this approach also only scales up to the number of Unicode characters.

Second this. I'm training tokenizers on malware bytes. At the moment I have to map bytes to UTF-8 characters before sending them through the tokenizers library. Tokenizers should work on any sequence, not just strings.
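For anyone landing here, a minimal sketch of the mapping workaround both comments describe: translate each integer ID into a single Unicode character, feed the resulting strings to a string-based tokenizer, and translate back afterwards. The `OFFSET` value and function names are illustrative choices, not anything the library provides:

```python
# Start above the ASCII control characters; note chr() only accepts code
# points up to 0x10FFFF and rejects the surrogate range 0xD800-0xDFFF,
# which is the scaling limit mentioned above.
OFFSET = 0x100

def ints_to_str(ids):
    """Encode a list of integer IDs as a string of stand-in characters."""
    return "".join(chr(i + OFFSET) for i in ids)

def str_to_ints(s):
    """Decode a stand-in string back into the original integer IDs."""
    return [ord(c) - OFFSET for c in s]
```

The round trip `str_to_ints(ints_to_str(ids)) == ids` holds as long as every shifted ID stays a valid, non-surrogate code point, which is the extra bookkeeping (and the hard ceiling) this issue is asking to remove.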

@Narsil @ArthurZucker how difficult do you estimate this?

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.