How to get RoBERTaTokenizer vocab.json and merges file
songtaoshi opened this issue · 12 comments
❓ Questions & Help
Hello, I trained RoBERTa on my customized corpus following the fairseq instructions. I am confused about how to generate the RoBERTa vocab.json and merges.txt, because I want to use the pytorch-transformers RobertaTokenizer. I only have a dict.txt in my data.
Hi! RoBERTa's tokenizer is based on the GPT-2 tokenizer.
Please note that unless you have completely re-trained RoBERTa from scratch, there is usually no need to change the vocab.json and merges.txt files.
Currently we do not have a built-in way of creating your vocab/merges files, for either GPT-2 or RoBERTa. I'm describing the process we followed for RoBERTa, hoping that you will be able to solve your problem by following a similar process.
Encoding a sentence is done according to the following process:
Say you start with this text:
What's up with the tokenizer?
The tokenizer first tokenizes according to the merges file:
['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']
Then, according to the values in the vocab.json, these tokens are replaced by their indices:
['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']
---- becomes ----
[2061, 338, 510, 351, 262, 11241, 7509, 30]
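The second step is just a dictionary lookup. A minimal illustration using only the token/index pairs from the example above (the real vocab.json covers the full GPT-2 vocabulary, so this toy dictionary is an assumption for demonstration):

```python
# Toy slice of the GPT-2 vocab.json containing only the example's tokens.
vocab = {'What': 2061, "'s": 338, 'Ġup': 510, 'Ġwith': 351,
         'Ġthe': 262, 'Ġtoken': 11241, 'izer': 7509, '?': 30}

# Output of the merges-file tokenization step.
tokens = ['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']

# Replace each token by its index.
ids = [vocab[t] for t in tokens]
print(ids)  # [2061, 338, 510, 351, 262, 11241, 7509, 30]
```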
The dict.txt file generated by RoBERTa actually modifies the vocab.json from the original GPT-2 by shifting the indices. If you open the dict.txt file you should see values such as these (the values shown here are the first values of the native RoBERTa dict.txt):
13 850314647
262 800385005
11 800251374
284 432911125
These are token indices ordered by highest occurrence count. For the first example, the token 13 in the GPT-2 tokenizer is the token ".": gpt2_tokenizer.encode('.') returns [13].
In order to get the appropriate RoBERTa vocab.json, we remapped the original GPT-2 vocab.json with this dict. The first four values are the special tokens:
{"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
Following those values are the values from the dict.txt, ordered by index. For example:
gpt2_tokenizer.decode(13)  -> '.'    # dict index 0 (13 is on the 1st line of the dict.txt)
gpt2_tokenizer.decode(262) -> ' the' # dict index 1 (262 is on the 2nd line of the dict.txt)
gpt2_tokenizer.decode(11)  -> ','    # dict index 2 (11 is on the 3rd line of the dict.txt)
gpt2_tokenizer.decode(284) -> ' to'  # dict index 3 (284 is on the 4th line of the dict.txt)
The vocab then becomes:
{"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, ".": 4, "Ġthe": 5, ",": 6, "Ġto": 7}
That's how you create the vocab.json. The merges.txt file is unchanged.
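The remapping described above can be sketched in a few lines of plain Python. This is an illustration of the procedure, not Hugging Face's actual conversion code; the function and variable names (remap_vocab, gpt2_vocab, dict_lines) are hypothetical, and the toy GPT-2 vocab contains only the four tokens from the example:

```python
def remap_vocab(gpt2_vocab, dict_lines):
    """Build a RoBERTa-style vocab from the GPT-2 vocab and a fairseq dict.txt.

    gpt2_vocab: token -> GPT-2 index.
    dict_lines: lines of dict.txt, each "<gpt2_index> <count>", ordered by frequency.
    """
    # Invert the GPT-2 vocab so we can look tokens up by index.
    id_to_token = {idx: tok for tok, idx in gpt2_vocab.items()}
    # The first four entries are RoBERTa's special tokens.
    vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
    # Append each dict.txt token in order, assigning the next free index.
    for line in dict_lines:
        gpt2_index = int(line.split()[0])
        vocab[id_to_token[gpt2_index]] = len(vocab)
    return vocab

# Toy GPT-2 vocab restricted to the tokens used in the example above.
gpt2_vocab = {".": 13, "Ġthe": 262, ",": 11, "Ġto": 284}
dict_lines = ["13 850314647", "262 800385005", "11 800251374", "284 432911125"]
print(remap_vocab(gpt2_vocab, dict_lines))
# {'<s>': 0, '<pad>': 1, '</s>': 2, '<unk>': 3, '.': 4, 'Ġthe': 5, ',': 6, 'Ġto': 7}
```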
@julien-c Thanks for your reply!
Hi, I am pre-training RoBERTa on my own corpus, which consists of numbers:
4758 7647 16712 6299 11255 6068 695 23 19536 7142 7009 9655 10524 4864 7379 17348 7501 17225 14123 13711 7133 11255 21097 3277 6068 695 4190 1269 4526 12266 2161 17597 15274
23 6484 17225 8217 16374 11122 5592 21224 7251 11188 533 9685 11487 4246 19311 19851 8038 15822 9435 15274
1027 1269 14461 4815 12617 14123 3268 3390 8197 19019 16908 20958 15033 16541 19421 19429 7664 17253 4246 11123 1884 15274
5863 17166 21224 13159 2289 11944 8205 17083 13426 21224 17225 17186 14499 6225 16201 400 5635 3219 16498 15274
Each separate line represents a paragraph. So I skip the BPE encoding and just binarize my data into language-model format, using:
TEXT=examples/language_model/wikitext-103
fairseq-preprocess \
    --only-source \
    --trainpref $TEXT/wiki.train.tokens \
    --validpref $TEXT/wiki.valid.tokens \
    --testpref $TEXT/wiki.test.tokens \
    --destdir data-bin/wikitext-103 \
    --workers 20
I think I can construct the vocab.json by myself, but since I didn't use BPE I have no merges.txt, so I am wondering if I can just use an empty file to mean no merging.
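For a corpus that is already tokenized into whitespace-separated symbols, one possible approach (a sketch of my assumption, not an answer from the maintainers) is to build the vocab.json directly from the fairseq dict.txt, whose lines have the form "<symbol> <count>", and to pair it with a merges.txt containing only the version header so that no merge rules ever apply:

```python
import json

def build_vocab(dict_lines):
    """Build a RoBERTa-style vocab from fairseq dict.txt lines ("<symbol> <count>")."""
    vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
    for line in dict_lines:
        symbol = line.split()[0]
        vocab[symbol] = len(vocab)
    return vocab

# Toy dict.txt contents with made-up counts.
dict_lines = ["4758 120", "7647 98", "16712 97"]
vocab = build_vocab(dict_lines)

with open("vocab.json", "w", encoding="utf-8") as fp:
    json.dump(vocab, fp, ensure_ascii=False)
with open("merges.txt", "w", encoding="utf-8") as fp:
    fp.write("#version: 0.2\n")  # header line only: zero merge rules

print(vocab)  # {'<s>': 0, '<pad>': 1, '</s>': 2, '<unk>': 3, '4758': 4, '7647': 5, '16712': 6}
```

Whether an empty (header-only) merges.txt behaves correctly with RobertaTokenizer is worth verifying on your installed version before training.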
I want to know this too
You can get the vocab.json and merges.txt from:
https://huggingface.co/transformers/v1.1.0/_modules/pytorch_transformers/tokenization_roberta.html
The files still come from Hugging Face.
@songtaoshi I have a similar problem. Did you get your issue resolved?
For a new language and a totally new dataset, preparing your own merges.txt and vocab.json is certainly necessary. Check this:
https://towardsdatascience.com/transformers-from-scratch-creating-a-tokenizer-7d7418adb403
It is a step-by-step tutorial on how to use the "oscar" dataset to train your own byte-level BPE tokenizer (which outputs exactly the merges.txt and vocab.json files).
1. data prepare
import datasets
from tqdm.auto import tqdm
from pathlib import Path

dataset = datasets.load_dataset('oscar', 'unshuffled_deduplicated_la')

# Write the corpus out as plain-text files of 5,000 lines each.
text_data = []
file_count = 0
for sample in tqdm(dataset['train']):
    sample = sample['text'].replace('\n', '')
    text_data.append(sample)
    if len(text_data) == 5000:
        with open(f'./oscar_la/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1

# Write out whatever remains after the loop.
with open(f'./oscar_la/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))

paths = [str(x) for x in Path('./oscar_la').glob('*.txt')]
# e.g. ['oscar_la/text_1.txt', 'oscar_la/text_2.txt', 'oscar_la/text_3.txt', 'oscar_la/text_0.txt']
2. train
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=30522, min_frequency=2,
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])
3. save
tokenizer.save_model('./oscar_la/blbpe')
['./oscar_la/blbpe/vocab.json', './oscar_la/blbpe/merges.txt']
@Xianchao-Wu Thanks, that helped me a lot!
Can you please give a reference to the code, or explain how we can generate the tokens for a given text using the merges.txt file?
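For reference, applying merges.txt by hand boils down to repeatedly merging the adjacent symbol pair with the highest-priority rule (earlier lines in merges.txt win). Below is a minimal sketch of that loop with made-up toy merge rules, not the real GPT-2 rules, and without the byte-level 'Ġ' handling; the function name bpe_tokenize is hypothetical:

```python
def bpe_tokenize(word, merges):
    """Apply BPE merge rules to a single pre-tokenized word.

    merges: ordered list of (left, right) pairs, as read from merges.txt
    (one pair per line; earlier lines have higher priority).
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # start from individual characters
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        pairs = [(ranks.get((a, b), float('inf')), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(pairs)
        if best_rank == float('inf'):
            break  # no applicable merge rule remains
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy merge rules in priority order (a real merges.txt has ~50k of these).
merges = [('e', 'r'), ('t', 'o'), ('k', 'e'), ('ke', 'n'),
          ('to', 'ken'), ('i', 'z'), ('iz', 'er')]
print(bpe_tokenize('tokenizer', merges))  # ['token', 'izer']
```

Real byte-level BPE first maps the text to bytes (spaces become 'Ġ') before this merge loop runs, so the output above only mirrors the shape of the 'Ġtoken'/'izer' split from earlier in the thread.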