How to get RoBERTaTokenizer vocab.json and merges file
songtaoshi opened this issue · 12 comments
❓ Questions & Help
Hello, I trained RoBERTa on my customized corpus following the fairseq instructions. I am confused about how to generate the RoBERTa vocab.json and merges.txt, because I want to use the pytorch-transformers RobertaTokenizer. I only have a dict.txt in my data.
Hi! RoBERTa's tokenizer is based on the GPT-2 tokenizer.
Please note that unless you have completely re-trained RoBERTa from scratch, there is usually no need to change the vocab.json and merges.txt files.
Currently we do not have a built-in way of creating your vocab/merges files, for either GPT-2 or RoBERTa. I'm describing the process we followed for RoBERTa, hoping that you will be able to solve your problem by following a similar process.
Encoding a sentence is done according to the following process:
Say you start with this text:
What's up with the tokenizer?
The tokenizer first tokenizes according to the merges file:
['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']
Then, according to the values in the vocab.json, these tokens are replaced by their indices:
['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']
---- becomes ----
[2061, 338, 510, 351, 262, 11241, 7509, 30]
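The second step is just a dictionary lookup. A minimal illustration using only the token/index pairs from the example above (the real vocab.json covers the full GPT-2 vocabulary, so this toy dictionary is an assumption for demonstration):

```python
# Toy slice of the GPT-2 vocab.json containing only the example's tokens.
vocab = {'What': 2061, "'s": 338, 'Ġup': 510, 'Ġwith': 351,
         'Ġthe': 262, 'Ġtoken': 11241, 'izer': 7509, '?': 30}

# Output of the merges-file tokenization step.
tokens = ['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']

# Replace each token by its index.
ids = [vocab[t] for t in tokens]
print(ids)  # [2061, 338, 510, 351, 262, 11241, 7509, 30]
```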
The dict.txt file generated by RoBERTa actually modifies the vocab.json from the original GPT-2 by shifting the indices. If you open the dict.txt file you should see values such as these (the values shown here are the first values of the native RoBERTa dict.txt):
13 850314647
262 800385005
11 800251374
284 432911125
These are token indices ordered by highest occurrence count. For the first example, the token 13 in the GPT-2 tokenizer is the token ".": gpt2_tokenizer.encode('.') returns [13].
In order to get the appropriate RoBERTa vocab.json, we remapped the original GPT-2 vocab.json with this dict. The first four values are the special tokens:
{"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
Following those values are the values from the dict.txt, ordered by index. For example:
gpt2_tokenizer.decode(13)  -> '.'    # dict index 0 (13 is on the 1st line of the dict.txt)
gpt2_tokenizer.decode(262) -> ' the' # dict index 1 (262 is on the 2nd line of the dict.txt)
gpt2_tokenizer.decode(11)  -> ','    # dict index 2 (11 is on the 3rd line of the dict.txt)
gpt2_tokenizer.decode(284) -> ' to'  # dict index 3 (284 is on the 4th line of the dict.txt)
The vocab then becomes:
{"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, ".": 4, "Ġthe": 5, ",": 6, "Ġto": 7}
That's how you create the vocab.json. The merges.txt file is unchanged.
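The remapping described above can be sketched in a few lines of plain Python. This is an illustration of the procedure, not Hugging Face's actual conversion code; the function and variable names (remap_vocab, gpt2_vocab, dict_lines) are hypothetical, and the toy GPT-2 vocab contains only the four tokens from the example:

```python
def remap_vocab(gpt2_vocab, dict_lines):
    """Build a RoBERTa-style vocab from the GPT-2 vocab and a fairseq dict.txt.

    gpt2_vocab: token -> GPT-2 index.
    dict_lines: lines of dict.txt, each "<gpt2_index> <count>", ordered by frequency.
    """
    # Invert the GPT-2 vocab so we can look tokens up by index.
    id_to_token = {idx: tok for tok, idx in gpt2_vocab.items()}
    # The first four entries are RoBERTa's special tokens.
    vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
    # Append each dict.txt token in order, assigning the next free index.
    for line in dict_lines:
        gpt2_index = int(line.split()[0])
        vocab[id_to_token[gpt2_index]] = len(vocab)
    return vocab

# Toy GPT-2 vocab restricted to the tokens used in the example above.
gpt2_vocab = {".": 13, "Ġthe": 262, ",": 11, "Ġto": 284}
dict_lines = ["13 850314647", "262 800385005", "11 800251374", "284 432911125"]
print(remap_vocab(gpt2_vocab, dict_lines))
# {'<s>': 0, '<pad>': 1, '</s>': 2, '<unk>': 3, '.': 4, 'Ġthe': 5, ',': 6, 'Ġto': 7}
```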
@julien-c Thanks for your reply!
Hi, I am pre-training RoBERTa on my own corpus, which consists of numbers:
4758 7647 16712 6299 11255 6068 695 23 19536 7142 7009 9655 10524 4864 7379 17348 7501 17225 14123 13711 7133 11255 21097 3277 6068 695 4190 1269 4526 12266 2161 17597 15274
23 6484 17225 8217 16374 11122 5592 21224 7251 11188 533 9685 11487 4246 19311 19851 8038 15822 9435 15274
1027 1269 14461 4815 12617 14123 3268 3390 8197 19019 16908 20958 15033 16541 19421 19429 7664 17253 4246 11123 1884 15274
5863 17166 21224 13159 2289 11944 8205 17083 13426 21224 17225 17186 14499 6225 16201 400 5635 3219 16498 15274
Each separate line represents a paragraph. So I skip the BPE encoding and just binarize my data into language-model format, using:
TEXT=examples/language_model/wikitext-103
fairseq-preprocess \
    --only-source \
    --trainpref $TEXT/wiki.train.tokens \
    --validpref $TEXT/wiki.valid.tokens \
    --testpref $TEXT/wiki.test.tokens \
    --destdir data-bin/wikitext-103 \
    --workers 20
I think I can construct the vocab.json by myself, but since I didn't use BPE I have no merges.txt, so I am wondering if I can just use an empty file to mean no merging.
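For a corpus that is already tokenized into whitespace-separated symbols, one possible approach (a sketch of my assumption, not an answer from the maintainers) is to build the vocab.json directly from the fairseq dict.txt, whose lines have the form "<symbol> <count>", and to pair it with a merges.txt containing only the version header so that no merge rules ever apply:

```python
import json

def build_vocab(dict_lines):
    """Build a RoBERTa-style vocab from fairseq dict.txt lines ("<symbol> <count>")."""
    vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
    for line in dict_lines:
        symbol = line.split()[0]
        vocab[symbol] = len(vocab)
    return vocab

# Toy dict.txt contents with made-up counts.
dict_lines = ["4758 120", "7647 98", "16712 97"]
vocab = build_vocab(dict_lines)

with open("vocab.json", "w", encoding="utf-8") as fp:
    json.dump(vocab, fp, ensure_ascii=False)
with open("merges.txt", "w", encoding="utf-8") as fp:
    fp.write("#version: 0.2\n")  # header line only: zero merge rules

print(vocab)  # {'<s>': 0, '<pad>': 1, '</s>': 2, '<unk>': 3, '4758': 4, '7647': 5, '16712': 6}
```

Whether an empty (header-only) merges.txt behaves correctly with RobertaTokenizer is worth verifying on your installed version before training.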
I want to know this too
You can get the vocab.json and merges.txt from:
https://huggingface.co/transformers/v1.1.0/_modules/pytorch_transformers/tokenization_roberta.html
The files still come from Hugging Face.
@songtaoshi I have a similar problem. Did you get your issue resolved?
For a new language and a totally new dataset, preparing your own merges.txt and vocab.json is certainly necessary. Check this:
https://towardsdatascience.com/transformers-from-scratch-creating-a-tokenizer-7d7418adb403
It is a step-by-step tutorial on how to use the "oscar" dataset to train your own byte-level BPE tokenizer (which outputs exactly the merges.txt and vocab.json files).
1. data prepare
import datasets
from tqdm.auto import tqdm
from pathlib import Path

dataset = datasets.load_dataset('oscar', 'unshuffled_deduplicated_la')

# Write the corpus out as plain-text files of 5,000 lines each.
text_data = []
file_count = 0
for sample in tqdm(dataset['train']):
    sample = sample['text'].replace('\n', '')
    text_data.append(sample)
    if len(text_data) == 5000:
        with open(f'./oscar_la/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1

# Write out whatever remains after the loop.
with open(f'./oscar_la/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))

paths = [str(x) for x in Path('./oscar_la').glob('*.txt')]
# e.g. ['oscar_la/text_1.txt', 'oscar_la/text_2.txt', 'oscar_la/text_3.txt', 'oscar_la/text_0.txt']
2. train
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=30522, min_frequency=2,
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])
3. save
tokenizer.save_model('./oscar_la/blbpe')
['./oscar_la/blbpe/vocab.json', './oscar_la/blbpe/merges.txt']
@Xianchao-Wu Thanks, that helped me a lot!
Can you please give a reference to the code, or explain how we can generate the tokens for a given text using the merges.txt file?
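For reference, applying merges.txt by hand boils down to repeatedly merging the adjacent symbol pair with the highest-priority rule (earlier lines in merges.txt win). Below is a minimal sketch of that loop with made-up toy merge rules, not the real GPT-2 rules, and without the byte-level 'Ġ' handling; the function name bpe_tokenize is hypothetical:

```python
def bpe_tokenize(word, merges):
    """Apply BPE merge rules to a single pre-tokenized word.

    merges: ordered list of (left, right) pairs, as read from merges.txt
    (one pair per line; earlier lines have higher priority).
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # start from individual characters
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        pairs = [(ranks.get((a, b), float('inf')), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(pairs)
        if best_rank == float('inf'):
            break  # no applicable merge rule remains
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy merge rules in priority order (a real merges.txt has ~50k of these).
merges = [('e', 'r'), ('t', 'o'), ('k', 'e'), ('ke', 'n'),
          ('to', 'ken'), ('i', 'z'), ('iz', 'er')]
print(bpe_tokenize('tokenizer', merges))  # ['token', 'izer']
```

Real byte-level BPE first maps the text to bytes (spaces become 'Ġ') before this merge loop runs, so the output above only mirrors the shape of the 'Ġtoken'/'izer' split from earlier in the thread.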