musixmatchresearch/umberto

Training protocol for Roberta model with SentencePiece

ksopyla opened this issue · 6 comments

Hi,
I am trying to train a RoBERTa model from scratch with the fairseq library. I want to pretrain it on Polish text, but I can't find a good source that explains the details.

There is a README that explains how to do this with BPE for English: https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md

But not everything is obvious to me. First of all, I want to use SentencePiece instead of BPE. You trained your model with SentencePiece, am I correct?
Could you share what the vocab file for fairseq should look like?
My trained vocab has this format:

<unk>	0
▁,	-2.98959
▁.	-3.06552
▁w	-3.656
a	-3.99339
▁i	-4.16481
...

What is the target data format? The same as in the original BERT?

[unk]
,
.
w
##a
i
...

I am also unsure about data preparation: how should I preprocess and encode the text?

In the tutorial they encode the text with BPE:

mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
for SPLIT in train valid test; do \
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json gpt2_bpe/encoder.json \
        --vocab-bpe gpt2_bpe/vocab.bpe \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done

So, should I write my own script that encodes my data into SentencePiece tokens, and then use:

fairseq-preprocess \
    --only-source \
    --srcdict sentencepiece/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.sp \
    --validpref wikitext-103-raw/wiki.valid.sp \
    --testpref wikitext-103-raw/wiki.test.sp \
    --destdir data-bin/wikitext-103 \
    --workers 60

Could you also share some information about your settings?

TOTAL_UPDATES=??    # Total number of training steps
WARMUP_UPDATES=??    # Warmup the learning rate over this many updates
PEAK_LR=0.0005          # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=16          # Increase the batch size 16x

Finally, could you share how many GPUs you used and how long it took to train the model?
Any tips or warnings are welcome.

Thank you in advance. :)

OK, I will try to give a brief description.
Yes, we used the SentencePiece tokenizer, and we ran it from the command line.
SentencePiece implements two segmentation algorithms; one of them is BPE, the same one used in CamemBERT.
With the command below, you train the SentencePiece tokenizer on a very large corpus:

# Train SentencePiece Tokenizer on large corpus
spm_train \
    --input=[raw_text_file] \
    --max_sentence_length=[max length of a sentence you accept] \
    --model_prefix=spm.bpe \
    --vocab_size=[8000, 16000, 32000, etc.] \
    --model_type=bpe \
    --shuffle_input_sentence=true \
    --pad_id=-1 \
    --input_sentence_size=[choose a smaller amount of data randomly]

Then you have to encode your data in the format that Fairseq training needs.

# Encode Data with SentencePiece Tokenizer
#   --model          model produced by the SentencePiece training step
#   --extra_options  bos:eos encodes begin-of-sequence and end-of-sequence markers
#   --output_format  piece writes the encoded data as SentencePiece tokens
#   file.raw is the raw input data; file.bpe is the encoded output
spm_encode \
    --model=spm.bpe.model \
    --extra_options=bos:eos \
    --output_format=piece \
    < file.raw \
    > file.bpe

SentencePiece training also gives you a dictionary (the .vocab file written next to the model).
You have to change its separator from \t (SentencePiece) to a space, because that is the notation fairseq expects.
Then split your file.bpe into train.bpe, valid.bpe and test.bpe, and preprocess your data:

fairseq-preprocess \
    --only-source \
    --srcdict sentencepiece.bpe.vocab \
    --trainpref train.bpe \
    --validpref valid.bpe \
    --testpref test.bpe \
    --destdir $DATA_DIR \
    --workers $N_WORKERS
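
The two preparation steps mentioned before this command (changing the dictionary separator and splitting file.bpe) could be sketched roughly as below. This is only an illustration: the function names, the output directory argument, and the idea of holding out the last lines of the file for validation and test are assumptions, not something the thread specifies. Depending on your fairseq version, the dictionary loader may also expect integer counts in the second column or reject entries for special symbols, so check the result against your installation.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names and the split-by-line-count strategy are
# assumptions for illustration, not the authors' actual scripts.

# Rewrite a SentencePiece vocab ("piece<TAB>score" per line) with a single
# space as the separator.
spm_vocab_to_space_separated() {
    local vocab_in="$1" dict_out="$2"
    awk -F'\t' '{ print $1 " " $2 }' "$vocab_in" > "$dict_out"
}

# Split an encoded corpus into train/valid/test files by line count:
# the last n_valid + n_test lines are held out, the rest is training data.
split_encoded_corpus() {
    local encoded="$1" out_dir="$2" n_valid="$3" n_test="$4"
    local total n_train
    total=$(wc -l < "$encoded")
    n_train=$((total - n_valid - n_test))
    head -n "$n_train" "$encoded" > "$out_dir/train.bpe"
    tail -n "$((n_valid + n_test))" "$encoded" | head -n "$n_valid" > "$out_dir/valid.bpe"
    tail -n "$n_test" "$encoded" > "$out_dir/test.bpe"
}

# Example usage (paths and sizes are placeholders):
#   spm_vocab_to_space_separated spm.bpe.vocab dict.txt
#   split_encoded_corpus file.bpe . 5000 5000
```

A split by line count can cut a document in the middle; if your corpus has document boundaries, splitting on those is probably the safer choice.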

We chose TOTAL_UPDATES based on page 6 of the RoBERTa paper. WARMUP_UPDATES was 10% of TOTAL_UPDATES, and the total batch size was 2k, but that depends on how many GPUs you have and what kind they are. We had a batch size of 256 per GPU (16 MAX_SENTENCES x 16 UPDATE_FREQ) on 8 GPUs, so 256 x 8 = 2048.
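
To make the batch-size arithmetic explicit, here is a small sketch using the numbers from this reply. The variable NUM_GPUS and the helper warmup_updates are names introduced only for this example, and the TOTAL_UPDATES value is a hypothetical placeholder, since the actual number is not given in the thread.

```shell
#!/usr/bin/env bash
# Values taken from the reply above.
MAX_SENTENCES=16   # sequences per forward/backward pass on one GPU
UPDATE_FREQ=16     # gradient accumulation steps
NUM_GPUS=8

PER_GPU_BATCH=$((MAX_SENTENCES * UPDATE_FREQ))   # 16 * 16 = 256
TOTAL_BATCH=$((PER_GPU_BATCH * NUM_GPUS))        # 256 * 8 = 2048

# Warmup as 10% of the total number of updates (integer division).
warmup_updates() {
    echo $(( $1 / 10 ))
}

# Hypothetical placeholder, not the authors' value.
TOTAL_UPDATES=100000
WARMUP_UPDATES=$(warmup_updates "$TOTAL_UPDATES")

echo "per-GPU batch: $PER_GPU_BATCH, total batch: $TOTAL_BATCH, warmup: $WARMUP_UPDATES"
```

With fewer GPUs, the same total batch size can be kept by raising UPDATE_FREQ proportionally, for example 32 on 4 GPUs.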
Thanks

Closing the issue, but feel free to reopen it.

Hi @simonefrancia, I have another question about data preparation. The original fairseq tutorial is based on WikiText-103; a sample is below:

 = Robert Boulter =

 Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed by a starring role in the play Herons written by Simon Stephens , which was performed in 2001 at the Royal Court Theatre . He had a guest role in the television series Judge John Deed in 2002 . 

As you can see, the text was preprocessed. It was tokenized, and each token is surrounded by spaces (even the period at the end of each sentence has a space before it).
But when you use BPE this doesn't make much sense, right?
Do you preprocess the text (file.raw in your example) in this way before training the model?

Hi @ksopyla,
in general this preprocessing is not necessary, because during the SentencePiece training phase the algorithm itself learns how to split the text so as to make the best use of the vocabulary size you chose at the beginning.
This is the strength of data-driven tokenizers: they do not rely on static rules but on your data, and with a lot of data they will probably learn a better segmentation.
So I think you can leave the text data in its original format.

My intuition was exactly what you describe; thanks for the confirmation.

Hi @simonefrancia, I'm training a RoBERTa model from scratch following this description (my dataset is 6 GB). The MLM loss keeps decreasing up to 50k steps, but when I evaluate the model on NER, the F1 score starts to decline progressively after 16k steps. What could be causing this strange behavior?