allenai/sequential_sentence_classification

Incorrect [SEP] token ID

Closed this issue · 6 comments

Hey!
I have noticed that in model.py you obtain the [SEP] mask with this code:

if self.use_sep:
    # The following code collects vectors of the SEP tokens from all the examples in the batch,
    # and arranges them in one list. It does the same for the labels and confidences.
    # TODO: replace 103 with '[SEP]'
    sentences_mask = sentences['bert'] == 103  # mask for all the SEP tokens in the batch
    embedded_sentences = embedded_sentences[sentences_mask]  # given batch_size x num_sentences_per_example x sent_len x vector_len
                                                             # returns num_sentences_per_batch x vector_len

Is 103 the correct value? I looked at the vocab.txt from the original BERT model and it seems that the [SEP] token has an ID of 102 (since the vocab file is 0-indexed). Is this a mistake, or do you use a different vocab file than the original one?
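To be concrete, this is roughly how I checked it (just reading a local copy of the original BERT vocab.txt directly; the path is whatever your copy is):

import pathlib

# vocab.txt has one token per line; the token id is the 0-indexed line number
vocab = {tok: i for i, tok in enumerate(pathlib.Path("vocab.txt").read_text(encoding="utf-8").splitlines())}
print(vocab["[SEP]"])  # 102 in the original BERT vocab
print(vocab["[CLS]"])  # 101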

Best,
Kacper

We are using SciBERT with SciVocab. In the SciBERT vocabulary the SEP token id is 103, but in the original BERT vocabulary it is 102. As you noticed, we currently hardcode this; a better solution would be to get the SEP token id from the vocabulary.
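For example, something along these lines would avoid the hardcoded constant (just a sketch using the HuggingFace tokenizer API, which is not how model.py is wired up today):

from transformers import AutoTokenizer

# look the SEP id up from the vocabulary the model was actually trained with
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
sep_token_id = tokenizer.sep_token_id  # 103 for SciVocab, 102 for the original BERT vocab

# model.py would then use the looked-up id instead of 103:
# sentences_mask = sentences['bert'] == sep_token_id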

Okay great, thanks for letting me know. I will close this issue since it was more of a question; hopefully it will help others too :)

Hi, does this mean that if I want to use a RoBERTa-type model, your code is not directly usable?

In RoBERTa the vocabulary comes from the SentencePiece package and is adapted to the training data so that there are no unknown tokens, which means there is no fixed vocab file.

I wanted to use your code with the French BERT, but it is based on RoBERTa.

This codebase currently only supports BERT/SciBERT, and we haven't really tested it with other transformers.

If you want to use other transformers, some changes need to be made to the config, in addition to changing the SEP token id mentioned above: both the tokenizer and the pretrained transformer are specified there.
I think the AllenNLP version also needs to be upgraded to a later release that supports RoBERTa. That may break some functionality in the current codebase and need some debugging to fix. A PR would be appreciated if you end up going that route.
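As a rough, untested sketch of what the replacement pieces could look like with a newer AllenNLP (class names from AllenNLP >= 1.0, which is not what this repo currently pins):

from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.modules.token_embedders import PretrainedTransformerEmbedder

model_name = "roberta-base"  # or a French RoBERTa-style model

# tokenizer / token indexer that would replace the BERT-specific ones in the config
tokenizer = PretrainedTransformerTokenizer(model_name=model_name)
indexer = PretrainedTransformerIndexer(model_name=model_name)

# pretrained transformer used inside the model
embedder = PretrainedTransformerEmbedder(model_name=model_name)

# and the SEP id would be read from the underlying HuggingFace tokenizer instead of 103
sep_token_id = tokenizer.tokenizer.sep_token_id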

Thank you for your prompt answer. I will do more digging, and I will be sure to let you know if I get anywhere!

Hi @armancohan ,
I adapted your code to AllenNLP 2.0. Quite a lot of changes were required. I cannot get it to reproduce your results exactly, but I managed to get close by playing with the params a bit. I will send you a PR with more details soon.

In theory it would be applicable to RoBERTa or other transformers. I was wondering if you have any suggestions about which token to use when there is more than one intermediate token.

RoBERTa uses two intermediate tokens when embedding multiple sentences:

<s> token token token </s></s> token token </s></s> token token token </s>

I tested with the first of the intermediate tokens plus the closing token, and it seems promising. It poses a problem for batch sizes > 1, as I didn't find a way to construct a mask that selects only one of the two </s></s> tokens across a batch.
If I take both </s> tokens, they will not correspond to the number of sentences.
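One rough idea would be to keep only the first </s> of each consecutive pair, something like the sketch below, but I have not verified it beyond this toy input (ids here are RoBERTa-style: <s>=0, pad=1, </s>=2):

import torch

def first_of_consecutive_sep(token_ids: torch.Tensor, sep_id: int) -> torch.Tensor:
    # token_ids: (batch, seq_len); returns a boolean mask that is True at a </s>
    # position only when the previous position is not also </s>, so each
    # sentence boundary is counted exactly once.
    is_sep = token_ids == sep_id
    prev_is_sep = torch.zeros_like(is_sep)
    prev_is_sep[:, 1:] = is_sep[:, :-1]
    return is_sep & ~prev_is_sep

# toy batch: <s> t t t </s></s> t t </s></s> t t t </s> <pad>
ids = torch.tensor([[0, 5, 5, 5, 2, 2, 5, 5, 2, 2, 5, 5, 5, 2, 1]])
mask = first_of_consecutive_sep(ids, sep_id=2)
print(mask.sum().item())  # 3 -> one selected position per sentence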

Any input on this is welcome.