Wrong pieces for control symbols after loading SentencepieceProcessor from official model
JanKaul opened this issue · 2 comments
I'm trying to use ALBERT for a question answering task, so I want to encode my input text with sentencepiece to use it as input for the ALBERT model. I initialize the sentencepiece model by loading the model file from one of the official tar files. The encoding seems to work fine except for the control symbols: the input [CLS] gets encoded into three pieces, whereas the expected behavior would be a single piece. The same happens for [SEP].
Here is an example:
import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file='albert/albert_base_v2/albert_base/30k-clean.model')
print(sp.encode('[CLS]', out_type=str))
Output:
['▁[', 'CLS', ']']
Am I doing something wrong? Is there a way to specify the control symbols without retraining the model? I would like to avoid training the model myself every time I load it. I would really appreciate your help. Thank you.
@JanKaul I see the same behavior in my experiments, and I don't know why that is the case. Maybe in the near future I'll train the spm model myself from source.
I might have an idea why that's the case. You have to distinguish between user-defined symbols and control symbols. According to the documentation:
- user defined symbols: Always treated as one token in any context. These symbols can appear in the input sentence.
- control symbol: We only reserve ids for these tokens. Even if these tokens appear in the input text, they are not handled as one token. User needs to insert ids explicitly after encoding.
I think [CLS] and [SEP] are added as control symbols, so their ids have to be inserted manually after encoding.
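To make that concrete, here is a small workaround sketch: look up the reserved ids once with `piece_to_id` and wrap them around the encoded ids yourself. The model path is the one from the question; the wrapper function name is my own:

```python
# Workaround: control symbols only reserve ids, so insert those ids manually
# around the encoded sentence instead of putting [CLS]/[SEP] in the text.

def add_control_ids(ids, cls_id, sep_id):
    """Prepend the [CLS] id and append the [SEP] id to already-encoded ids."""
    return [cls_id] + list(ids) + [sep_id]

# With the model from the question (assumes sentencepiece is installed):
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor(
#       model_file='albert/albert_base_v2/albert_base/30k-clean.model')
#   cls_id = sp.piece_to_id('[CLS]')
#   sep_id = sp.piece_to_id('[SEP]')
#   ids = add_control_ids(sp.encode('hello world'), cls_id, sep_id)

# Toy demonstration with made-up ids:
print(add_control_ids([10, 11, 12], cls_id=2, sep_id=3))  # [2, 10, 11, 12, 3]
```

This avoids retraining entirely: the ids for [CLS] and [SEP] already exist in the official model, they just never get produced by `encode`.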