huggingface/tokenizers

How to add some new special tokens to a pretrained tokenizer?

ky941122 opened this issue · 27 comments

Hi guys. I want to add some new special tokens like [XXX] to a pretrained ByteLevelBPETokenizer, but I can't find out how to do this in Python.

Looks like you can add special tokens to the tokenizer via:

tokenizer = ByteLevelBPETokenizer(...)  
tokenizer.add_special_tokens(["[XXX]"])
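
A quick sanity check of what that does, as a minimal sketch (the vocab.json/merges.txt paths are placeholders for your own pretrained files, not something from this thread):

from tokenizers import ByteLevelBPETokenizer

# Placeholder paths: point these at your own pretrained vocab/merges files.
tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt")

added = tokenizer.add_special_tokens(["[XXX]"])
print(added)  # 1 if the token was not already in the vocab

# The special token is kept as a single piece and never split by the BPE model.
enc = tokenizer.encode("hello [XXX] world")
print(enc.tokens)
print(tokenizer.token_to_id("[XXX]"))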


Thanks for replying, but this method seems to expand the vocab. I also use a pretrained roberta with this tokenizer in transformers; how can I make the pretrained model's embedding layer match the tokenizer after adding some tokens?

If you're using a pretrained roberta model, it will only work on the tokens it recognizes in its internal set of embeddings, each paired to a given token id (which you can get from the pretrained tokenizer for roberta in the transformers library). I don't see any reason to use a different tokenizer on a pretrained model other than the one provided by the transformers library. I guess I'm confused about what you're trying to achieve.

n1t0 commented

There is no way to add a new token to a tokenizer without changing the vocabulary. Otherwise, as @jaymody pointed out, .add_special_tokens is the way to go.

We can add a special token to a tokenizer, but how do we expand the embedding layer of the pretrained model for the new token? It could be randomly initialized and optionally finetuned.

Ar9av commented

Following

Following

Try this and give feedback:

special_tokens_dict = {'additional_special_tokens': ['[C1]','[C2]','[C3]','[C4]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))
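
For context, here is a self-contained version of that recipe (roberta-base is only an example checkpoint chosen here, not something prescribed in the thread):

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Example checkpoint; substitute whichever pretrained model/tokenizer pair you actually use.
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModelForMaskedLM.from_pretrained('roberta-base')

special_tokens_dict = {'additional_special_tokens': ['[C1]', '[C2]', '[C3]', '[C4]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

# resize_token_embeddings takes the new total vocabulary size (len(tokenizer));
# it copies the pretrained rows and appends randomly initialized rows for the new tokens.
model.resize_token_embeddings(len(tokenizer))
print(num_added_toks, len(tokenizer), model.get_input_embeddings().weight.shape)
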
n1t0 commented

Also, feel free to ask on https://discuss.huggingface.co which is more suited for general questions and discussions.

Ar9av commented

Try this and give feedback:

special_tokens_dict = {'additional_special_tokens': ['[C1]','[C2]','[C3]','[C4]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))

Thanks, but how are the new embeddings initialised? Are they random? Won't that be wrong?

They are random and you can set them as you like or finetune.
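
If you want something less arbitrary than pure random init, one common heuristic is to start the new rows at the mean of the pretrained ones. A minimal sketch, assuming model, tokenizer and num_added_toks come from the resize recipe above:

import torch

with torch.no_grad():
    emb = model.get_input_embeddings().weight        # (vocab_size, hidden_size)
    mean_vec = emb[:-num_added_toks].mean(dim=0)     # average of the pretrained rows
    emb[-num_added_toks:] = mean_vec                 # overwrite the freshly appended rows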

@djstrong Just a question about adding special tokens as above: would it change the embedding values of the existing words that are in the pretrained tokenizer?

Just trying to make sure that it's only the new tokens' values that are randomly initialised.

No, it would not change the embeddings in the pretrained model.
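
You can verify this directly: resize_token_embeddings copies the existing rows over unchanged and only appends new ones. A small check, with roberta-base as an example checkpoint:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

old = model.get_input_embeddings().weight.detach().clone()
tokenizer.add_special_tokens({"additional_special_tokens": ["[C1]"]})
model.resize_token_embeddings(len(tokenizer))
new = model.get_input_embeddings().weight.detach()

# The pretrained rows are identical; only the appended row is freshly initialized.
print(torch.equal(old, new[: old.size(0)]))  # expected: True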

So, if I finetune a model, the embedding of the new token will be learned and those of the old tokens will stay the same? @sachinruk

No, when you finetune, any embedding can change by definition. But the amount of change will depend on your dataset size.
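
If you do want the pretrained rows to stay put while only the new tokens are learned, one option is to zero out their gradients. A rough sketch, assuming model and num_added_toks come from the recipe above (note that optimizers with weight decay can still move the masked rows slightly):

embedding = model.get_input_embeddings()
n_old = embedding.weight.size(0) - num_added_toks   # rows that existed before resizing

def zero_grad_for_old_rows(grad):
    grad = grad.clone()
    grad[:n_old] = 0.0                               # no updates for the pretrained rows
    return grad

embedding.weight.register_hook(zero_grad_for_old_rows)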

Hi all,

I have tried the method above, and it worked! I want to ask one more question: when I reload my model from my checkpoint, it gives me a size mismatch error. Does someone know how to fix this? Thank you!

@AlafateABULIMITI Can you provide more context about this error itself, and ideally some minimal example to reproduce?

@Narsil, Yes.
Actually, I train one model and reload it inside another model structure, in 2 separate steps. In my training set (a dialogue dataset) there are some special tokens (speaker_ids) that I need to add to the tokenizer (I add 2 tokens here). I did exactly what is mentioned above:

special_tokens_dict = {"additional_special_tokens": special_tokens}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

and after I initialize my model, I resize the token embeddings with:

model = Generator(lr=lr, model_type=model_type)
model.bart.resize_token_embeddings(len(tokenizer))

I trained this Generator successfully, but when I tried to load it into another model structure like:

class Model(pl.LightningModule):
    def __init__(self, gen_checkpoint_path: str):
        super().__init__()
        self.generator = Generator.load_from_checkpoint(
            gen_checkpoint_path, lr=self.lr, model_type=self.model_type,
        )

it gave me the mismatch error:


Error(s) in loading state_dict for Generator:
    size mismatch for model.final_logits_bias: copying a param with shape torch.Size([1, 50267]) from checkpoint, the shape in current model is torch.Size([1, 50265]).
    size mismatch for model.shared.weight: copying a param with shape torch.Size([50267, 768]) from checkpoint, the shape in current model is torch.Size([50265, 768]).

The size mismatch is exactly 2, which is the number of special tokens I added. I use pytorch-lightning, by the way. Thank you in advance.

I am sorry, but your code is missing some steps to reproduce. When are you actually saving, before or after resize_token_embeddings?

I did

from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained(
    "hf-internal-testing/tiny-random-bart"
)
# embeddings are size 1000
model.resize_token_embeddings(2000)
# Now they are of size 2000
model.save_pretrained("./test")
model2 = BartForConditionalGeneration.from_pretrained("./test")
# This works fine and model2 indeed has an embedding matrix of size 2000.

As I use pytorch-lightning for training, my code for training and saving the model is organized like this:

gen_ckpt = ModelCheckpoint(
    monitor="val_loss",
    mode="min",
    save_top_k=1,
    filename="val_loss_{val_loss:.2f}_epoch_{epoch:02d}",
    auto_insert_metric_name=False,
)
generator_trainer = Trainer(
    gpus=-1,
    precision=16,
    logger=tb_logger,
    max_epochs=epoch,
    callbacks=[gen_ckpt],
)

Actually, the Trainer will automatically save the best checkpoint directly, so I skipped the save_pretrained step, and I tried to reload the model the PyTorch way with:

self.generator = Generator(lr=self.lr, model_type=self.model_type)
gen_ckpt = torch.load(gen_checkpoint_path)
self.generator.load_state_dict(gen_ckpt["state_dict"])

It didn't work. Maybe I can look at how to manually save the model with save_pretrained.
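
For what it's worth, the shapes in the error (50267 in the checkpoint vs 50265 in the fresh model) suggest the newly constructed Generator still has the original embedding size. A sketch of one way to make the shapes match before loading the state dict, assuming the same tokenizer setup and the .bart attribute from your code above:

import torch
from transformers import AutoTokenizer

# Rebuild the tokenizer with the same special tokens, then resize the fresh model
# so its embedding matrix matches the checkpoint before load_state_dict.
tokenizer = AutoTokenizer.from_pretrained(model_type)
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})

generator = Generator(lr=lr, model_type=model_type)
generator.bart.resize_token_embeddings(len(tokenizer))   # now 50267 rows, like the checkpoint

gen_ckpt = torch.load(gen_checkpoint_path)
generator.load_state_dict(gen_ckpt["state_dict"])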

Maybe I can look at how to manually save the model with save_pretrained.

Maybe. Sorry, I don't really know pytorch-lightning; I have used it in the past and it's a great library, but I have the intuition that something is not happening correctly in the interaction between pytorch-lightning and transformers here.

Btw, next time don't hesitate to open a new issue with a reference to the old one; it helps make issues more searchable.

Cheers.

Thank you @Narsil !

Try this and give feedback:

special_tokens_dict = {'additional_special_tokens': ['[C1]','[C2]','[C3]','[C4]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))

This is misleading. It doesn't add them, it overwrites them, afaik.

All I want to do is add all the standard special tokens in case they aren't there, e.g. the <s> sep token is not in t5 for some reason.

I wrote this, but I'm not super happy with it.

from transformers import AutoTokenizer, PreTrainedTokenizerFast


def does_t5_have_sep_token():
    tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained('t5-small')
    assert isinstance(tokenizer, PreTrainedTokenizerFast)
    print(tokenizer)
    print(f'{len(tokenizer)=}')
    # print(f'{tokenizer.all_special_tokens=}')
    print(f'{tokenizer.sep_token=}')
    print(f'{tokenizer.eos_token=}')
    print(f'{tokenizer.all_special_tokens=}')

    special_tokens_dict = {'additional_special_tokens': ['<bos>', '<cls>', '<s>'] + tokenizer.all_special_tokens }
    num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

    print(f'{tokenizer.sep_token=}')
    print(f'{tokenizer.eos_token=}')
    print(f'{tokenizer.all_special_tokens=}')



if __name__ == '__main__':
    does_t5_have_sep_token()
    print('Done\a')
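
If the goal is just to make sure a sep token exists, a shorter route is to register it under the sep_token key, which both adds the token (if missing) and sets tokenizer.sep_token. A sketch, with '<sep>' as an arbitrary choice of string:

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

if tokenizer.sep_token is None:
    tokenizer.add_special_tokens({'sep_token': '<sep>'})
    # T5's embedding matrix is padded beyond the tokenizer length, so only grow it if needed.
    if len(tokenizer) > model.get_input_embeddings().weight.size(0):
        model.resize_token_embeddings(len(tokenizer))

print(tokenizer.sep_token, tokenizer.sep_token_id)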

@djstrong hi, can you show me how to set them to a constant?

They are random and you can set them as you like or finetune.
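
To set the newly added rows to a constant at initialization, something like the following should work, assuming model and tokenizer come from the resize recipe above and '[C1]', '[C2]' are the added tokens. Note this only fixes the starting values; to keep them constant during training you would also need to mask their gradients (e.g. with a hook like the one sketched earlier in the thread).

import torch

with torch.no_grad():
    emb = model.get_input_embeddings().weight
    for tok in ['[C1]', '[C2]']:
        idx = tokenizer.convert_tokens_to_ids(tok)
        emb[idx] = 0.01                  # fill the whole embedding row with one constant value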

No need to add anything. Just change the vocab.txt and set the "additional_special_tokens" arg in BertTokenizer.from_pretrained().

This way, you do not need to resize the model's embedding layer; you just reuse the [unused*] tokens.

Ref: https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.from_pretrained
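
Concretely, the idea is to rename some [unused*] lines in vocab.txt to your own tokens (or just reuse them as-is) and register them as additional special tokens, so they map to ids that already exist in the embedding matrix. A sketch with bert-base-uncased as an example checkpoint:

from transformers import BertTokenizer

# [unused1] and [unused2] are already in BERT's vocab.txt, so no new rows are added
# and the embedding matrix keeps its original size.
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-uncased",
    additional_special_tokens=["[unused1]", "[unused2]"],
)
print(tokenizer.convert_tokens_to_ids(["[unused1]", "[unused2]"]))  # existing ids, no resize needed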


Thank you ^^. I have found it. But I am looking for a way to keep the representations of the special tokens constant after training on one task.

I am working on a BERT model for relation extraction problems in a continual learning context. My model consists of 2 main blocks: one is the BERT encoder and one is the projection head.

The pairs (E11, E12) and (E21, E22) wrap around the entities.
I want the representations of these 4 tokens, after going through the BERT encoder block, to be exactly the same across tasks. How should I do it? Thanks.