How to add some new special tokens to a pretrained tokenizer?
ky941122 opened this issue · 27 comments
Hi guys. I want to add some new special tokens like [XXX] to a pretrained ByteLevelBPETokenizer, but I can't find how to do this in Python.
Looks like you can add special tokens to the tokenizer via:
tokenizer = ByteLevelBPETokenizer(...)
tokenizer.add_special_tokens(["[XXX]"])
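A minimal runnable sketch of the same call (the vocab/merges paths below are placeholders, not from this thread):
from tokenizers import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt")
# add_special_tokens returns how many tokens were actually added
num_added = tokenizer.add_special_tokens(["[XXX]"])
print(num_added, tokenizer.token_to_id("[XXX]"))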
Thanks for replying, but this method seems to expand the vocab. I also use a pretrained roberta with this tokenizer in transformers; how can I make the pretrained model's embedding layer match the tokenizer after adding some tokens?
If you're using a pretrained roberta model, it will only work on the tokens it recognizes in its internal set of embeddings, each of which is paired to a given token id (which you can get from the pretrained tokenizer for roberta in the transformers library). I don't see any reason to use a tokenizer other than the one provided by the transformers library on a pretrained model. I guess I'm confused about what you're trying to achieve.
There is no way to add a new token to a tokenizer without changing the vocabulary. Otherwise, as @jaymody pointed out, .add_special_tokens is the way to go.
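As a quick illustration of the vocabulary growing (the model name here is just an example):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
print(len(tokenizer))  # original vocabulary size
tokenizer.add_special_tokens({"additional_special_tokens": ["[XXX]"]})
print(len(tokenizer))  # one larger than before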
We can add a special token to a tokenizer. How do we expand the embedding layer of the pretrained model for the new token? It can be randomly initialized and optionally finetuned.
Following
Following
Try this and give feedback:
special_tokens_dict = {'additional_special_tokens': ['[C1]','[C2]','[C3]','[C4]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))
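Putting the whole recipe together, a sketch assuming a BERT checkpoint (the model name is just an example):
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
special_tokens_dict = {'additional_special_tokens': ['[C1]','[C2]','[C3]','[C4]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
# the embedding matrix has to grow to the new vocabulary size,
# otherwise the ids of the added tokens would index out of range
model.resize_token_embeddings(len(tokenizer))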
Also, feel free to ask on https://discuss.huggingface.co which is more suited for general questions and discussions.
Thanks, but how are the new embeddings initialised? Are they random? Won't that be wrong?
They are random and you can set them as you like or finetune.
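If you do want to set them yourself instead of leaving them random, a sketch (assuming the model and num_added_toks from the snippet above, with num_added_toks > 0):
import torch
with torch.no_grad():
    emb = model.get_input_embeddings().weight  # shape: (vocab_size, hidden_dim)
    # e.g. initialize each newly added row to the mean of the pretrained rows
    emb[-num_added_toks:] = emb[:-num_added_toks].mean(dim=0, keepdim=True)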
@djstrong Just a question about adding special tokens as above: would it change the embedding values of the existing words that are in the pretrained tokenizer?
Just trying to make sure that it's only the new values that are randomly initialised.
No, it would not change the embeddings in the pretrained model.
So, If I finetune a model, then the embedding of the new token will be learned and of the old tokens will stay the same? @sachinruk
No, when you finetune, any embedding can change by definition. But the "amount" of change will depend on your dataset size.
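If you want the pretrained rows to stay fixed while only the new rows train, one option is to zero out their gradients with a hook; a sketch, again assuming num_added_toks new tokens were appended at the end of the embedding matrix:
emb = model.get_input_embeddings().weight
def zero_pretrained_grad(grad):
    grad = grad.clone()
    grad[:-num_added_toks] = 0.0  # no update for the original vocabulary rows
    return grad
emb.register_hook(zero_pretrained_grad)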
Hi all,
I have tried the method above and it worked! I want to ask one more question: when I reload my model from my checkpoint, it gives me a size mismatch error. Does someone know how to fix this? Thank you!
@AlafateABULIMITI Can you provide more context about this error itself, and ideally a minimal example to reproduce it?
@Narsil, Yes.
Actually, I train one model and reload it in another model structure, in 2 separate steps. In my training set (a dialogue dataset), there are some special tokens (speaker ids) that I need to add to the tokenizer (I add 2 tokens here). I did exactly what is mentioned above:
special_tokens_dict = {"additional_special_tokens": special_tokens}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
and after initializing my model, I resize the token embeddings with:
model = Generator(lr=lr, model_type=model_type)
model.bart.resize_token_embeddings(len(tokenizer))
I trained this Generator successfully, but when I tried to load it in another model structure like:
class Model(pl.LightningModule):
    def __init__(self, gen_checkpoint_path: str):
        super().__init__()
        self.generator = Generator.load_from_checkpoint(
            gen_checkpoint_path, lr=self.lr, model_type=self.model_type,
        )
it gave me the mismatch error:
Error(s) in loading state_dict for Generator:
    size mismatch for model.final_logits_bias: copying a param with shape torch.Size([1, 50267]) from checkpoint, the shape in current model is torch.Size([1, 50265]).
    size mismatch for model.shared.weight: copying a param with shape torch.Size([50267, 768]) from checkpoint, the shape in current model is torch.Size([50265, 768]).
The mismatch size is exactly 2, which is the number of special tokens I added. I use pytorch-lightning, by the way. Thank you in advance.
I am sorry, but your code is missing some steps to reproduce. When are you actually saving, before or after resize_token_embeddings?
I did
from transformers import BartForConditionalGeneration
model = BartForConditionalGeneration.from_pretrained(
"hf-internal-testing/tiny-random-bart"
)
# embeddings are size 1000
model.resize_token_embeddings(2000)
# Now they are of size 2000
model.save_pretrained("./test")
model2 = BartForConditionalGeneration.from_pretrained("./test")
# This works fine and model2 indeed has an embedding matrix of size 2000.
As I used pytorch-lightning for the training, my code is organized like this for training and saving the model:
gen_ckpt = ModelCheckpoint(
monitor="val_loss",
mode="min",
save_top_k=1,
filename="val_loss_{val_loss:.2f}_epoch_{epoch:02d}",
auto_insert_metric_name=False,
)
generator_trainer = Trainer(
gpus=-1,
precision=16,
logger=tb_logger,
max_epochs=epoch,
callbacks=[gen_ckpt],
)
Actually, the Trainer will automatically save the best checkpoint directly, so I skipped the save_pretrained step, and I tried to reload the model the pytorch way with:
self.generator = Generator(lr=self.lr, model_type=self.model_type)
gen_ckpt = torch.load(gen_checkpoint_path)
self.generator.load_state_dict(gen_ckpt["state_dict"])
It didn't work. Maybe I can look at how to manually save the model with save_pretrained.
Maybe. Sorry, I don't really know pytorch-lightning; I have used it in the past and it's a great library, but I have the intuition that something is not happening correctly in the interaction between pytorch-lightning and transformers here.
Btw, next time don't hesitate to open a new issue with a reference to the old one; it helps make issues more searchable.
Cheers.
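For reference, one way around this kind of size mismatch is to resize the freshly built model to the checkpoint's vocabulary size before loading the weights; a sketch, assuming the tokenizer with the added tokens and the bart attribute from the code above are available:
self.generator = Generator(lr=self.lr, model_type=self.model_type)
# make the embedding matrix match the checkpoint (50265 + 2 added tokens) first
self.generator.bart.resize_token_embeddings(len(tokenizer))
gen_ckpt = torch.load(gen_checkpoint_path)
self.generator.load_state_dict(gen_ckpt["state_dict"])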
Thank you @Narsil !
Try this and give feedback:
special_tokens_dict = {'additional_special_tokens': ['[C1]','[C2]','[C3]','[C4]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))
This is misleading. It doesn't add them, it overwrites them afaik.
All I want to do is add all standard special tokens in case they aren't there, e.g. the <s> sep token is not in t5 for some reason.
I wrote this but I'm not super happy with it:
from transformers import AutoTokenizer, PreTrainedTokenizerFast

def does_t5_have_sep_token():
    tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained('t5-small')
    assert isinstance(tokenizer, PreTrainedTokenizerFast)
    print(tokenizer)
    print(f'{len(tokenizer)=}')
    # print(f'{tokenizer.all_special_tokens=}')
    print(f'{tokenizer.sep_token=}')
    print(f'{tokenizer.eos_token=}')
    print(f'{tokenizer.all_special_tokens=}')

    special_tokens_dict = {'additional_special_tokens': ['<bos>', '<cls>', '<s>'] + tokenizer.all_special_tokens}
    num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

    print(f'{tokenizer.sep_token=}')
    print(f'{tokenizer.eos_token=}')
    print(f'{tokenizer.all_special_tokens=}')

if __name__ == '__main__':
    does_t5_have_sep_token()
    print('Done\a')
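A slightly tighter variant of the same idea, assuming t5-small and that the goal is just to guarantee a sep token exists (the <sep> string is an arbitrary choice):
from transformers import AutoTokenizer, T5ForConditionalGeneration
tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
if tokenizer.sep_token is None:
    # registers <sep> as the sep token and adds it to the vocabulary
    tokenizer.add_special_tokens({'sep_token': '<sep>'})
    model.resize_token_embeddings(len(tokenizer))
print(tokenizer.sep_token, tokenizer.sep_token_id)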
@djstrong hi, can you show me how to set them to a constant?
They are random and you can set them as you like or finetune.
No need to add anything. Just change the vocab.txt and set the "additional_special_tokens" arg in BertTokenizer.from_pretrained().
In this way, you do not need to resize the model embedding layer; just utilize the [unused*] tokens.
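A sketch of that approach, assuming a BERT vocab that already contains [unused1], [unused2], ... (replacing those entries in vocab.txt with your own strings works the same way):
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    additional_special_tokens=['[unused1]', '[unused2]'],
)
# these ids already exist in the pretrained embedding matrix, so no resize is needed
print(tokenizer.convert_tokens_to_ids(['[unused1]', '[unused2]']))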
Thank you ^^. I have found it. But I am looking for a way to keep the representations of the special tokens constant after training on one task.
I am working on a BERT model for relation extraction problems in a continual learning context. My model consists of 2 main blocks: one is the BERT encoder and one is the projection head.
The pairs (E11, E12) and (E21, E22) wrap the entities.
I want the representations of the above 4 tokens, after going through the BERT encoder block, to stay exactly the same across each task. How should I do it? Thanks.