minyang-chen/tinyllama_colorist

Can't I use "additional tokens"?


Hello,
Thanks for the amazing guide on fine-tuning TinyLlama. In my task I would like to introduce a couple of extra tokens into my prompt, such as:

You are a color expert... based on the multiple color descriptions, make a new one

These are the colors
"""
<|COL-1|> COL DESCRIPTION <|COL-1|>
<|COL-2|> COL DESCRIPTION <|COL-2|>
<|COL-3|> COL DESCRIPTION <|COL-3|>
"""
<|im_end|>

<|im_start|>user
Mix color 1 and 2<|im_end|>

<|im_start|>assistant
COL-1 + COL-2 = COL-12 ;)
<|im_end|>

To add the special tokens...

I have used the following code:

from tokenizers import AddedToken

COL_1 = "<|COL-1|>"
COL_2 = "<|COL-2|>"
COL_3 = "<|COL-3|>"

tokenizer.add_special_tokens({
    "additional_special_tokens": [
        AddedToken(COL_1),
        AddedToken(COL_2),
        AddedToken(COL_3),
    ]
})

tokenizer.all_special_tokens
>>> ['<s>', '</s>', '<unk>', '[PAD]', '<|im_start|>', '<|im_end|>', '<|COL-1|>', '<|COL-2|>', '<|COL-3|>']
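
For context, a quick sanity check is to confirm that each added token encodes to a single ID and that those IDs are still covered by the model's embedding table. This is only a sketch; `model` is assumed to be the TinyLlama checkpoint being fine-tuned:

# Sketch: confirm each added token maps to a single ID, then
# compare the tokenizer's vocab size to the model's embedding table.
for tok in ["<|COL-1|>", "<|COL-2|>", "<|COL-3|>"]:
    print(tok, "->", tokenizer.convert_tokens_to_ids(tok))

print("tokenizer size: ", len(tokenizer))
print("embedding rows: ", model.get_input_embeddings().weight.shape[0])
# If the tokenizer size exceeds the number of embedding rows, lookups
# on the new token IDs will index out of range on the GPU.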

But there's a problem...

When I call trainer.train(), I hit the following error:

... omitted error ...
For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
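
For reference, this kind of device-side assert is the classic symptom of an embedding lookup receiving a token ID beyond the model's embedding table, which is exactly what happens when tokens are added to the tokenizer but the model's embeddings are never resized. A minimal sketch of the usual remedy, assuming `model` is the TinyLlama causal LM passed to the Trainer:

# Sketch: grow the embedding matrix (and tied output head) so it
# covers the IDs of the newly added special tokens.
model.resize_token_embeddings(len(tokenizer))

The freshly added rows are randomly initialized, so the new tokens only become meaningful once the model is fine-tuned on data that contains them.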

When I don't use the special tokens...

I can train the model properly!

This clearly suggests that the problem lies in adding the special tokens. I have also followed this advice on StackOverflow here


Could you please suggest what I should do?
Thanks!