Can't I use "additional tokens"?
AayushSameerShah commented
Hello,
Thanks for the amazing guide on fine-tuning TinyLlama. In my task, I want to introduce a couple of extra tokens in my prompt, such as:
You are a color expert... based on the multiple color description make a new one
These are the colors
"""
<|COL-1|> COL DESCRIPTION <|COL-1|>
<|COL-2|> COL DESCRIPTION <|COL-2|>
<|COL-3|> COL DESCRIPTION <|COL-3|>
"""
<|im_end|>
<|im_start|>user
Mix color 1 and 2<|im_end|>
<|im_start|>assistant
COL-1 + COL-2 = COL-12 ;)
<|im_end|>
To add the special tokens...
I have used the following code:
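from tokenizers import AddedToken

# Token strings; the names must match the ones used in the list below
COL_1 = "<|COL-1|>"
COL_2 = "<|COL-2|>"
COL_3 = "<|COL-3|>"

# Register the new tokens as additional special tokens on the tokenizer
tokenizer.add_special_tokens({
    "additional_special_tokens": [
        AddedToken(COL_1),
        AddedToken(COL_2),
        AddedToken(COL_3),
    ]
})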
tokenizer.all_special_tokens
>>> ['<s>', '</s>', '<unk>', '[PAD]', '<|im_start|>', '<|im_end|>', '<|COL-1|>', '<|COL-2|>', '<|COL-3|>']
But there is a problem...
When I call trainer.train(), I get the following error:
... omitted error ...
For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
When I don't use the special tokens...
I can train the model properly!
This strongly suggests that the problem comes from adding the special tokens. I have also followed this advice on Stack Overflow here.
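One thing that may be worth checking, as an assumption rather than a confirmed diagnosis: CUDA errors of this kind often appear when a token ID has no matching row in the model's embedding matrix, which can happen if tokens are added to the tokenizer but the model's embeddings are not resized. A minimal sketch of that check follows; the helper name out_of_range_ids and the numbers are hypothetical and not taken from my actual run.

```python
def out_of_range_ids(token_ids, num_embeddings):
    """Return the token IDs that have no row in an embedding matrix of the given size."""
    return [i for i in token_ids if not 0 <= i < num_embeddings]


# Hypothetical numbers for illustration only (not taken from the issue):
# an embedding matrix with 32003 rows, and three newly added tokens that the
# tokenizer appended at the end of its vocabulary.
num_embeddings = 32003
new_token_ids = [32003, 32004, 32005]

bad_ids = out_of_range_ids(new_token_ids, num_embeddings)
print(bad_ids)

# With Hugging Face transformers, the real values would come from:
#   num_embeddings = model.get_input_embeddings().num_embeddings
#   new_token_ids = tokenizer.convert_tokens_to_ids(["<|COL-1|>", "<|COL-2|>", "<|COL-3|>"])
# If any ID is out of range, the usual remedy is to resize before training:
#   model.resize_token_embeddings(len(tokenizer))
```

If the list is non-empty for the real model and tokenizer, the embedding size mismatch would be a plausible cause of the error; if it is empty, the cause is likely elsewhere.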
Could you please suggest what I should do?
Thanks!