Add support for multilingual Viking models, please.
JohnClaw opened this issue · 1 comment
JohnClaw commented
The convert.py script can't make GGUFs for these models:
https://huggingface.co/LumiOpen/Viking-7B
https://huggingface.co/LumiOpen/Viking-13B
https://huggingface.co/LumiOpen/Viking-33B
compilade commented
You should use `convert-hf-to-gguf.py` for most models on HuggingFace; `convert.py` only supports Llama-like models.
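For a supported model, an invocation looks roughly like this (the output filename and `--outtype` value here are illustrative, not taken from this issue; check the script's `--help` in your checkout):

```shell
# Download the model repository locally, then convert it to GGUF.
# Flag names below are assumptions based on common usage of the script.
python convert-hf-to-gguf.py ./Viking-7B --outfile viking-7b-f16.gguf --outtype f16
```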
However, the Viking models use a BPE tokenizer whose pre-tokenizer uses a different regex, which is not yet defined in llama.cpp. From its `tokenizer.json`:
```json
{
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": " ?[^(\\s|[.,!?…。,、।۔،])]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Digits",
        "individual_digits": true
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  }
}
```
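To illustrate one of the steps above: the `Digits` pre-tokenizer with `"individual_digits": true` splits every digit into its own piece before BPE runs. A rough stdlib-only approximation of that single step (not the actual `tokenizers` implementation, which also applies the `Split` regex and byte-level mapping):

```python
import re

def split_digits_individually(piece: str) -> list[str]:
    # Each digit becomes its own piece; runs of non-digits stay together.
    # The alternation tries a single digit first, else a maximal non-digit run.
    return re.findall(r"\d|\D+", piece)

print(split_digits_individually("year 2024"))  # -> ['year ', '2', '0', '2', '4']
```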
This will need to be handled for pre-tokenization to work correctly for this model family. (Conversion currently fails when the pre-tokenizer doesn't match a known one.)
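For context, `convert-hf-to-gguf.py` identifies the pre-tokenizer by encoding a fixed test string and hashing the resulting token IDs; an unknown hash aborts the conversion, which is the failure the Viking models hit. A minimal sketch of that scheme, with a hypothetical hash table and a stand-in encoder (the real table and test string live in `get_vocab_base_pre()` in the script):

```python
import hashlib

# Hypothetical table mapping checksum -> pre-tokenizer name.
KNOWN_PRE_TOKENIZERS: dict[str, str] = {}

def detect_pre_tokenizer(encode, chktxt: str = "Hello 1234 …") -> str:
    # Different pre-tokenizer regexes split the test string differently,
    # so the token IDs (and hence this hash) identify the pre-tokenizer.
    chkhsh = hashlib.sha256(str(encode(chktxt)).encode()).hexdigest()
    if chkhsh not in KNOWN_PRE_TOKENIZERS:
        # This is the error path an unrecognized tokenizer (like Viking's) hits.
        raise NotImplementedError(f"unknown pre-tokenizer, chkhsh = {chkhsh}")
    return KNOWN_PRE_TOKENIZERS[chkhsh]

# Demo with a stand-in encoder: register its hash, then detect it.
fake_encode = lambda s: [ord(c) for c in s]
h = hashlib.sha256(str(fake_encode("Hello 1234 …")).encode()).hexdigest()
KNOWN_PRE_TOKENIZERS[h] = "demo"
print(detect_pre_tokenizer(fake_encode))  # prints "demo"
```

Supporting Viking would mean adding its regex on the llama.cpp side and its checksum to this table.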