ggerganov/llama.cpp

Add support for multilingual Viking models, please.

JohnClaw opened this issue · 1 comments

`convert.py`

You should use `convert-hf-to-gguf.py` for most models on HuggingFace; `convert.py` only supports Llama-like models.

However, the Viking models use a BPE tokenizer whose pre-tokenizer relies on a split regex that llama.cpp does not yet recognize. From the model's `tokenizer.json`:

```json
{
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": " ?[^(\\s|[.,!?…。,、।۔،])]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Digits",
        "individual_digits": true
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  }
}
```

A matching pre-tokenizer regex will need to be added to llama.cpp for pre-tokenization to work correctly for this model family.

(Conversion currently fails when the pre-tokenizer doesn't match one that llama.cpp already knows.)
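The failure mode above can be sketched as follows. This is an illustrative simplification, not llama.cpp's actual code: `convert-hf-to-gguf.py` identifies the pre-tokenizer by hashing the tokenizer's output on a reference string and looking the hash up in a table of known types; an unknown hash aborts conversion. The function name and empty table here are hypothetical stand-ins.

```python
import hashlib

# Placeholder table: maps a hash of a reference tokenization to a
# pre-tokenizer name llama.cpp knows (real entries omitted).
KNOWN_PRE_TOKENIZERS: dict[str, str] = {}

def detect_pre_tokenizer(token_ids: list[int]) -> str:
    """Look up the pre-tokenizer type from a reference tokenization's hash."""
    chkhsh = hashlib.sha256(str(token_ids).encode()).hexdigest()
    pre = KNOWN_PRE_TOKENIZERS.get(chkhsh)
    if pre is None:
        # Mirrors the conversion failure described above: an unrecognized
        # pre-tokenizer (like Viking's) has no entry in the table.
        raise NotImplementedError(
            f"BPE pre-tokenizer was not recognized (chkhsh: {chkhsh})"
        )
    return pre
```

Supporting Viking would mean adding its hash here and the corresponding split regex on the C++ side.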