arcee-ai/mergekit

Broken tokenizer in Yi-34B merge

Closed this issue · 3 comments

I've been trying to merge two Yi-34B based builds using Arcee's hosted mergekit. The merge seems to be successful, with no errors shown, but no matter what tokenizer source I use, the result seems broken and I'm unable to convert to GGUF. I know there used to be a bug related to this, but I thought it was fixed.

This is the most recent YAML I used:

base_model: TeeZee/Kyllene-34B-v1.1
chat_template: auto
dtype: float16
merge_method: ties
models:
- model: TeeZee/Kyllene-34B-v1.1
  parameters:
    density: 0.5
    weight: 0.5
- model: Doctor-Shotgun/Nous-Capybara-limarpv3-34B
  parameters:
    density: 0.5
    weight: 0.5
parameters:
  int8_mask: true
  normalize: false
tokenizer_source: base

Hi! What do you mean by a broken tokenizer? I'm not sure whether mine is broken, but after merging I see tokens like "<|unused115|>" and "<|unused026|>" in the model's output.

I was unable to convert the model to GGUF and quantize it because of an error about token ids being out of range: the merged tokenizer contained tokens numbered 64000 and 64001 when the maximum valid id was 63999.
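You can spot this before attempting GGUF conversion with a quick check over the tokenizer's vocab. This is just a sketch: the helper name and the sample token names below are illustrative, and real data would come from the merged model's tokenizer.json.

```python
# Sketch: find token ids that fall outside the model's vocab size,
# which is what trips up GGUF conversion. The sample data is made up;
# in practice you'd load the vocab and added_tokens from tokenizer.json.

def find_out_of_range_tokens(vocab, added_tokens, max_vocab_size):
    """Return (token, id) pairs whose id is >= max_vocab_size, sorted by id."""
    merged = {**vocab, **added_tokens}
    bad = [(tok, idx) for tok, idx in merged.items() if idx >= max_vocab_size]
    return sorted(bad, key=lambda pair: pair[1])

# Yi-34B's vocab size is 64000, so valid ids are 0..63999.
vocab = {"hello": 12, "world": 13}
added = {"<extra_a>": 64000, "<extra_b>": 64001}  # out of range, like in this issue
print(find_out_of_range_tokens(vocab, added, 64000))
# [('<extra_a>', 64000), ('<extra_b>', 64001)]
```

An empty result means every token id fits inside the embedding matrix.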
I was finally able to fix this problem by redoing the merge with the added parameter "embed_slerp=true".
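In YAML form that fix looked like this for me. A sketch only: I'm assuming embed_slerp sits in the top-level parameters block alongside int8_mask, so check the mergekit docs for the exact placement in your version.

```yaml
parameters:
  int8_mask: true
  normalize: false
  embed_slerp: true   # added: slerp-interpolate embeddings for mismatched vocabs
tokenizer_source: base
```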

I too see a lot of unused tokens in the config, but I don't know if that's anything to worry about. So far, I haven't seen these show up in generated text.

After merging #430 I'm able to run the config you posted and successfully quantize the output model. Please do let me know if this recurs or you run into any similar problems!