Null vocab_file issue with Mistral v0.3-based models when using union tokenizer source
guillermo-gabrielli-fer opened this issue · 2 comments
Environment
Conda environment:
python=3.10
mergekit commit f086664 (latest as of yesterday)
transformers from git @ git+https://github.com/huggingface/transformers 85817d98fb60977c97e3014196a462b732d2ed1a (latest as of yesterday)
The same issue occurs with the transformers version installed by mergekit, which I think is 4.44.
Issue
When merging two models based on the Mistral v0.3 base, the step that saves the base tokenizer to a temporary directory and reloads it ("# HACK: save base tokenizer to temp dir and reload to avoid mutating base_tok") fails to load it back.
Configuration file (these are not the models I was originally trying to merge, but they reproduce the issue):
models:
  - model: mistralai/Mistral-7B-v0.3
  - model: mistralai/Mistral-7B-Instruct-v0.3
merge_method: slerp
base_model: mistralai/Mistral-7B-v0.3
tokenizer:
  source: union
parameters:
  t:
    - value: 0.8
dtype: bfloat16
Originally I was trying to merge the base model with a model that has a custom tokenizer (same vocabulary size, different tokens); I can link that model if needed. However, I hit the same issue with any Mistral v0.3-based model, so the custom tokenizer doesn't appear to be the problem.
Exception:
mergekit-yaml report_issue_mistral.yaml EXAMPLE_MISTRAL_ISSUE/ --out-shard-size 1B --cuda --lazy-unpickle -v
mergekit/mergekit/tokenizer/build.py", line 155, in build_union_tokenizer
res = transformers.AutoTokenizer.from_pretrained(
[......]
transformers/models/llama/tokenization_llama.py", line 201, in get_spm_processor
with open(self.vocab_file, "rb") as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType
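As far as I can tell, the slow LlamaTokenizer ends up being reloaded from a directory that has no tokenizer.model, so its vocab_file resolves to None. A minimal sketch of that failure mode outside mergekit (my reconstruction of the problem, not mergekit's actual code):

import tempfile

import transformers

# Load the fast tokenizer, then re-save it without the legacy sentencepiece file.
tok = transformers.AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
with tempfile.TemporaryDirectory() as p:
    # No tokenizer.model is written when legacy_format=False.
    tok.save_pretrained(p, legacy_format=False, safe_serialization=True)
    # Forcing the slow LlamaTokenizer makes the missing vocab_file visible:
    transformers.AutoTokenizer.from_pretrained(p, use_fast=False)
    # -> TypeError: expected str, bytes or os.PathLike object, not NoneType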
I could get past that error by also saving with legacy_format=True, but then it fails with:
mergekit/mergekit/tokenizer/embed.py", line 62, in execute
token_configs = dict(**self.tokens) or {}
TypeError: dict() argument after ** must be a mapping, not NoneType
I could get the merge to finish by moving the {} fallback inside the dict() call, but I'm not sure yet whether the result is correct.
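For reference, this is the shape of the fallback change I made in mergekit/tokenizer/embed.py; the standalone helper below is only for illustration (it is not mergekit code), but it shows why the fallback has to move inside the parentheses:

from typing import Optional


def token_configs(tokens: Optional[dict]) -> dict:
    # `tokens` plays the role of self.tokens in embed.py, which can be None
    # when the merge config has no explicit `tokens:` section.
    # Original: dict(**tokens) or {} raises, because **None is unpacked
    # before the `or` fallback is ever evaluated.
    return dict(**(tokens or {}))


assert token_configs(None) == {}
assert token_configs({"<pad>": {"source": "base"}}) == {"<pad>": {"source": "base"}}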
I faced the same issue; the change below will probably fix it:
# HACK: save base tokenizer to temp dir and reload to avoid mutating base_tok
with tempfile.TemporaryDirectory() as p:
    base_tok.save_pretrained(p, legacy_format=True, safe_serialization=True)
    res = transformers.AutoTokenizer.from_pretrained(
        p, use_fast=True, trust_remote_code=trust_remote_code
    )
I also found an issue when trying to quantize the resulting model using llama.cpp.
The temp folder in this case is deleted as soon as execution leaves the with block, which results in the loss of the tokenizer vocab file.
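A sketch of the workaround I have in mind (output_dir and the placeholder setup are mine, and I haven't checked this against mergekit's real save path): copy tokenizer.model out of the temporary directory while it still exists, so llama.cpp can later find it next to the merged model.

import os
import shutil
import tempfile

import transformers

# Placeholders for this sketch; in mergekit these come from build_union_tokenizer's
# arguments and from the merge output directory.
base_tok = transformers.AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
trust_remote_code = False
output_dir = "./merged-model"
os.makedirs(output_dir, exist_ok=True)

with tempfile.TemporaryDirectory() as p:
    base_tok.save_pretrained(p, legacy_format=True, safe_serialization=True)
    res = transformers.AutoTokenizer.from_pretrained(
        p, use_fast=True, trust_remote_code=trust_remote_code
    )
    # The sentencepiece vocab only exists inside `p`; keep a copy before the
    # context manager deletes the directory.
    vocab_path = os.path.join(p, "tokenizer.model")
    if os.path.exists(vocab_path):
        shutil.copy(vocab_path, os.path.join(output_dir, "tokenizer.model"))
# Here `p` is gone, so anything still pointing into it (such as the reloaded
# tokenizer's vocab_file path) references a file that no longer exists.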