guidance-ai/guidance

"ValueError: Failed to load model from file" for new Phi-3 models

Closed this issue · 2 comments

I can use Guidance with Phi-3-mini, which was announced a while ago, but with the new ones (the Phi-3-medium class) I get:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <placeholder>/test_guidance.py:33
     30         return "<|end|>\n"
     32 # tlm = LlamaCppChat(
---> 33 tlm = Phi3Chat(
     34     # model="<placeholder>/LLM/models/Phi-3-mini-4k-instruct-q4.gguf",
     35     model="<placeholder>/LLM/models/phi-3-medium-4k-instruct.Q4_0.gguf",
     36     n_gpu_layers=128,
     37     seed=42,
     38     n_ctx=4096,
     39     use_mlock=True,
     40     no_mmap=True,
     41     echo=True,
     42 )
     44 class Llama3Chat(LlamaCpp, Chat):
     45     def get_role_start(self, role_name, **kwargs): # type: ignore

File <placeholder>/.venv/lib/python3.12/site-packages/guidance/models/llama_cpp/_llama_cpp.py:229, in LlamaCpp.__init__(self, model, echo, compute_log_probs, api_key, chat_template, **llama_cpp_kwargs)
    227     engine = RemoteEngine(model, api_key=api_key, **llama_cpp_kwargs)
    228 else:
--> 229     engine = LlamaCppEngine(
    230         model, compute_log_probs=compute_log_probs, chat_template=chat_template, **llama_cpp_kwargs
    231     )
    233 super().__init__(engine, echo=echo)

File <placeholder>/.venv/lib/python3.12/site-packages/guidance/models/llama_cpp/_llama_cpp.py:122, in LlamaCppEngine.__init__(self, model, compute_log_probs, chat_template, **kwargs)
    117         kwargs["verbose"] = (
    118             True  # llama-cpp-python can't hide output in this case
    119         )
    121     with normalize_notebook_stdout_stderr():
--> 122         self.model_obj = llama_cpp.Llama(model_path=model, logits_all=True, **kwargs)
    123 elif isinstance(model, llama_cpp.Llama):
    124     self.model = model.__class__.__name__

File <placeholder>/.venv/lib/python3.12/site-packages/llama_cpp/llama.py:338, in Llama.__init__(self, model_path, n_gpu_layers, split_mode, main_gpu, tensor_split, vocab_only, use_mmap, use_mlock, kv_overrides, seed, n_ctx, n_batch, n_threads, n_threads_batch, rope_scaling_type, pooling_type, rope_freq_base, rope_freq_scale, yarn_ext_factor, yarn_attn_factor, yarn_beta_fast, yarn_beta_slow, yarn_orig_ctx, logits_all, embedding, offload_kqv, flash_attn, last_n_tokens_size, lora_base, lora_scale, lora_path, numa, chat_format, chat_handler, draft_model, tokenizer, type_k, type_v, verbose, **kwargs)
    335 if not os.path.exists(model_path):
    336     raise ValueError(f"Model path does not exist: {model_path}")
--> 338 self._model = _LlamaModel(
    339     path_model=self.model_path, params=self.model_params, verbose=self.verbose
    340 )
    342 # Override tokenizer
    343 self.tokenizer_ = tokenizer or LlamaTokenizer(self)

File <placeholder>/.venv/lib/python3.12/site-packages/llama_cpp/_internals.py:57, in _LlamaModel.__init__(self, path_model, params, verbose)
     52     self.model = llama_cpp.llama_load_model_from_file(
     53         self.path_model.encode("utf-8"), self.params
     54     )
     56 if self.model is None:
---> 57     raise ValueError(f"Failed to load model from file: {path_model}")

ValueError: Failed to load model from file: <placeholder>/LLM/models/phi-3-medium-4k-instruct.Q4_0.gguf
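
For reference, the Phi3Chat wrapper in the snippet above is a thin LlamaCpp/Chat subclass along these lines (a minimal sketch reconstructed from the traceback; the role tags follow Phi-3's published chat format, and the exact import path for Chat is an assumption that may differ across guidance versions):

# Minimal sketch of the custom chat wrapper referenced in the traceback.
# The Chat import path is an assumption and may vary by guidance version.
from guidance.models import LlamaCpp, Chat

class Phi3Chat(LlamaCpp, Chat):
    def get_role_start(self, role_name, **kwargs):
        # Phi-3 instruct role tags: <|system|>, <|user|>, <|assistant|>
        return f"<|{role_name}|>\n"

    def get_role_end(self, role_name=None):
        # Every turn is closed with <|end|>
        return "<|end|>\n"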

Hey @ibehnam -- do you mind pointing me to where you got the 4-bit GGUF? A helpful test would be to see whether llama-cpp-python can load the file on its own, with something like the following code:

from llama_cpp import Llama

llm = Llama(
    model_path="<placeholder>/LLM/models/phi-3-medium-4k-instruct.Q4_0.gguf",
    logits_all=True,  # guidance also loads the model with logits_all=True
    n_gpu_layers=128,
    n_ctx=4096,
)

And perhaps a quick generation test:

output = llm(
    "Q: Name the planets in the solar system? A: ",  # prompt
    max_tokens=32,  # generate up to 32 tokens; set to None to generate up to the end of the context window
    stop=["Q:", "\n"],  # stop generating just before the model would generate a new question
    echo=True,  # echo the prompt back in the output
)  # generate a completion; create_completion can also be called directly
print(output)

A look at your stack trace suggests that the issue may be coming from the upstream repo we depend on to interface with llama.cpp (https://github.com/abetlen/llama-cpp-python), but I'm happy to try to debug on our side too.

@Harsha-Nori Thanks so much for your response. I did what you suggested and got the same error from llama-cpp-python directly. I'll dig more and try to find a workaround. I know llama.cpp itself can handle the new models (ollama runs phi-3-medium just fine), so it will probably come down to rebuilding llama-cpp-python against a newer llama.cpp.
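
In case it helps others, a rough sketch of that workaround: upgrade llama-cpp-python, or force a from-source rebuild so it bundles a llama.cpp recent enough to know the Phi-3-medium architecture. The CMAKE_ARGS value below is only an example for a Metal build and is an assumption about the backend; adjust it (e.g. -DLLAMA_CUBLAS=on for CUDA) or drop it for a CPU-only build.

# Upgrade to the latest release, or force a from-source rebuild so the
# package vendors a newer llama.cpp. The backend flag here is an example
# for Metal and is an assumption about the build target.
CMAKE_ARGS="-DLLAMA_METAL=on" pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python

Printing llama_cpp.__version__ afterwards is a quick way to confirm the new build is the one actually being imported.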