langchain-ai/langchain

llama_decode returns -1 when trying to add documents to a PGVector store

Opened this issue · 10 comments

Checked other resources

  • This is a bug, not a usage question.
  • I added a clear and descriptive title that summarizes this issue.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
  • This is not related to the langchain-community package.
  • I read what a minimal reproducible example is (https://stackoverflow.com/help/minimal-reproducible-example).
  • I posted a self-contained, minimal, reproducible example. A maintainer can copy it and run it AS IS.

Example Code

Example:

from langchain_community.embeddings import LlamaCppEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_postgres import PGVector
from langchain_text_splitters import RecursiveCharacterTextSplitter

import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader

llama = LlamaCppEmbeddings(model_path="/path/to/unsloth_gpt-oss-20b-GGUF_gpt-oss-20b-Q4_K_M.gguf", n_gpu_layers=99)

loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()  # from the RAG example [1]

vector_store = PGVector(embeddings=llama, collection_name="test", connection="postgresql+psycopg://rag:PW@localhost:5432", create_extension=False)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, add_start_index=True)
all_splits = text_splitter.split_documents(docs)
ids = vector_store.add_documents(documents=all_splits)

[1]
https://python.langchain.com/docs/tutorials/rag/

Error Message and Stack Trace (if applicable)

init: invalid seq_id[204][0] = 1 >= 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[15], line 3
      1 text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, add_start_index=True)
      2 all_splits = text_splitter.split_documents(docs)
----> 3 ids = vector_store.add_documents(documents=all_splits)

File /opt/anaconda3/envs/rag2/lib/python3.13/site-packages/langchain_core/vectorstores/base.py:279, in VectorStore.add_documents(self, documents, **kwargs)
    277     texts = [doc.page_content for doc in documents]
    278     metadatas = [doc.metadata for doc in documents]
--> 279     return self.add_texts(texts, metadatas, **kwargs)
    280 msg = (
    281     f"`add_documents` and `add_texts` has not been implemented "
    282     f"for {self.__class__.__name__} "
    283 )
    284 raise NotImplementedError(msg)

File /opt/anaconda3/envs/rag2/lib/python3.13/site-packages/langchain_postgres/vectorstores.py:885, in PGVector.add_texts(self, texts, metadatas, ids, **kwargs)
    883 assert not self._async_engine, "This method must be called without async_mode"
    884 texts_ = list(texts)
--> 885 embeddings = self.embedding_function.embed_documents(texts_)
    886 return self.add_embeddings(
    887     texts=texts_,
    888     embeddings=list(embeddings),
   (...)    891     **kwargs,
    892 )

File /opt/anaconda3/envs/rag2/lib/python3.13/site-packages/langchain_community/embeddings/llamacpp.py:119, in LlamaCppEmbeddings.embed_documents(self, texts)
    110 def embed_documents(self, texts: List[str]) -> List[List[float]]:
    111     """Embed a list of documents using the Llama model.
    112 
    113     Args:
   (...)    117         List of embeddings, one for each text.
    118     """
--> 119     embeddings = self.client.create_embedding(texts)
    120     final_embeddings = []
    121     for e in embeddings["data"]:

File /opt/anaconda3/envs/rag2/lib/python3.13/site-packages/llama_cpp/llama.py:980, in Llama.create_embedding(self, input, model)
    978 embeds: Union[List[List[float]], List[List[List[float]]]]
    979 total_tokens: int
--> 980 embeds, total_tokens = self.embed(input, return_count=True)  # type: ignore
    982 # convert to CreateEmbeddingResponse
    983 data: List[Embedding] = [
    984     {
    985         "object": "embedding",
   (...)    989     for idx, emb in enumerate(embeds)
    990 ]

File /opt/anaconda3/envs/rag2/lib/python3.13/site-packages/llama_cpp/llama.py:1094, in Llama.embed(self, input, normalize, truncate, return_count)
   1092 # time to eval batch
   1093 if t_batch + n_tokens > n_batch:
-> 1094     decode_batch(s_batch)
   1095     s_batch = []
   1096     t_batch = 0

File /opt/anaconda3/envs/rag2/lib/python3.13/site-packages/llama_cpp/llama.py:1045, in Llama.embed.<locals>.decode_batch(seq_sizes)
   1043 def decode_batch(seq_sizes: List[int]):
   1044     llama_cpp.llama_kv_self_clear(self._ctx.ctx)
-> 1045     self._ctx.decode(self._batch)
   1046     self._batch.reset()
   1048     # store embeddings

File /opt/anaconda3/envs/rag2/lib/python3.13/site-packages/llama_cpp/_internals.py:327, in LlamaContext.decode(self, batch)
    322 return_code = llama_cpp.llama_decode(
    323     self.ctx,
    324     batch.batch,
    325 )
    326 if return_code != 0:
--> 327     raise RuntimeError(f"llama_decode returned {return_code}")

RuntimeError: llama_decode returned -1

Description

I am trying to use pgvector as the vector store and to embed the example data from [1] with a local gpt-oss:20b model. However, this fails with a RuntimeError. I expect the embedding to work without an error.

Calculating a single embedding seems to work:

vector_1 = llama.embed_query(all_splits[0].page_content)
llama_perf_context_print:        load time =   63623.89 ms
llama_perf_context_print: prompt eval time =   62139.37 ms /   208 tokens (  298.75 ms per token,     3.35 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   68931.87 ms /   209 tokens
llama_perf_context_print:    graphs reused =          0
len(vector_1)
2880

[1] https://python.langchain.com/docs/tutorials/rag/

System Info

from langchain_core import sys_info
sys_info.print_sys_info()
System Information
------------------
> OS:  Linux
> OS Version:  #1 SMP PREEMPT_DYNAMIC Mon, 22 Sep 2025 22:08:35 +0000
> Python Version:  3.13.7 | packaged by Anaconda, Inc. | (main, Sep  9 2025, 19:59:03) [GCC 11.2.0]

Package Information
-------------------
> langchain_core: 0.3.76
> langchain: 0.3.27
> langchain_community: 0.3.29
> langsmith: 0.4.30
> langchain_ollama: 0.3.8
> langchain_postgres: 0.0.15
> langchain_text_splitters: 0.3.11

Optional packages not installed
-------------------------------
> langserve

Other Dependencies
------------------
> aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
> async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
> asyncpg>=0.30.0: Installed. No version info available.
> dataclasses-json<0.7,>=0.6.7: Installed. No version info available.
> httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
> httpx<1,>=0.23.0: Installed. No version info available.
> jsonpatch<2.0,>=1.33: Installed. No version info available.
> langchain-anthropic;: Installed. No version info available.
> langchain-aws;: Installed. No version info available.
> langchain-azure-ai;: Installed. No version info available.
> langchain-cohere;: Installed. No version info available.
> langchain-community;: Installed. No version info available.
> langchain-core<0.4.0,>=0.2.13: Installed. No version info available.
> langchain-core<1.0.0,>=0.3.72: Installed. No version info available.
> langchain-core<1.0.0,>=0.3.76: Installed. No version info available.
> langchain-core<2.0.0,>=0.3.75: Installed. No version info available.
> langchain-deepseek;: Installed. No version info available.
> langchain-fireworks;: Installed. No version info available.
> langchain-google-genai;: Installed. No version info available.
> langchain-google-vertexai;: Installed. No version info available.
> langchain-groq;: Installed. No version info available.
> langchain-huggingface;: Installed. No version info available.
> langchain-mistralai;: Installed. No version info available.
> langchain-ollama;: Installed. No version info available.
> langchain-openai;: Installed. No version info available.
> langchain-perplexity;: Installed. No version info available.
> langchain-text-splitters<1.0.0,>=0.3.9: Installed. No version info available.
> langchain-together;: Installed. No version info available.
> langchain-xai;: Installed. No version info available.
> langchain<2.0.0,>=0.3.27: Installed. No version info available.
> langsmith-pyo3>=0.1.0rc2;: Installed. No version info available.
> langsmith>=0.1.125: Installed. No version info available.
> langsmith>=0.1.17: Installed. No version info available.
> langsmith>=0.3.45: Installed. No version info available.
> numpy<3,>=1.21: Installed. No version info available.
> numpy>=1.26.2;: Installed. No version info available.
> numpy>=2.1.0;: Installed. No version info available.
> ollama<1.0.0,>=0.5.3: Installed. No version info available.
> openai-agents>=0.0.3;: Installed. No version info available.
> opentelemetry-api>=1.30.0;: Installed. No version info available.
> opentelemetry-exporter-otlp-proto-http>=1.30.0;: Installed. No version info available.
> opentelemetry-sdk>=1.30.0;: Installed. No version info available.
> orjson>=3.9.14;: Installed. No version info available.
> packaging>=23.2: Installed. No version info available.
> pgvector<0.4,>=0.2.5: Installed. No version info available.
> psycopg-pool<4,>=3.2.1: Installed. No version info available.
> psycopg<4,>=3: Installed. No version info available.
> pydantic-settings<3.0.0,>=2.10.1: Installed. No version info available.
> pydantic<3,>=1: Installed. No version info available.
> pydantic<3.0.0,>=2.7.4: Installed. No version info available.
> pydantic>=2.7.4: Installed. No version info available.
> pytest>=7.0.0;: Installed. No version info available.
> PyYAML>=5.3: Installed. No version info available.
> requests-toolbelt>=1.0.0: Installed. No version info available.
> requests<3,>=2: Installed. No version info available.
> requests<3,>=2.32.5: Installed. No version info available.
> requests>=2.0.0: Installed. No version info available.
> rich>=13.9.4;: Installed. No version info available.
> SQLAlchemy<3,>=1.4: Installed. No version info available.
> sqlalchemy<3,>=2: Installed. No version info available.
> tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
> tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
> typing-extensions>=4.7: Installed. No version info available.
> vcrpy>=7.0.0;: Installed. No version info available.
> zstandard>=0.23.0: Installed. No version info available.

You need to tell the model to use a larger context window that can accommodate your text chunks. A context size of 2048 is a safe and common choice.

Update the line where you initialize LlamaCppEmbeddings like this:

llama = LlamaCppEmbeddings(
    model_path="/path/to/unsloth_gpt-oss-20b-GGUF_gpt-oss-20b-Q4_K_M.gguf", 
    n_gpu_layers=99,
    n_ctx=2048,  # <-- Add this: The context size
    n_batch=512  # <-- Add this: Set batch size for processing
)

By setting n_ctx=2048, you are ensuring the model's context window is large enough for your roughly 1000-character chunks, which should resolve the error.

Fixed code:

from langchain_community.embeddings import LlamaCppEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_postgres import PGVector
from langchain_text_splitters import RecursiveCharacterTextSplitter

import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader

# ✅ Fix: Increase context size & batch size
llama = LlamaCppEmbeddings(
    model_path="/path/to/unsloth_gpt-oss-20b-GGUF_gpt-oss-20b-Q4_K_M.gguf",
    n_gpu_layers=99,
    n_ctx=2048,   # allow processing of the ~1000-character chunks
    n_batch=512   # batch size for efficiency
)

loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
) 
docs = loader.load()  # from the RAG example

vector_store = PGVector(
    embeddings=llama,
    collection_name="test",
    connection="postgresql+psycopg://rag:PW@localhost:5432",
    create_extension=False
)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

ids = vector_store.add_documents(documents=all_splits)

While testing, I played around with context sizes (the default 512, 1024, and 4096) before submitting this bug report. Even with llama = LlamaCppEmbeddings(model_path="/path/to/unsloth_gpt-oss-20b-GGUF_gpt-oss-20b-Q4_K_M.gguf", n_ctx=2048, n_batch=512) the error persists on my machine, even when running the model on the CPU instead of the GPU. Do you have an idea how I can debug the llama_decode returned -1 further?

Error
The specific error you're encountering is llama_decode returned -1.

This is a generic, low-level error message originating from the underlying llama.cpp library that LangChain's LlamaCppEmbeddings uses. It indicates that a critical failure occurred during the model's attempt to process the input text (a process called decoding or inference).

The error is not in your Python code itself but points to a deeper problem. The most common causes are:

Corrupted Model File: The .gguf model file you downloaded might be incomplete or corrupted. This is the most frequent cause of this specific error.

Memory Issues: Even on the CPU, processing large batches of text can exhaust your system's RAM, leading to a crash that the library reports as a decode error. Your batch size of n_batch=512 combined with a chunk_size of 1000 is quite memory-intensive.

Library Version Mismatch: The version of the llama-cpp-python library installed in your environment might not be fully compatible with the specific quantization method used for your GGUF model.

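For the first cause (a corrupted download), a quick sanity check is to compare the file's SHA-256 checksum with the one published on the model's download page. A minimal sketch (the reference hash is whatever the hosting page lists, and the path is your local model file):

import hashlib

# Compute the SHA-256 of the downloaded GGUF file in chunks, so the
# ~11 GiB file is never loaded into memory at once.
def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256_of("/path/to/unsloth_gpt-oss-20b-GGUF_gpt-oss-20b-Q4_K_M.gguf"))
# Compare the printed value against the checksum listed for the file online.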

Step 1: Create a Minimal Test Script

First, let's confirm the model can load and run correctly in isolation, outside of your complex data pipeline. Create a new file named test_model.py:

# test_model.py
from langchain_community.embeddings import LlamaCppEmbeddings

try:
    print("Attempting to initialize the model...")
    # Add verbose=True for more detailed logs from llama.cpp
    # Reduce n_batch to a very small number to rule out memory issues
    llama = LlamaCppEmbeddings(
        model_path="/path/to/unsloth_gpt-oss-20b-GGUF_gpt-oss-20b-Q4_K_M.gguf",
        n_ctx=2048,
        n_batch=32, # Start with a much smaller batch size
        n_gpu_layers=0, # Force CPU to ensure it's not a GPU issue
        verbose=True  # Enable detailed logging
    )
    print("Model initialized successfully.")

    print("Attempting to embed a simple text...")
    test_text = "This is a simple sentence for testing the embedding model."
    vector = llama.embed_query(test_text)
    print("Embedding successful!")
    print(f"Vector dimension: {len(vector)}")
    # print(f"First 5 elements: {vector[:5]}") # Uncomment to see the vector

except Exception as e:
    print(f"An error occurred: {e}")

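Since the traceback shows the failure inside llama-cpp-python's batched embedding path (create_embedding called with a list), while a single embed_query apparently works, it can also be worth bypassing LangChain entirely and calling llama-cpp-python directly with both a single string and a list of strings. A minimal sketch, assuming the same model path and parameters as in the test script above:

# test_llama_cpp_direct.py -- bypass LangChain and call llama-cpp-python directly
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/unsloth_gpt-oss-20b-GGUF_gpt-oss-20b-Q4_K_M.gguf",
    embedding=True,   # enable embedding mode; required for create_embedding
    n_ctx=2048,
    n_batch=32,
    n_gpu_layers=0,   # CPU only, matching the minimal test above
    verbose=True,
)

# A single input exercises roughly the same path as embed_query ...
single = llm.create_embedding("This is a simple sentence for testing.")
print(f"Single input OK, {len(single['data'])} embedding(s) returned.")

# ... while a list of inputs goes through the batched decode path where
# llama_decode returned -1 in the traceback above.
batch = llm.create_embedding([
    "First short test sentence.",
    "Second short test sentence.",
])
print(f"List input OK, {len(batch['data'])} embeddings returned.")

If the list call fails here too, the problem is in llama-cpp-python (or the model file) rather than in LangChain.
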
Step 2: Update Your Libraries

Ensure you have the latest compatible versions of the libraries.

pip install --upgrade langchain langchain-community llama-cpp-python

Step 3: Correct the Original Code

Once the minimal test works, you can apply what you learn to your original script. I also noticed an issue in your database connection string that is guaranteed to cause an error later, which I've corrected below.

from langchain_community.embeddings import LlamaCppEmbeddings
from langchain_community.document_loaders import WebBaseLoader
from langchain_postgres import PGVector
from langchain_text_splitters import RecursiveCharacterTextSplitter
import bs4

# Use parameters proven to work from the minimal test
# A smaller batch size is safer for processing many documents
llama = LlamaCppEmbeddings(
    model_path="/path/to/unsloth_gpt-oss-20b-GGUF_gpt-oss-20b-Q4_K_M.gguf",
    n_ctx=2048,
    n_batch=64, # Use a smaller, safer batch size
    n_gpu_layers=99, # You can restore GPU layers if CPU test passes
    verbose=True
)

loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

# FIX: You were missing the database name at the end of the connection string.
# Format: postgresql+psycopg://user:password@host:port/database_name
connection_string = "postgresql+psycopg://rag:PW@localhost:5432/your_db_name"

vector_store = PGVector(
    embeddings=llama,
    collection_name="test",
    connection=connection_string,
    create_extension=False
)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, add_start_index=True)
all_splits = text_splitter.split_documents(docs)

print(f"Adding {len(all_splits)} document splits to the vector store...")
ids = vector_store.add_documents(documents=all_splits)
print("Successfully added documents.")

Thanks for the test_model.py. It runs without an error (see the full output at the end of this comment). llama.embed_query runs as expected, as before. Upgrading the packages did not help.

(rag2) [simon@chimchar rag]$ pip freeze | grep llama_cpp_python
llama_cpp_python==0.3.16
(rag2) [simon@chimchar rag]$ pip freeze | grep langchain
langchain==0.3.27
langchain-community==0.3.30
langchain-core==0.3.76
langchain-ollama==0.3.8
langchain-postgres==0.0.15
langchain-text-splitters==0.3.11

I've added the database name to the connection string as well, but that (obviously) did not help, either. I've re-downloaded the model file, too. That changed nothing. My next step is to convert the original gpt-oss safetensors to GGUF and test the resulting GGUF file.

llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon Graphics) - 20822 MiB free
llama_model_loader: loaded meta data with 37 key-value pairs and 459 tensors from /home/simon/.cache/llama.cpp/unsloth_gpt-oss-20b-GGUF_gpt-oss-20b-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gpt-oss
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gpt-Oss-20B
llama_model_loader: - kv   3:                           general.basename str              = Gpt-Oss-20B
llama_model_loader: - kv   4:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   5:                         general.size_label str              = 20B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   8:                               general.tags arr[str,2]       = ["vllm", "text-generation"]
llama_model_loader: - kv   9:                        gpt-oss.block_count u32              = 24
llama_model_loader: - kv  10:                     gpt-oss.context_length u32              = 131072
llama_model_loader: - kv  11:                   gpt-oss.embedding_length u32              = 2880
llama_model_loader: - kv  12:                gpt-oss.feed_forward_length u32              = 2880
llama_model_loader: - kv  13:               gpt-oss.attention.head_count u32              = 64
llama_model_loader: - kv  14:            gpt-oss.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                     gpt-oss.rope.freq_base f32              = 150000.000000
llama_model_loader: - kv  16:   gpt-oss.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                       gpt-oss.expert_count u32              = 32
llama_model_loader: - kv  18:                  gpt-oss.expert_used_count u32              = 4
llama_model_loader: - kv  19:               gpt-oss.attention.key_length u32              = 64
llama_model_loader: - kv  20:             gpt-oss.attention.value_length u32              = 64
llama_model_loader: - kv  21:           gpt-oss.attention.sliding_window u32              = 128
llama_model_loader: - kv  22:         gpt-oss.expert_feed_forward_length u32              = 2880
llama_model_loader: - kv  23:                  gpt-oss.rope.scaling.type str              = yarn
llama_model_loader: - kv  24:                gpt-oss.rope.scaling.factor f32              = 32.000000
llama_model_loader: - kv  25: gpt-oss.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = gpt-4o
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,201088]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,201088]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,446189]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 199998
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 200002
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 200017
llama_model_loader: - kv  34:                    tokenizer.chat_template str              = {# Chat template fixes by Unsloth #}\n...
llama_model_loader: - kv  35:               general.quantization_version u32              = 2
llama_model_loader: - kv  36:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  289 tensors
llama_model_loader: - type q5_0:   61 tensors
llama_model_loader: - type q8_0:   13 tensors
llama_model_loader: - type q4_K:   24 tensors
llama_model_loader: - type mxfp4:   72 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 10.81 GiB (4.44 BPW) 

Attempting to initialize the model...

init_tokenizer: initializing tokenizer for type 2
load: control token: 200017 '<|reserved_200017|>' is not marked as EOG
load: control token: 200014 '<|reserved_200014|>' is not marked as EOG
load: control token: 200011 '<|reserved_200011|>' is not marked as EOG
load: control token: 200009 '<|reserved_200009|>' is not marked as EOG
load: control token: 200008 '<|message|>' is not marked as EOG
load: control token: 200006 '<|start|>' is not marked as EOG
load: control token: 200004 '<|reserved_200004|>' is not marked as EOG
load: control token: 200003 '<|constrain|>' is not marked as EOG
load: control token: 200000 '<|reserved_200000|>' is not marked as EOG
load: control token: 200005 '<|channel|>' is not marked as EOG
load: control token: 200010 '<|reserved_200010|>' is not marked as EOG
load: control token: 200016 '<|reserved_200016|>' is not marked as EOG
load: control token: 200013 '<|reserved_200013|>' is not marked as EOG
load: control token: 199998 '<|startoftext|>' is not marked as EOG
load: control token: 200018 '<|endofprompt|>' is not marked as EOG
load: control token: 200001 '<|reserved_200001|>' is not marked as EOG
load: control token: 200015 '<|reserved_200015|>' is not marked as EOG
load: printing all EOG tokens:
load:   - 199999 ('<|endoftext|>')
load:   - 200002 ('<|return|>')
load:   - 200007 ('<|end|>')
load:   - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch             = gpt-oss
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2880
print_info: n_layer          = 24
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 128
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 2880
print_info: n_expert         = 32
print_info: n_expert_used    = 4
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = yarn
print_info: freq_base_train  = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: model type       = ?B
print_info: model params     = 20.91 B
print_info: general.name     = Gpt-Oss-20B
print_info: n_ff_exp         = 2880
print_info: vocab type       = BPE
print_info: n_vocab          = 201088
print_info: n_merges         = 446189
print_info: BOS token        = 199998 '<|startoftext|>'
print_info: EOS token        = 200002 '<|return|>'
print_info: EOT token        = 199999 '<|endoftext|>'
print_info: PAD token        = 200017 '<|reserved_200017|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 199999 '<|endoftext|>'
print_info: EOG token        = 200002 '<|return|>'
print_info: EOG token        = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU, is_swa = 1
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 1
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CPU, is_swa = 1
load_tensors: layer   5 assigned to device CPU, is_swa = 0
load_tensors: layer   6 assigned to device CPU, is_swa = 1
load_tensors: layer   7 assigned to device CPU, is_swa = 0
load_tensors: layer   8 assigned to device CPU, is_swa = 1
load_tensors: layer   9 assigned to device CPU, is_swa = 0
load_tensors: layer  10 assigned to device CPU, is_swa = 1
load_tensors: layer  11 assigned to device CPU, is_swa = 0
load_tensors: layer  12 assigned to device CPU, is_swa = 1
load_tensors: layer  13 assigned to device CPU, is_swa = 0
load_tensors: layer  14 assigned to device CPU, is_swa = 1
load_tensors: layer  15 assigned to device CPU, is_swa = 0
load_tensors: layer  16 assigned to device CPU, is_swa = 1
load_tensors: layer  17 assigned to device CPU, is_swa = 0
load_tensors: layer  18 assigned to device CPU, is_swa = 1
load_tensors: layer  19 assigned to device CPU, is_swa = 0
load_tensors: layer  20 assigned to device CPU, is_swa = 1
load_tensors: layer  21 assigned to device CPU, is_swa = 0
load_tensors: layer  22 assigned to device CPU, is_swa = 1
load_tensors: layer  23 assigned to device CPU, is_swa = 0
load_tensors: layer  24 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q5_0) (and 458 others) cannot be used with preferred buffer type Vulkan_Host, using CPU instead
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/25 layers to GPU
load_tensors:   CPU_Mapped model buffer size = 11073.83 MiB
...............................................................................
llama_context: constructing llama_context
llama_context: n_batch is less than GGML_KQ_MASK_PAD - increasing to 64
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch       = 64
llama_context: n_ubatch      = 32
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: kv_unified    = false
llama_context: freq_base     = 150000.0
llama_context: freq_scale    = 0.03125
llama_context: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.78 MiB
create_memory: n_ctx = 2048 (padded)
llama_kv_cache_unified_iswa: using full-size SWA cache (ref: https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 2048 cells
llama_kv_cache_unified: layer   0: skipped
llama_kv_cache_unified: layer   1: dev = CPU
llama_kv_cache_unified: layer   2: skipped
llama_kv_cache_unified: layer   3: dev = CPU
llama_kv_cache_unified: layer   4: skipped
llama_kv_cache_unified: layer   5: dev = CPU
llama_kv_cache_unified: layer   6: skipped
llama_kv_cache_unified: layer   7: dev = CPU
llama_kv_cache_unified: layer   8: skipped
llama_kv_cache_unified: layer   9: dev = CPU
llama_kv_cache_unified: layer  10: skipped
llama_kv_cache_unified: layer  11: dev = CPU
llama_kv_cache_unified: layer  12: skipped
llama_kv_cache_unified: layer  13: dev = CPU
llama_kv_cache_unified: layer  14: skipped
llama_kv_cache_unified: layer  15: dev = CPU
llama_kv_cache_unified: layer  16: skipped
llama_kv_cache_unified: layer  17: dev = CPU
llama_kv_cache_unified: layer  18: skipped
llama_kv_cache_unified: layer  19: dev = CPU
llama_kv_cache_unified: layer  20: skipped
llama_kv_cache_unified: layer  21: dev = CPU
llama_kv_cache_unified: layer  22: skipped
llama_kv_cache_unified: layer  23: dev = CPU
llama_kv_cache_unified:        CPU KV buffer size =    48.00 MiB
llama_kv_cache_unified: size =   48.00 MiB (  2048 cells,  12 layers,  1/1 seqs), K (f16):   24.00 MiB, V (f16):   24.00 MiB
llama_kv_cache_unified_iswa: creating     SWA KV cache, size = 2048 cells
llama_kv_cache_unified: layer   0: dev = CPU
llama_kv_cache_unified: layer   1: skipped
llama_kv_cache_unified: layer   2: dev = CPU
llama_kv_cache_unified: layer   3: skipped
llama_kv_cache_unified: layer   4: dev = CPU
llama_kv_cache_unified: layer   5: skipped
llama_kv_cache_unified: layer   6: dev = CPU
llama_kv_cache_unified: layer   7: skipped
llama_kv_cache_unified: layer   8: dev = CPU
llama_kv_cache_unified: layer   9: skipped
llama_kv_cache_unified: layer  10: dev = CPU
llama_kv_cache_unified: layer  11: skipped
llama_kv_cache_unified: layer  12: dev = CPU
llama_kv_cache_unified: layer  13: skipped
llama_kv_cache_unified: layer  14: dev = CPU
llama_kv_cache_unified: layer  15: skipped
llama_kv_cache_unified: layer  16: dev = CPU
llama_kv_cache_unified: layer  17: skipped
llama_kv_cache_unified: layer  18: dev = CPU
llama_kv_cache_unified: layer  19: skipped
llama_kv_cache_unified: layer  20: dev = CPU
llama_kv_cache_unified: layer  21: skipped
llama_kv_cache_unified: layer  22: dev = CPU
llama_kv_cache_unified: layer  23: skipped
llama_kv_cache_unified:        CPU KV buffer size =    48.00 MiB
llama_kv_cache_unified: size =   48.00 MiB (  2048 cells,  12 layers,  1/1 seqs), K (f16):   24.00 MiB, V (f16):   24.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 3672
llama_context: worst-case: n_tokens = 32, n_seqs = 1, n_outputs = 0
graph_reserve: reserving a graph for ubatch with n_tokens =   32, n_seqs =  1, n_outputs =   32
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =   32, n_seqs =  1, n_outputs =   32
llama_context:    Vulkan0 compute buffer size =   611.72 MiB
llama_context: Vulkan_Host compute buffer size =     4.06 MiB
llama_context: graph nodes  = 1446
llama_context: graph splits = 556 (with bs=32), 1 (with bs=1)
CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
Model metadata: {'general.file_type': '15', 'general.quantization_version': '2', 'tokenizer.chat_template': '{# Chat template fixes by Unsloth #}\n{#-\n  In addition to the normal inputs of `messages` and `tools`, this template also accepts the\n  following kwargs:\n  - "builtin_tools": A list, can contain "browser" and[/or](http://localhost:8888/or) "python".\n  - "model_identity": A string that optionally describes the model identity.\n  - "reasoning_effort": A string that describes the reasoning effort, defaults to "medium".\n #}\n\n{#- Tool Definition Rendering ============================================== #}\n{%- macro render_typescript_type(param_spec, required_params, is_nullable=false) -%}\n    {%- if param_spec.type == "array" -%}\n        {%- if param_spec[\'items\'] -%}\n            {%- if param_spec[\'items\'][\'type\'] == "string" -%}\n                {{- "string[]" }}\n            {%- elif param_spec[\'items\'][\'type\'] == "number" -%}\n                {{- "number[]" }}\n            {%- elif param_spec[\'items\'][\'type\'] == "integer" -%}\n                {{- "number[]" }}\n            {%- elif param_spec[\'items\'][\'type\'] == "boolean" -%}\n                {{- "boolean[]" }}\n            {%- else -%}\n                {%- set inner_type = render_typescript_type(param_spec[\'items\'], required_params) -%}\n                {%- if inner_type == "object | object" or inner_type|length > 50 -%}\n                    {{- "any[]" }}\n                {%- else -%}\n                    {{- inner_type + "[]" }}\n                {%- endif -%}\n            {%- endif -%}\n            {%- if param_spec.nullable -%}\n                {{- " | null" }}\n            {%- endif -%}\n        {%- else -%}\n            {{- "any[]" }}\n            {%- if param_spec.nullable -%}\n                {{- " | null" }}\n            {%- endif -%}\n        {%- endif -%}\n    {%- elif param_spec.type is defined and param_spec.type is iterable and param_spec.type is not string and param_spec.type is not mapping and param_spec.type[0] is defined -%}\n        {#- Handle array of types like ["object", "object"] from Union[dict, list] #}\n        {%- if param_spec.type | length > 1 -%}\n            {{- param_spec.type | join(" | ") }}\n        {%- else -%}\n            {{- param_spec.type[0] }}\n        {%- endif -%}\n    {%- elif param_spec.oneOf -%}\n        {#- Handle oneOf schemas - check for complex unions and fallback to any #}\n        {%- set has_object_variants = false -%}\n        {%- for variant in param_spec.oneOf -%}\n            {%- if variant.type == "object" -%}\n                {%- set has_object_variants = true -%}\n            {%- endif -%}\n        {%- endfor -%}\n        {%- if has_object_variants and param_spec.oneOf|length > 1 -%}\n            {{- "any" }}\n        {%- else -%}\n            {%- for variant in param_spec.oneOf -%}\n                {{- render_typescript_type(variant, required_params) -}}\n                {%- if variant.description %}\n                    {{- "// " + variant.description }}\n                {%- endif -%}\n                {%- if variant.default is defined %}\n                    {{ "// default: " + variant.default|tojson }}\n                {%- endif -%}\n                {%- if not loop.last %}\n                    {{- " | " }}\n                {% endif -%}\n            {%- endfor -%}\n        {%- endif -%}\n    {%- elif param_spec.type == "string" -%}\n        {%- if param_spec.enum -%}\n            {{- \'"\' + param_spec.enum|join(\'" | "\') + \'"\' -}}\n       
 {%- else -%}\n            {{- "string" }}\n            {%- if param_spec.nullable %}\n                {{- " | null" }}\n            {%- endif -%}\n        {%- endif -%}\n    {%- elif param_spec.type == "number" -%}\n        {{- "number" }}\n    {%- elif param_spec.type == "integer" -%}\n        {{- "number" }}\n    {%- elif param_spec.type == "boolean" -%}\n        {{- "boolean" }}\n\n    {%- elif param_spec.type == "object" -%}\n        {%- if param_spec.properties -%}\n            {{- "{\\n" }}\n            {%- for prop_name, prop_spec in param_spec.properties.items() -%}\n                {{- prop_name -}}\n                {%- if prop_name not in (param_spec.required or []) -%}\n                    {{- "?" }}\n                {%- endif -%}\n                {{- ": " }}\n                {{ render_typescript_type(prop_spec, param_spec.required or []) }}\n                {%- if not loop.last -%}\n                    {{-", " }}\n                {%- endif -%}\n            {%- endfor -%}\n            {{- "}" }}\n        {%- else -%}\n            {{- "object" }}\n        {%- endif -%}\n    {%- else -%}\n        {{- "any" }}\n    {%- endif -%}\n{%- endmacro -%}\n\n{%- macro render_tool_namespace(namespace_name, tools) -%}\n    {{- "## " + namespace_name + "\\n\\n" }}\n    {{- "namespace " + namespace_name + " {\\n\\n" }}\n    {%- for tool in tools %}\n        {%- set tool = tool.function %}\n        {{- "// " + tool.description + "\\n" }}\n        {{- "type "+ tool.name + " = " }}\n        {%- if tool.parameters and tool.parameters.properties %}\n            {{- "(_: {\\n" }}\n            {%- for param_name, param_spec in tool.parameters.properties.items() %}\n                {%- if param_spec.description %}\n                    {{- "// " + param_spec.description + "\\n" }}\n                {%- endif %}\n                {{- param_name }}\n                {%- if param_name not in (tool.parameters.required or []) -%}\n                    {{- "?" 
}}\n                {%- endif -%}\n                {{- ": " }}\n                {{- render_typescript_type(param_spec, tool.parameters.required or []) }}\n                {%- if param_spec.default is defined -%}\n                    {%- if param_spec.enum %}\n                        {{- ", // default: " + param_spec.default }}\n                    {%- elif param_spec.oneOf %}\n                        {{- "// default: " + param_spec.default }}\n                    {%- else %}\n                        {{- ", // default: " + param_spec.default|tojson }}\n                    {%- endif -%}\n                {%- endif -%}\n                {%- if not loop.last %}\n                    {{- ",\\n" }}\n                {%- else %}\n                    {{- ",\\n" }}\n                {%- endif -%}\n            {%- endfor %}\n            {{- "}) => any;\\n\\n" }}\n        {%- else -%}\n            {{- "() => any;\\n\\n" }}\n        {%- endif -%}\n    {%- endfor %}\n    {{- "} // namespace " + namespace_name }}\n{%- endmacro -%}\n\n{%- macro render_builtin_tools(browser_tool, python_tool) -%}\n    {%- if browser_tool %}\n        {{- "## browser\\n\\n" }}\n        {{- "// Tool for browsing.\\n" }}\n        {{- "// The `cursor` appears in brackets before each browsing display: `[{cursor}]`.\\n" }}\n        {{- "// Cite information from the tool using the following format:\\n" }}\n        {{- "// `【{cursor}†L{line_start}(-L{line_end})?】`, for example: `【6†L9-L11】` or `【8†L3】`.\\n" }}\n        {{- "// Do not quote more than 10 words directly from the tool output.\\n" }}\n        {{- "// sources=web (default: web)\\n" }}\n        {{- "namespace browser {\\n\\n" }}\n        {{- "// Searches for information related to `query` and displays `topn` results.\\n" }}\n        {{- "type search = (_: {\\n" }}\n        {{- "query: string,\\n" }}\n        {{- "topn?: number, // default: 10\\n" }}\n        {{- "source?: string,\\n" }}\n        {{- "}) => any;\\n\\n" }}\n        {{- "// Opens the link `id` from the page indicated by `cursor` starting at line number `loc`, showing `num_lines` lines.\\n" }}\n        {{- "// Valid link ids are displayed with the formatting: `【{id}†.*】`.\\n" }}\n        {{- "// If `cursor` is not provided, the most recent page is implied.\\n" }}\n        {{- "// If `id` is a string, it is treated as a fully qualified URL associated with `source`.\\n" }}\n        {{- "// If `loc` is not provided, the viewport will be positioned at the beginning of the document or centered on the most relevant passage, if available.\\n" }}\n        {{- "// Use this function without `id` to scroll to a new location of an opened page.\\n" }}\n        {{- "type open = (_: {\\n" }}\n        {{- "id?: number | string, // default: -1\\n" }}\n        {{- "cursor?: number, // default: -1\\n" }}\n        {{- "loc?: number, // default: -1\\n" }}\n        {{- "num_lines?: number, // default: -1\\n" }}\n        {{- "view_source?: boolean, // default: false\\n" }}\n        {{- "source?: string,\\n" }}\n        {{- "}) => any;\\n\\n" }}\n        {{- "// Finds exact matches of `pattern` in the current page, or the page given by `cursor`.\\n" }}\n        {{- "type find = (_: {\\n" }}\n        {{- "pattern: string,\\n" }}\n        {{- "cursor?: number, // default: -1\\n" }}\n        {{- "}) => any;\\n\\n" }}\n        {{- "} // namespace browser\\n\\n" }}\n    {%- endif -%}\n\n    {%- if python_tool %}\n        {{- "## python\\n\\n" }}\n        {{- "Use this tool to execute Python code in your chain of thought. 
The code will not be shown to the user. This tool should be used for internal reasoning, but not for code that is intended to be visible to the user (e.g. when creating plots, tables, or files).\\n\\n" }}\n        {{- "When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 120.0 seconds. The drive at \'[/mnt/data](http://localhost:8888/mnt/data)\' can be used to save and persist user files. Internet access for this session is UNKNOWN. Depends on the cluster.\\n\\n" }}\n    {%- endif -%}\n{%- endmacro -%}\n\n{#- System Message Construction ============================================ #}\n{%- macro build_system_message() -%}\n    {%- if model_identity is not defined %}\n        {%- set model_identity = "You are ChatGPT, a large language model trained by OpenAI." %}\n    {%- endif %}\n    {{- model_identity + "\\n" }}\n    {{- "Knowledge cutoff: 2024-06\\n" }}\n    {{- "Current date: " + strftime_now("%Y-%m-%d") + "\\n\\n" }}\n    {%- if reasoning_effort is not defined %}\n        {%- set reasoning_effort = "medium" %}\n    {%- endif %}\n    {{- "Reasoning: " + reasoning_effort + "\\n\\n" }}\n    {%- if builtin_tools is defined and builtin_tools is not none %}\n        {{- "# Tools\\n\\n" }}\n        {%- set available_builtin_tools = namespace(browser=false, python=false) %}\n        {%- for tool in builtin_tools %}\n            {%- if tool == "browser" %}\n                {%- set available_builtin_tools.browser = true %}\n            {%- elif tool == "python" %}\n                {%- set available_builtin_tools.python = true %}\n            {%- endif %}\n        {%- endfor %}\n        {{- render_builtin_tools(available_builtin_tools.browser, available_builtin_tools.python) }}\n    {%- endif -%}\n    {{- "# Valid channels: analysis, commentary, final. Channel must be included for every message." }}\n    {%- if tools -%}\n        {{- "\\nCalls to these tools must go to the commentary channel: \'functions\'." 
}}\n    {%- endif -%}\n{%- endmacro -%}\n\n{#- Main Template Logic ================================================= #}\n{#- Set defaults #}\n\n{#- Render system message #}\n{{- "<|start|>system<|message|>" }}\n{{- build_system_message() }}\n{{- "<|end|>" }}\n\n{#- Extract developer message #}\n{%- if developer_instructions is defined and developer_instructions is not none %}\n    {%- set developer_message = developer_instructions %}\n    {%- set loop_messages = messages %}\n{%- elif messages[0].role == "developer" or messages[0].role == "system" %}\n    {%- set developer_message = messages[0].content %}\n    {%- set loop_messages = messages[1:] %}\n{%- else %}\n    {%- set developer_message = "" %}\n    {%- set loop_messages = messages %}\n{%- endif %}\n\n{#- Render developer message #}\n{%- if developer_message or tools %}\n    {{- "<|start|>developer<|message|>" }}\n    {%- if developer_message %}\n        {{- "# Instructions\\n\\n" }}\n        {{- developer_message }}\n    {%- endif %}\n    {%- if tools -%}\n        {%- if developer_message %}\n            {{- "\\n\\n" }}\n        {%- endif %}\n        {{- "# Tools\\n\\n" }}\n        {{- render_tool_namespace("functions", tools) }}\n    {%- endif -%}\n    {{- "<|end|>" }}\n{%- endif %}\n\n{#- Render messages #}\n{%- set last_tool_call = namespace(name=none) %}\n{%- for message in loop_messages -%}\n    {#- At this point only assistant[/user/tool](http://localhost:8888/user/tool) messages should remain #}\n    {%- if message.role == \'assistant\' -%}\n        {#- Checks to ensure the messages are being passed in the format we expect #}\n        {%- if "thinking" in message %}\n            {%- if "<|channel|>analysis<|message|>" in message.thinking or "<|channel|>final<|message|>" in message.thinking %}\n                {{- raise_exception("You have passed a message containing <|channel|> tags in the thinking field. Instead of doing this, you should pass analysis messages (the string between \'<|message|>\' and \'<|end|>\') in the \'thinking\' field, and final messages (the string between \'<|message|>\' and \'<|end|>\') in the \'content\' field.") }}\n            {%- endif %}\n        {%- endif %}\n        {%- if "tool_calls" in message %}\n            {#- We need very careful handling here - we want to drop the tool call analysis message if the model #}\n            {#- has output a later <|final|> message, but otherwise we want to retain it. This is the only case #}\n            {#- when we render CoT[/analysis](http://localhost:8888/analysis) messages in inference. #}\n            {%- set future_final_message = namespace(found=false) %}\n            {%- for future_message in loop_messages[loop.index:] %}\n                {%- if future_message.role == \'assistant\' and "tool_calls" not in future_message %}\n                    {%- set future_final_message.found = true %}\n                {%- endif %}\n            {%- endfor %}\n            {#- We assume max 1 tool call per message, and so we infer the tool call name #}\n            {#- in "tool" messages from the most recent assistant tool call name #}\n            {%- set tool_call = message.tool_calls[0] %}\n            {%- if tool_call.function %}\n                {%- set tool_call = tool_call.function %}\n            {%- endif %}\n            {%- if message.content and message.thinking %}\n                {{- raise_exception("Cannot pass both content and thinking in an assistant message with tool calls! 
Put the analysis message in one or the other, but not both.") }}\n            {%- elif message.content and not future_final_message.found %}\n                {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.content + "<|end|>" }}\n            {%- elif message.thinking and not future_final_message.found %}\n                {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}\n            {%- endif %}\n            {{- "<|start|>assistant to=" }}\n            {{- "functions." + tool_call.name + "<|channel|>commentary " }}\n            {{- (tool_call.content_type if tool_call.content_type is defined else "json") + "<|message|>" }}\n            {%- if tool_call.arguments is string %}\n                {{- tool_call.arguments }}\n            {%- else %}\n                {{- tool_call.arguments|tojson }}\n            {%- endif %}\n            {{- "<|call|>" }}\n            {%- set last_tool_call.name = tool_call.name %}\n        {%- elif loop.last and not add_generation_prompt %}\n            {#- Only render the CoT if the final turn is an assistant turn and add_generation_prompt is false #}\n            {#- This is a situation that should only occur in training, never in inference. #}\n            {%- if "thinking" in message %}\n                {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}\n            {%- endif %}\n            {#- <|return|> indicates the end of generation, but <|end|> does not #}\n            {#- <|return|> should never be an input to the model, but we include it as the final token #}\n            {#- when training, so the model learns to emit it. #}\n            {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|end|>" }}\n        {%- elif "thinking" in message %}\n            {#- CoT is dropped during all previous turns, so we never render it for inference #}\n            {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.content + "<|end|>" }}\n            {%- set last_tool_call.name = none %}\n        {%- else %}\n            {#- CoT is dropped during all previous turns, so we never render it for inference #}\n            {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|end|>" }}\n            {%- set last_tool_call.name = none %}\n        {%- endif %}\n    {%- elif message.role == \'tool\' -%}\n        {%- if last_tool_call.name is none %}\n            {{- raise_exception("Message has tool role, but there was no previous assistant message with a tool call!") }}\n        {%- endif %}\n        {{- "<|start|>functions." + last_tool_call.name }}\n        {%- if message.content is string %}\n            {{- " to=assistant<|channel|>commentary<|message|>" + message.content + "<|end|>" }}\n        {%- else %}\n            {{- " to=assistant<|channel|>commentary<|message|>" + message.content|tojson + "<|end|>" }}\n        {%- endif %}\n    {%- elif message.role == \'user\' -%}\n        {{- "<|start|>user<|message|>" + message.content + "<|end|>" }}\n    {%- endif -%}\n{%- endfor -%}\n\n{#- Generation prompt #}\n{%- if add_generation_prompt -%}\n<|start|>assistant\n{%- endif -%}\n{# Copyright 2025-present Unsloth. Apache 2.0 License. Unsloth chat template fixes. 
Edited from ggml-org & OpenAI #}', 'gpt-oss.attention.head_count': '64', 'gpt-oss.rope.scaling.original_context_length': '4096', 'gpt-oss.feed_forward_length': '2880', 'general.repo_url': 'https://huggingface.co/unsloth', 'general.license': 'apache-2.0', 'general.size_label': '20B', 'general.type': 'model', 'tokenizer.ggml.padding_token_id': '200017', 'gpt-oss.context_length': '131072', 'general.quantized_by': 'Unsloth', 'gpt-oss.embedding_length': '2880', 'gpt-oss.block_count': '24', 'gpt-oss.attention.sliding_window': '128', 'tokenizer.ggml.pre': 'gpt-4o', 'general.architecture': 'gpt-oss', 'gpt-oss.rope.freq_base': '150000.000000', 'gpt-oss.attention.head_count_kv': '8', 'gpt-oss.attention.layer_norm_rms_epsilon': '0.000010', 'gpt-oss.expert_count': '32', 'general.basename': 'Gpt-Oss-20B', 'gpt-oss.attention.key_length': '64', 'gpt-oss.expert_used_count': '4', 'gpt-oss.expert_feed_forward_length': '2880', 'gpt-oss.rope.scaling.type': 'yarn', 'tokenizer.ggml.eos_token_id': '200002', 'gpt-oss.rope.scaling.factor': '32.000000', 'tokenizer.ggml.model': 'gpt2', 'general.name': 'Gpt-Oss-20B', 'gpt-oss.attention.value_length': '64', 'tokenizer.ggml.bos_token_id': '199998'}
Available chat formats from metadata: chat_template.default
Using gguf chat template: {# Chat template fixes by Unsloth #}
{#-
  In addition to the normal inputs of `messages` and `tools`, this template also accepts the
  following kwargs:
  - "builtin_tools": A list, can contain "browser" and[/or](http://localhost:8888/or) "python".
  - "model_identity": A string that optionally describes the model identity.
  - "reasoning_effort": A string that describes the reasoning effort, defaults to "medium".
 #}

{#- Tool Definition Rendering ============================================== #}
{%- macro render_typescript_type(param_spec, required_params, is_nullable=false) -%}
    {%- if param_spec.type == "array" -%}
        {%- if param_spec['items'] -%}
            {%- if param_spec['items']['type'] == "string" -%}
                {{- "string[]" }}
            {%- elif param_spec['items']['type'] == "number" -%}
                {{- "number[]" }}
            {%- elif param_spec['items']['type'] == "integer" -%}
                {{- "number[]" }}
            {%- elif param_spec['items']['type'] == "boolean" -%}
                {{- "boolean[]" }}
            {%- else -%}
                {%- set inner_type = render_typescript_type(param_spec['items'], required_params) -%}
                {%- if inner_type == "object | object" or inner_type|length > 50 -%}
                    {{- "any[]" }}
                {%- else -%}
                    {{- inner_type + "[]" }}
                {%- endif -%}
            {%- endif -%}
            {%- if param_spec.nullable -%}
                {{- " | null" }}
            {%- endif -%}
        {%- else -%}
            {{- "any[]" }}
            {%- if param_spec.nullable -%}
                {{- " | null" }}
            {%- endif -%}
        {%- endif -%}
    {%- elif param_spec.type is defined and param_spec.type is iterable and param_spec.type is not string and param_spec.type is not mapping and param_spec.type[0] is defined -%}
        {#- Handle array of types like ["object", "object"] from Union[dict, list] #}
        {%- if param_spec.type | length > 1 -%}
            {{- param_spec.type | join(" | ") }}
        {%- else -%}
            {{- param_spec.type[0] }}
        {%- endif -%}
    {%- elif param_spec.oneOf -%}
        {#- Handle oneOf schemas - check for complex unions and fallback to any #}
        {%- set has_object_variants = false -%}
        {%- for variant in param_spec.oneOf -%}
            {%- if variant.type == "object" -%}
                {%- set has_object_variants = true -%}
            {%- endif -%}
        {%- endfor -%}
        {%- if has_object_variants and param_spec.oneOf|length > 1 -%}
            {{- "any" }}
        {%- else -%}
            {%- for variant in param_spec.oneOf -%}
                {{- render_typescript_type(variant, required_params) -}}
                {%- if variant.description %}
                    {{- "// " + variant.description }}
                {%- endif -%}
                {%- if variant.default is defined %}
                    {{ "// default: " + variant.default|tojson }}
                {%- endif -%}
                {%- if not loop.last %}
                    {{- " | " }}
                {% endif -%}
            {%- endfor -%}
        {%- endif -%}
    {%- elif param_spec.type == "string" -%}
        {%- if param_spec.enum -%}
            {{- '"' + param_spec.enum|join('" | "') + '"' -}}
        {%- else -%}
            {{- "string" }}
            {%- if param_spec.nullable %}
                {{- " | null" }}
            {%- endif -%}
        {%- endif -%}
    {%- elif param_spec.type == "number" -%}
        {{- "number" }}
    {%- elif param_spec.type == "integer" -%}
        {{- "number" }}
    {%- elif param_spec.type == "boolean" -%}
        {{- "boolean" }}

    {%- elif param_spec.type == "object" -%}
        {%- if param_spec.properties -%}
            {{- "{\n" }}
            {%- for prop_name, prop_spec in param_spec.properties.items() -%}
                {{- prop_name -}}
                {%- if prop_name not in (param_spec.required or []) -%}
                    {{- "?" }}
                {%- endif -%}
                {{- ": " }}
                {{ render_typescript_type(prop_spec, param_spec.required or []) }}
                {%- if not loop.last -%}
                    {{-", " }}
                {%- endif -%}
            {%- endfor -%}
            {{- "}" }}
        {%- else -%}
            {{- "object" }}
        {%- endif -%}
    {%- else -%}
        {{- "any" }}
    {%- endif -%}
{%- endmacro -%}

{%- macro render_tool_namespace(namespace_name, tools) -%}
    {{- "## " + namespace_name + "\n\n" }}
    {{- "namespace " + namespace_name + " {\n\n" }}
    {%- for tool in tools %}
        {%- set tool = tool.function %}
        {{- "// " + tool.description + "\n" }}
        {{- "type "+ tool.name + " = " }}
        {%- if tool.parameters and tool.parameters.properties %}
            {{- "(_: {\n" }}
            {%- for param_name, param_spec in tool.parameters.properties.items() %}
                {%- if param_spec.description %}
                    {{- "// " + param_spec.description + "\n" }}
                {%- endif %}
                {{- param_name }}
                {%- if param_name not in (tool.parameters.required or []) -%}
                    {{- "?" }}
                {%- endif -%}
                {{- ": " }}
                {{- render_typescript_type(param_spec, tool.parameters.required or []) }}
                {%- if param_spec.default is defined -%}
                    {%- if param_spec.enum %}
                        {{- ", // default: " + param_spec.default }}
                    {%- elif param_spec.oneOf %}
                        {{- "// default: " + param_spec.default }}
                    {%- else %}
                        {{- ", // default: " + param_spec.default|tojson }}
                    {%- endif -%}
                {%- endif -%}
                {%- if not loop.last %}
                    {{- ",\n" }}
                {%- else %}
                    {{- ",\n" }}
                {%- endif -%}
            {%- endfor %}
            {{- "}) => any;\n\n" }}
        {%- else -%}
            {{- "() => any;\n\n" }}
        {%- endif -%}
    {%- endfor %}
    {{- "} // namespace " + namespace_name }}
{%- endmacro -%}

{%- macro render_builtin_tools(browser_tool, python_tool) -%}
    {%- if browser_tool %}
        {{- "## browser\n\n" }}
        {{- "// Tool for browsing.\n" }}
        {{- "// The `cursor` appears in brackets before each browsing display: `[{cursor}]`.\n" }}
        {{- "// Cite information from the tool using the following format:\n" }}
        {{- "// `【{cursor}†L{line_start}(-L{line_end})?】`, for example: `【6†L9-L11】` or `【8†L3】`.\n" }}
        {{- "// Do not quote more than 10 words directly from the tool output.\n" }}
        {{- "// sources=web (default: web)\n" }}
        {{- "namespace browser {\n\n" }}
        {{- "// Searches for information related to `query` and displays `topn` results.\n" }}
        {{- "type search = (_: {\n" }}
        {{- "query: string,\n" }}
        {{- "topn?: number, // default: 10\n" }}
        {{- "source?: string,\n" }}
        {{- "}) => any;\n\n" }}
        {{- "// Opens the link `id` from the page indicated by `cursor` starting at line number `loc`, showing `num_lines` lines.\n" }}
        {{- "// Valid link ids are displayed with the formatting: `【{id}†.*】`.\n" }}
        {{- "// If `cursor` is not provided, the most recent page is implied.\n" }}
        {{- "// If `id` is a string, it is treated as a fully qualified URL associated with `source`.\n" }}
        {{- "// If `loc` is not provided, the viewport will be positioned at the beginning of the document or centered on the most relevant passage, if available.\n" }}
        {{- "// Use this function without `id` to scroll to a new location of an opened page.\n" }}
        {{- "type open = (_: {\n" }}
        {{- "id?: number | string, // default: -1\n" }}
        {{- "cursor?: number, // default: -1\n" }}
        {{- "loc?: number, // default: -1\n" }}
        {{- "num_lines?: number, // default: -1\n" }}
        {{- "view_source?: boolean, // default: false\n" }}
        {{- "source?: string,\n" }}
        {{- "}) => any;\n\n" }}
        {{- "// Finds exact matches of `pattern` in the current page, or the page given by `cursor`.\n" }}
        {{- "type find = (_: {\n" }}
        {{- "pattern: string,\n" }}
        {{- "cursor?: number, // default: -1\n" }}
        {{- "}) => any;\n\n" }}
        {{- "} // namespace browser\n\n" }}
    {%- endif -%}

    {%- if python_tool %}
        {{- "## python\n\n" }}
        {{- "Use this tool to execute Python code in your chain of thought. The code will not be shown to the user. This tool should be used for internal reasoning, but not for code that is intended to be visible to the user (e.g. when creating plots, tables, or files).\n\n" }}
        {{- "When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 120.0 seconds. The drive at '[/mnt/data](http://localhost:8888/mnt/data)' can be used to save and persist user files. Internet access for this session is UNKNOWN. Depends on the cluster.\n\n" }}
    {%- endif -%}
{%- endmacro -%}

{#- System Message Construction ============================================ #}
{%- macro build_system_message() -%}
    {%- if model_identity is not defined %}
        {%- set model_identity = "You are ChatGPT, a large language model trained by OpenAI." %}
    {%- endif %}
    {{- model_identity + "\n" }}
    {{- "Knowledge cutoff: 2024-06\n" }}
    {{- "Current date: " + strftime_now("%Y-%m-%d") + "\n\n" }}
    {%- if reasoning_effort is not defined %}
        {%- set reasoning_effort = "medium" %}
    {%- endif %}
    {{- "Reasoning: " + reasoning_effort + "\n\n" }}
    {%- if builtin_tools is defined and builtin_tools is not none %}
        {{- "# Tools\n\n" }}
        {%- set available_builtin_tools = namespace(browser=false, python=false) %}
        {%- for tool in builtin_tools %}
            {%- if tool == "browser" %}
                {%- set available_builtin_tools.browser = true %}
            {%- elif tool == "python" %}
                {%- set available_builtin_tools.python = true %}
            {%- endif %}
        {%- endfor %}
        {{- render_builtin_tools(available_builtin_tools.browser, available_builtin_tools.python) }}
    {%- endif -%}
    {{- "# Valid channels: analysis, commentary, final. Channel must be included for every message." }}
    {%- if tools -%}
        {{- "\nCalls to these tools must go to the commentary channel: 'functions'." }}
    {%- endif -%}
{%- endmacro -%}

{#- Main Template Logic ================================================= #}
{#- Set defaults #}

{#- Render system message #}
{{- "<|start|>system<|message|>" }}
{{- build_system_message() }}
{{- "<|end|>" }}

{#- Extract developer message #}
{%- if developer_instructions is defined and developer_instructions is not none %}
    {%- set developer_message = developer_instructions %}
    {%- set loop_messages = messages %}
{%- elif messages[0].role == "developer" or messages[0].role == "system" %}
    {%- set developer_message = messages[0].content %}
    {%- set loop_messages = messages[1:] %}
{%- else %}
    {%- set developer_message = "" %}
    {%- set loop_messages = messages %}
{%- endif %}

{#- Render developer message #}
{%- if developer_message or tools %}
    {{- "<|start|>developer<|message|>" }}
    {%- if developer_message %}
        {{- "# Instructions\n\n" }}
        {{- developer_message }}
    {%- endif %}
    {%- if tools -%}
        {%- if developer_message %}
            {{- "\n\n" }}
        {%- endif %}
        {{- "# Tools\n\n" }}
        {{- render_tool_namespace("functions", tools) }}
    {%- endif -%}
    {{- "<|end|>" }}
{%- endif %}

{#- Render messages #}
{%- set last_tool_call = namespace(name=none) %}
{%- for message in loop_messages -%}
    {#- At this point only assistant/user/tool messages should remain #}
    {%- if message.role == 'assistant' -%}
        {#- Checks to ensure the messages are being passed in the format we expect #}
        {%- if "thinking" in message %}
            {%- if "<|channel|>analysis<|message|>" in message.thinking or "<|channel|>final<|message|>" in message.thinking %}
                {{- raise_exception("You have passed a message containing <|channel|> tags in the thinking field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }}
            {%- endif %}
        {%- endif %}
        {%- if "tool_calls" in message %}
            {#- We need very careful handling here - we want to drop the tool call analysis message if the model #}
            {#- has output a later <|final|> message, but otherwise we want to retain it. This is the only case #}
            {#- when we render CoT/analysis messages in inference. #}
            {%- set future_final_message = namespace(found=false) %}
            {%- for future_message in loop_messages[loop.index:] %}
                {%- if future_message.role == 'assistant' and "tool_calls" not in future_message %}
                    {%- set future_final_message.found = true %}
                {%- endif %}
            {%- endfor %}
            {#- We assume max 1 tool call per message, and so we infer the tool call name #}
            {#- in "tool" messages from the most recent assistant tool call name #}
            {%- set tool_call = message.tool_calls[0] %}
            {%- if tool_call.function %}
                {%- set tool_call = tool_call.function %}
            {%- endif %}
            {%- if message.content and message.thinking %}
                {{- raise_exception("Cannot pass both content and thinking in an assistant message with tool calls! Put the analysis message in one or the other, but not both.") }}
            {%- elif message.content and not future_final_message.found %}
                {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.content + "<|end|>" }}
            {%- elif message.thinking and not future_final_message.found %}
                {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
            {%- endif %}
            {{- "<|start|>assistant to=" }}
            {{- "functions." + tool_call.name + "<|channel|>commentary " }}
            {{- (tool_call.content_type if tool_call.content_type is defined else "json") + "<|message|>" }}
            {%- if tool_call.arguments is string %}
                {{- tool_call.arguments }}
            {%- else %}
                {{- tool_call.arguments|tojson }}
            {%- endif %}
            {{- "<|call|>" }}
            {%- set last_tool_call.name = tool_call.name %}
        {%- elif loop.last and not add_generation_prompt %}
            {#- Only render the CoT if the final turn is an assistant turn and add_generation_prompt is false #}
            {#- This is a situation that should only occur in training, never in inference. #}
            {%- if "thinking" in message %}
                {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
            {%- endif %}
            {#- <|return|> indicates the end of generation, but <|end|> does not #}
            {#- <|return|> should never be an input to the model, but we include it as the final token #}
            {#- when training, so the model learns to emit it. #}
            {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|end|>" }}
        {%- elif "thinking" in message %}
            {#- CoT is dropped during all previous turns, so we never render it for inference #}
            {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.content + "<|end|>" }}
            {%- set last_tool_call.name = none %}
        {%- else %}
            {#- CoT is dropped during all previous turns, so we never render it for inference #}
            {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|end|>" }}
            {%- set last_tool_call.name = none %}
        {%- endif %}
    {%- elif message.role == 'tool' -%}
        {%- if last_tool_call.name is none %}
            {{- raise_exception("Message has tool role, but there was no previous assistant message with a tool call!") }}
        {%- endif %}
        {{- "<|start|>functions." + last_tool_call.name }}
        {%- if message.content is string %}
            {{- " to=assistant<|channel|>commentary<|message|>" + message.content + "<|end|>" }}
        {%- else %}
            {{- " to=assistant<|channel|>commentary<|message|>" + message.content|tojson + "<|end|>" }}
        {%- endif %}
    {%- elif message.role == 'user' -%}
        {{- "<|start|>user<|message|>" + message.content + "<|end|>" }}
    {%- endif -%}
{%- endfor -%}

{#- Generation prompt #}
{%- if add_generation_prompt -%}
<|start|>assistant
{%- endif -%}
{# Copyright 2025-present Unsloth. Apache 2.0 License. Unsloth chat template fixes. Edited from ggml-org & OpenAI #}
Using chat eos_token: <|return|>
Using chat bos_token: <|startoftext|>

Model initialized successfully.
Attempting to embed a simple text...

llama_perf_context_print:        load time =     386.83 ms
llama_perf_context_print: prompt eval time =     382.55 ms /    11 tokens (   34.78 ms per token,    28.75 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =     388.28 ms /    12 tokens
llama_perf_context_print:    graphs reused =          0

Embedding successful!
Vector dimension: 2880
First 5 elements: [4.04585599899292, -0.3196412920951843, 2.1197192668914795, 0.38416096568107605, 3.550388813018799]

Based on the Python script and the detailed log output, here is an analysis of what's happening.

The key takeaway is that while your script runs without crashing, it fails to offload any model layers to your AMD GPU and runs entirely on the CPU.


Why is This Happening & How to Fix It

The root cause is an incompatibility between the llama-cpp-python Vulkan backend, your specific GPU drivers, and the quantization formats within that particular GGUF file.

Here are the most likely solutions, in order of what you should try first:

1. Reinstall llama-cpp-python with the Correct Backend Flags

Your current installation might not have been built correctly with full Vulkan support. The best way to ensure this is to force a recompile from source.

# Uninstall the current version first
pip uninstall llama-cpp-python -y

# Reinstall with CMAKE arguments to force the Vulkan backend
# (recent llama-cpp-python builds use the GGML_VULKAN flag; older releases used LLAMA_VULKAN)
CMAKE_ARGS="-DGGML_VULKAN=on" pip install --force-reinstall --no-cache-dir llama-cpp-python

After reinstalling, run your test script again and check the logs to see if it now offloads layers.
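
As a quick sanity check after the rebuild, you can load the model directly with llama-cpp-python and watch the verbose startup output (a minimal sketch; the model path is a placeholder):

from llama_cpp import Llama

# Watch the log for a line like "load_tensors: offloaded N/N layers to GPU".
# If you only see CPU buffers, the Vulkan backend was not compiled into this build.
llm = Llama(
    model_path="/path/to/your/model.gguf",  # placeholder path
    n_gpu_layers=99,
    embedding=True,
    verbose=True,
)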

2. Try a Different Model Quantization

The Unsloth model you're using has a mix of quantization types (q5_0, q8_0, q4_K, mxfp4). It's possible the Vulkan backend has poor support for one of these, especially the less common mxfp4.

Try downloading a more "standard" GGUF model, for example, from the popular creator "TheBloke." A Q4_K_M or Q5_K_M quant from one of his models is highly likely to be compatible. This will help you determine if the issue is with the model file itself or your environment.
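
For example, something along these lines; the repository and filename below are only placeholders for whichever mainstream GGUF you pick:

from huggingface_hub import hf_hub_download
from langchain_community.embeddings import LlamaCppEmbeddings

# Placeholder repo/filename -- substitute any standard Q4_K_M / Q5_K_M GGUF
gguf_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
)
llama = LlamaCppEmbeddings(model_path=gguf_path, n_gpu_layers=99)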

3. (Alternative) Build with ROCm Instead of Vulkan

ROCm is AMD's more direct equivalent to NVIDIA's CUDA. It can sometimes be more stable and performant than the Vulkan backend. If step 1 doesn't work, you could try building with ROCm support. This is more involved as it requires installing the ROCm toolkit from AMD first.

Once ROCm is installed, the pip command would be:

# Make sure to uninstall the old version first
pip uninstall llama-cpp-python -y

# This command is for AMD GPUs with ROCm (hipBLAS); on older releases the flag was LLAMA_HIPBLAS
CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install --force-reinstall --no-cache-dir llama-cpp-python

Your next step of converting the safetensors yourself is a good idea, but I would strongly recommend trying Step 1 first, as an improper build of llama-cpp-python is the most common cause of this exact issue.

After building and re-installing llama-cpp-python with the correct flag, the model (gpt-oss 20b converted with llama.cpp) seems to be loaded onto the GPU properly. However, the error persists. What strikes me most is that the error occurs even when running on the CPU. I'll downgrade llama.cpp to the version supported by llama-cpp-python next.

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon 890M Graphics) - 20822 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 459 tensors from /home/simon/.cache/huggingface/hub/models--openai--gpt-oss-20b/snapshots/6cee5e81ee83917806bbde320786a8fb61efebee/gpt-oss-20b-32x2.4B-6cee5e81ee83917806bbde320786a8fb61efebee-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gpt-oss
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = gpt-oss-20b
llama_model_loader: - kv   3:                           general.finetune str              = 6cee5e81ee83917806bbde320786a8fb61efebee
llama_model_loader: - kv   4:                         general.size_label str              = 32x2.4B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                               general.tags arr[str,2]       = ["vllm", "text-generation"]
llama_model_loader: - kv   7:                        gpt-oss.block_count u32              = 24
llama_model_loader: - kv   8:                     gpt-oss.context_length u32              = 131072
llama_model_loader: - kv   9:                   gpt-oss.embedding_length u32              = 2880
llama_model_loader: - kv  10:                gpt-oss.feed_forward_length u32              = 2880
llama_model_loader: - kv  11:               gpt-oss.attention.head_count u32              = 64
llama_model_loader: - kv  12:            gpt-oss.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                     gpt-oss.rope.freq_base f32              = 150000.000000
llama_model_loader: - kv  14:   gpt-oss.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                       gpt-oss.expert_count u32              = 32
llama_model_loader: - kv  16:                  gpt-oss.expert_used_count u32              = 4
llama_model_loader: - kv  17:               gpt-oss.attention.key_length u32              = 64
llama_model_loader: - kv  18:             gpt-oss.attention.value_length u32              = 64
llama_model_loader: - kv  19:                          general.file_type u32              = 1
llama_model_loader: - kv  20:           gpt-oss.attention.sliding_window u32              = 128
llama_model_loader: - kv  21:         gpt-oss.expert_feed_forward_length u32              = 2880
llama_model_loader: - kv  22:                  gpt-oss.rope.scaling.type str              = yarn
llama_model_loader: - kv  23:                gpt-oss.rope.scaling.factor f32              = 32.000000
llama_model_loader: - kv  24: gpt-oss.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = gpt-4o
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,201088]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,201088]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,446189]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 199998
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 200002
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 199999
llama_model_loader: - kv  34:                    tokenizer.chat_template str              = {#-\n  In addition to the normal input...
llama_model_loader: - type  f32:  289 tensors
llama_model_loader: - type  f16:   98 tensors
llama_model_loader: - type mxfp4:   72 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 12.83 GiB (5.27 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 200017 '<|reserved_200017|>' is not marked as EOG
load: control token: 200014 '<|reserved_200014|>' is not marked as EOG
load: control token: 200011 '<|reserved_200011|>' is not marked as EOG
load: control token: 200009 '<|reserved_200009|>' is not marked as EOG
load: control token: 200008 '<|message|>' is not marked as EOG
load: control token: 200006 '<|start|>' is not marked as EOG
load: control token: 200004 '<|reserved_200004|>' is not marked as EOG
load: control token: 200003 '<|constrain|>' is not marked as EOG
load: control token: 200000 '<|reserved_200000|>' is not marked as EOG
load: control token: 200005 '<|channel|>' is not marked as EOG
load: control token: 200010 '<|reserved_200010|>' is not marked as EOG
load: control token: 200016 '<|reserved_200016|>' is not marked as EOG
load: control token: 200013 '<|reserved_200013|>' is not marked as EOG
load: control token: 199998 '<|startoftext|>' is not marked as EOG
load: control token: 200018 '<|endofprompt|>' is not marked as EOG
load: control token: 200001 '<|reserved_200001|>' is not marked as EOG
load: control token: 200015 '<|reserved_200015|>' is not marked as EOG
load: printing all EOG tokens:
load:   - 199999 ('<|endoftext|>')
load:   - 200002 ('<|return|>')
load:   - 200007 ('<|end|>')
load:   - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch             = gpt-oss
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2880
print_info: n_layer          = 24
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 128
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 2880
print_info: n_expert         = 32
print_info: n_expert_used    = 4
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = yarn
print_info: freq_base_train  = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: model type       = ?B
print_info: model params     = 20.91 B
print_info: general.name     = gpt-oss-20b
print_info: n_ff_exp         = 2880
print_info: vocab type       = BPE
print_info: n_vocab          = 201088
print_info: n_merges         = 446189
print_info: BOS token        = 199998 '<|startoftext|>'
print_info: EOS token        = 200002 '<|return|>'
print_info: EOT token        = 199999 '<|endoftext|>'
print_info: PAD token        = 199999 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 199999 '<|endoftext|>'
print_info: EOG token        = 200002 '<|return|>'
print_info: EOG token        = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device Vulkan0, is_swa = 1
load_tensors: layer   1 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   2 assigned to device Vulkan0, is_swa = 1
load_tensors: layer   3 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   4 assigned to device Vulkan0, is_swa = 1
load_tensors: layer   5 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   6 assigned to device Vulkan0, is_swa = 1
load_tensors: layer   7 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   8 assigned to device Vulkan0, is_swa = 1
load_tensors: layer   9 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  10 assigned to device Vulkan0, is_swa = 1
load_tensors: layer  11 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  12 assigned to device Vulkan0, is_swa = 1
load_tensors: layer  13 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  14 assigned to device Vulkan0, is_swa = 1
load_tensors: layer  15 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  16 assigned to device Vulkan0, is_swa = 1
load_tensors: layer  17 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  18 assigned to device Vulkan0, is_swa = 1
load_tensors: layer  19 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  20 assigned to device Vulkan0, is_swa = 1
load_tensors: layer  21 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  22 assigned to device Vulkan0, is_swa = 1
load_tensors: layer  23 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  24 assigned to device Vulkan0, is_swa = 0
load_tensors: tensor 'token_embd.weight' (f16) (and 0 others) cannot be used with preferred buffer type Vulkan_Host, using CPU instead
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors:      Vulkan0 model buffer size = 12036.67 MiB
load_tensors:   CPU_Mapped model buffer size =  1104.61 MiB
....................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 9182
llama_context: n_ctx_per_seq = 9182
llama_context: n_batch       = 1024
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: kv_unified    = false
llama_context: freq_base     = 150000.0
llama_context: freq_scale    = 0.03125
llama_context: n_ctx_per_seq (9182) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: Vulkan_Host  output buffer size =     0.78 MiB
create_memory: n_ctx = 9184 (padded)
llama_kv_cache_unified_iswa: using full-size SWA cache (ref: https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 9184 cells
llama_kv_cache_unified: layer   0: skipped
llama_kv_cache_unified: layer   1: dev = Vulkan0
llama_kv_cache_unified: layer   2: skipped
llama_kv_cache_unified: layer   3: dev = Vulkan0
llama_kv_cache_unified: layer   4: skipped
llama_kv_cache_unified: layer   5: dev = Vulkan0
llama_kv_cache_unified: layer   6: skipped
llama_kv_cache_unified: layer   7: dev = Vulkan0
llama_kv_cache_unified: layer   8: skipped
llama_kv_cache_unified: layer   9: dev = Vulkan0
llama_kv_cache_unified: layer  10: skipped
llama_kv_cache_unified: layer  11: dev = Vulkan0
llama_kv_cache_unified: layer  12: skipped
llama_kv_cache_unified: layer  13: dev = Vulkan0
llama_kv_cache_unified: layer  14: skipped
llama_kv_cache_unified: layer  15: dev = Vulkan0
llama_kv_cache_unified: layer  16: skipped
llama_kv_cache_unified: layer  17: dev = Vulkan0
llama_kv_cache_unified: layer  18: skipped
llama_kv_cache_unified: layer  19: dev = Vulkan0
llama_kv_cache_unified: layer  20: skipped
llama_kv_cache_unified: layer  21: dev = Vulkan0
llama_kv_cache_unified: layer  22: skipped
llama_kv_cache_unified: layer  23: dev = Vulkan0
llama_kv_cache_unified:    Vulkan0 KV buffer size =   215.25 MiB
llama_kv_cache_unified: size =  215.25 MiB (  9184 cells,  12 layers,  1/1 seqs), K (f16):  107.62 MiB, V (f16):  107.62 MiB
llama_kv_cache_unified_iswa: creating     SWA KV cache, size = 9184 cells
llama_kv_cache_unified: layer   0: dev = Vulkan0
llama_kv_cache_unified: layer   1: skipped
llama_kv_cache_unified: layer   2: dev = Vulkan0
llama_kv_cache_unified: layer   3: skipped
llama_kv_cache_unified: layer   4: dev = Vulkan0
llama_kv_cache_unified: layer   5: skipped
llama_kv_cache_unified: layer   6: dev = Vulkan0
llama_kv_cache_unified: layer   7: skipped
llama_kv_cache_unified: layer   8: dev = Vulkan0
llama_kv_cache_unified: layer   9: skipped
llama_kv_cache_unified: layer  10: dev = Vulkan0
llama_kv_cache_unified: layer  11: skipped
llama_kv_cache_unified: layer  12: dev = Vulkan0
llama_kv_cache_unified: layer  13: skipped
llama_kv_cache_unified: layer  14: dev = Vulkan0
llama_kv_cache_unified: layer  15: skipped
llama_kv_cache_unified: layer  16: dev = Vulkan0
llama_kv_cache_unified: layer  17: skipped
llama_kv_cache_unified: layer  18: dev = Vulkan0
llama_kv_cache_unified: layer  19: skipped
llama_kv_cache_unified: layer  20: dev = Vulkan0
llama_kv_cache_unified: layer  21: skipped
llama_kv_cache_unified: layer  22: dev = Vulkan0
llama_kv_cache_unified: layer  23: skipped
llama_kv_cache_unified:    Vulkan0 KV buffer size =   215.25 MiB
llama_kv_cache_unified: size =  215.25 MiB (  9184 cells,  12 layers,  1/1 seqs), K (f16):  107.62 MiB, V (f16):  107.62 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 3672
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
llama_context:    Vulkan0 compute buffer size =  1215.14 MiB
llama_context: Vulkan_Host compute buffer size =    45.51 MiB
llama_context: graph nodes  = 1446
llama_context: graph splits = 2
CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
Model metadata:

After some more digging and trial and error I found out that this bug may be related to an API change in llama.cpp which has not been taken into account by the llama-cpp-python developer (who has not been active since August). Embedding only one chunk, ids = vector_store.add_documents(documents=all_splits[0:1]), seems to work. Then, when the model's sequence id should be reset in order to process the next chunk, the old API is used to do that (kv_clear_something) and that call fails. Hence the sequence id is not (re)set and the error occurs. If that is the case, the bug obviously resides in llama-cpp-python.

Your diagnosis is correct: the error happens because llama-cpp-python fails to properly reset the model's state (the KV cache and sequence ids) between documents when processing a batch. It attempts to use an outdated API call to clear the context, which fails, so the next chunk is decoded with an invalid seq_id and llama_decode returns -1.
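
One way to narrow this down is to bypass LangChain entirely and call llama-cpp-python directly. A minimal sketch (the model path is a placeholder): if the single-text call succeeds but the multi-text call fails with the same llama_decode error, the bug sits below LangChain and PGVector:

from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/gpt-oss-20b.gguf",  # placeholder path
    embedding=True,
    n_gpu_layers=99,
)

# Embedding a single text works (as observed above)...
single = llm.create_embedding("first chunk of text")

# ...while embedding several texts in one call goes through the batched
# decode path and should reproduce the failure if the bug is in llama-cpp-python.
batch = llm.create_embedding(["first chunk of text", "second chunk of text"])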

Solutions

Here are a few ways to work around this issue, from the simplest to the most robust.

1. The Simple Loop (Recommended Workaround)

Instead of passing the entire list of documents to add_documents at once, you can simply loop through the list and add them one by one. This forces a new, clean embedding process for each document, completely bypassing the batching bug.

Change this:

# Fails because it triggers the batching bug
ids = vector_store.add_documents(documents=all_splits)

To this:

# Works by adding documents one at a time
print(f"Adding {len(all_splits)} document chunks to the vector store...")
for doc in all_splits:
    vector_store.add_documents([doc]) # Pass a list containing just one document
print("All documents added successfully.")

This is the most direct and reliable fix for the code you've provided.
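
If you would rather keep a single add_documents call, a thin wrapper that embeds one text per call should work as well (a sketch, not a tested fix; the class name is just an example):

from langchain_community.embeddings import LlamaCppEmbeddings

class OneAtATimeLlamaCppEmbeddings(LlamaCppEmbeddings):
    """Embed documents one at a time to sidestep the batched-decode bug."""

    def embed_documents(self, texts):
        # embed_query handles a single string, so each call is a single sequence
        return [self.embed_query(t) for t in texts]

llama = OneAtATimeLlamaCppEmbeddings(
    model_path="/path/to/your/model.gguf",  # placeholder path
    n_gpu_layers=99,
)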

2. Update llama-cpp-python

The bug you've encountered has been a known issue. While the original developer may have been inactive, the community has often stepped in. Try forcing a re-install of the latest version, which may have a patch.

Make sure to specify the CMAKE_ARGS to enable the correct hardware acceleration for your system (the example below uses CUDA; on your AMD setup use the Vulkan or hipBLAS flag shown earlier instead).

# Uninstall the old version first
pip uninstall llama-cpp-python -y

# Reinstall the latest version with hardware acceleration
# (GGML_CUDA replaced the deprecated LLAMA_CUBLAS flag in newer releases)
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

After updating, try your original code again. If it's been patched, it should work.

3. Use a Different Embeddings Class

If the issue persists and you want a more stable solution for local embeddings, you can switch to a different LangChain class that doesn't depend on llama-cpp-python. The HuggingFaceEmbeddings class (which uses the highly-optimized sentence-transformers library) is an excellent alternative.

You would first need to download a model specifically designed for embeddings, like BAAI/bge-large-en-v1.5.

Example of switching:

# pip install sentence-transformers
from langchain_community.embeddings import HuggingFaceEmbeddings

# Replace LlamaCppEmbeddings with this
model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {"device": "cuda"}  # or "cpu", "mps"
encode_kwargs = {"normalize_embeddings": True}

embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

# The rest of your code remains the same
# vector_store = PGVector(embeddings=embeddings, ...)
# ids = vector_store.add_documents(documents=all_splits)  # This will now work correctly