Can't create a session (local model)
System Info
transformers.js 2.17.2
Environment/Platform
- Website/web-app
- Browser extension
- Server-side (e.g., Node.js, Deno, Bun)
- Desktop app (e.g., Electron)
- Other (e.g., VSCode extension)
Description
I have my model available under the /models/tokenizer and /models/onnx/onnx paths.
I'm loading the tokenizer with:
const tokenizerPath = '/tokenizer/';
tokenizer = await AutoTokenizer.from_pretrained(tokenizerPath);
And the model with:
const modelPath = '/onnx/';
model = await AutoModel.from_pretrained(modelPath, {
    model_file_name: "model",
    quantized: false
});
I have my model under these two paths because the library always appends /onnx a second time. I also had to set quantized to false, because otherwise a suffix was added to the file name. I found it very difficult to get it to load the model from the right path.
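As far as I can tell (this is my reading of the lookup behaviour, not the library's actual code), the file URL is built roughly as sketched below, which would explain why my workaround ends up with /onnx twice and why quantized: false is needed to drop the file suffix:

import { env } from '@xenova/transformers';

// Rough sketch of how I understand the local lookup, not the library's actual code:
//   {env.localModelPath}/{pretrained_model_name}/onnx/{model_file_name}{suffix}.onnx
// where suffix is '_quantized' when quantized is true (the default), and '' otherwise.
// So from_pretrained('/onnx/', { model_file_name: "model", quantized: false })
// resolves to /models/onnx/onnx/model.onnx, which matches my folder layout above.
console.log(env.localModelPath); // '/models/' by default in the browser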
Anyway, it now fetches the tokenizer and model files correctly, but then I get this error in the browser console (the same in Chrome and Firefox):
Error: Can't create a session
Reproduction
I can't provide full reproduction steps because I'm using a local model.
But the code is:
import { AutoModel, AutoTokenizer, env } from '@xenova/transformers';

let model;
let tokenizer;

// Function to load the tokenizer
async function loadTokenizer() {
    try {
        // Path to your tokenizer files
        const tokenizerPath = '/tokenizer/';
        // Initialize tokenizer
        tokenizer = await AutoTokenizer.from_pretrained(tokenizerPath);
        console.log('Tokenizer loaded successfully.');
    } catch (error) {
        console.error('Failed to load the tokenizer:', error);
    }
}

// Function to load the ONNX model
async function loadModel() {
    try {
        // Path to your ONNX model
        const modelPath = '/onnx/';
        // Initialize model
        //env.remoteHost = 'https://hf-mirror.com';
        model = await AutoModel.from_pretrained(modelPath, {
            model_file_name: "model",
            quantized: false
        });
        console.log('ONNX Model loaded successfully.');
    } catch (error) {
        console.error('Failed to load the model:', error);
    }
}
I've created a small snippet to check that my ONNX model is fine:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("./public/models/onnx/onnx")
model = ORTModelForCausalLM.from_pretrained("./public/models/onnx/onnx", use_cache=False, use_io_binding=False)
inputs = tokenizer("My name is Philipp and I live in Germany.", return_tensors="pt")
gen_tokens = model.generate(**inputs, do_sample=True, temperature=0.9, min_length=20, max_length=20)
response = tokenizer.batch_decode(gen_tokens)
print("Generated text:", response)
So it works with optimum.onnxruntime, and I'd like to understand how to make it work with transformers.js.
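For reference, this is roughly the transformers.js equivalent I'm aiming for, assuming the model loads at all. I'm using the pipeline API here, and 'my-model' is just a placeholder for a folder under /models/:

import { pipeline, env } from '@xenova/transformers';

// Look for models on my own server only, never on the Hugging Face Hub.
env.allowLocalModels = true;
env.allowRemoteModels = false;

// 'my-model' is a placeholder for a folder under public/models/ containing
// config.json, the tokenizer files and onnx/model.onnx.
const generator = await pipeline('text-generation', 'my-model', { quantized: false });

const output = await generator('My name is Philipp and I live in Germany.', {
    max_new_tokens: 20,
    do_sample: true,
    temperature: 0.9,
});
console.log(output);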
Hey, I've done some work with custom models.
My workflow looks something like this.
The most important thing to note when using pretrained models is to ensure the structure is correct. This is just public_root/models/[name_of_model]/onnx/model.onnx.
My folder structure for the model (which is U2Net) looks like this:
public/models/u2netp/onnx/model.onnx
public/models/u2netp/config.json
public/models/u2netp/preprocessor_config.json
It's also possible to have a model_quantized.onnx, which is what AutoModel looks for by default.
Then to reference that model I have the following.
import { env, AutoModel, AutoProcessor, RawImage } from '@xenova/transformers';

// Force transformers to only look locally and not make any fetch requests.
env.allowLocalModels = true;
env.allowRemoteModels = false;

async function main() {
    // Create the processor.
    // The name here should match the name of the folder in `models`.
    const processor = await AutoProcessor.from_pretrained('u2netp')
        .catch(error => new Error(error));
    if (processor instanceof Error) {
        console.log(processor.message);
        return;
    }

    // U2Net is an image based model, so you might skip this step.
    const url = 'https://example.com/test.png';
    const image = await RawImage.fromURL(url)
        .catch(error => new Error(error));
    if (image instanceof Error) {
        console.error(image.message);
        return;
    }

    // Preprocess the image.
    const processed = await processor(image);

    // Create the model; again the name should match the name of the folder.
    // I am passing quantized: false because I do not have a
    // `model_quantized.onnx` within the folder.
    const model = await AutoModel.from_pretrained('u2netp', {
        quantized: false,
    });

    // Get the outputs of the model.
    const outputs = await model({ 'input': processed });
}

main();
Thanks @BritishWerewolf
I have the right structure now, but I still get this error. I think it's caused by the way the model is converted.
I also get this warning before the error:
onnxruntime::model_load_utils::ValidateOpsetForDomain(const std::unordered_map<std::string, int> &, const logging::Logger &, bool, const std::string &, int) ONNX Runtime only *guarantees* support for models stamped with official released onnx opset versions. Opset 5 is under development and support for this is limited. The operator schemas and or other functionality may change before next ONNX release and in this case ONNX Runtime will not guarantee backward compatibility. Current official support for domain ai.onnx.ml is till opset 3.
I've also tried changing the opset value when exporting my model to ONNX (from GGUF) with optimum-cli, but I couldn't find a way to make it work.
So maybe the problem comes from the way I generate my original GGUF model. I'm using unsloth to train the model and export it to GGUF.
It looks like you have everything nearly there, but your ONNX model is using an under-development opset.
To be clear, are you saying you did something like this:
optimum-cli export onnx --model path_to_gguf_model --output path_to_output_directory --opset 3
I think setting it to opset 3 is important.
Other than that, I haven't worked with GGUF, so I'm not entirely sure how to help.
It seems that if you can export or convert to ONNX, the rest of the code should work.
I used --opset 18.
I came to the conclusion that I should use this value after reading https://onnxruntime.ai/docs/reference/compatibility.html#onnx-opset-support
I've also tried with 3 because of the warning, but in that case I got:
Opset 3 is lower than the recommended minmum opset (14) to export llama. The ONNX export may fail or the exported model may be suboptimal.
...
ValueError: Unsupported ONNX opset version: 3
I switched to @huggingface/transformers and now I get:
app.js:46 Failed to load the model: 20419424
I'm also getting the same error if I try to use the onnxruntime-web package directly.
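For reference, this is roughly how I'm loading it with the new package. As far as I understand, dtype: 'fp32' is the v3 replacement for quantized: false, and 'my-model' is just a placeholder for my local folder:

import { AutoModel, AutoTokenizer, env } from '@huggingface/transformers';

// Still only looking for models under /models/ on my own server.
env.allowLocalModels = true;
env.allowRemoteModels = false;

// 'my-model' is a placeholder for my local folder under public/models/.
const tokenizer = await AutoTokenizer.from_pretrained('my-model');
const model = await AutoModel.from_pretrained('my-model', {
    dtype: 'fp32',  // my understanding of the v3 equivalent of quantized: false
    device: 'wasm', // run on the default WASM backend
});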
What are the specifics of the model?
Looking at the ONNX Runtime side, I found the following issue:
microsoft/onnxruntime-genai#761 (comment)
Currently, ONNX Runtime GenAI's model builder only supports converting float16/float32 GGUF models and not already-quantized GGUF models. If you have the original float16/float32 weights in a GGUF file, you can try using that to get the ONNX model.
Does any of that help?
Are you using float16?
Thanks @BritishWerewolf
I'm going to try to disable quantization.
Same error without quantization in the original GGUF model.
Can you help me understand GGUF?
How do you create models? I am wondering if I can replicate something on my machine.
I'm using unsloth to train my model and export it to gguf.
Here is the code I use:
from unsloth import FastLanguageModel
import torch
import json

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none", # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False, # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

from datasets import Dataset, DatasetDict

# Load the JSON file manually
with open("prompts.json", "r") as file:
    data = json.load(file)

# Function to flatten and format the conversations
def format_conversations(data):
    formatted_convos = []
    for convo in data:
        formatted_convo = [{"from": message["from"], "value": message["value"]} for message in convo]
        formatted_convos.append({"conversations": formatted_convo})
    return formatted_convos

# Apply formatting
formatted_data = format_conversations(data)

# Convert to Dataset
dataset = Dataset.from_list(formatted_data)

from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3",
    mapping = {"role" : "from", "content" : "value", "user" : "user", "assistant" : "assistant"},
    map_eos_token = True,
)

def formatting_prompts_func(convos):
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos["conversations"]]
    return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True,)

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 1000,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

trainer_stats = trainer.train()

model.save_pretrained("model")
tokenizer.save_pretrained("model")
model.save_pretrained_gguf("model_gguf", tokenizer, quantization_method="not_quantized")