EricLBuehler/mistral.rs

Model Wishlist

EricLBuehler opened this issue · 101 comments

Please let us know what model architectures you would like to be added!

Up-to-date todo list below. Please feel free to contribute any model; a PR without device mapping, ISQ, etc. will still be merged!

Language models

  • snowflake-arctic-instruct: Snowflake/snowflake-arctic-instruct
  • WizardLM-2: alpindale/WizardLM-2-8x22B
  • Command R: CohereForAI/c4ai-command-r-v01
  • Command R+: CohereForAI/c4ai-command-r-plus

Multimodal models

Embedding models

  • T5: google-t5/t5-base
  • nomic-text-embed: nomic-ai/nomic-embed-text-v1

qwen1.5-72B-Chat

llama3

@NiuBlibing, we have llama3 support ready: the README has a few examples. I will add Qwen support shortly.

@NiuBlibing, I just added Qwen2 support. Quantized Qwen2 support will be added in the next few days.

Hello!
Any plans for adding multimodal (e.g. llava) and embedding models?

Can you add https://huggingface.co/Snowflake/snowflake-arctic-instruct?

@cargecla1, yes! It will be a great use case for ISQ.

Hello!
Any plans for adding multimodal (e.g. llava) and embedding models?

@francis2tm, yes. I plan on supporting Llava and embedding models this week.

@NiuBlibing, you can run Qwen now with ISQ, which will quantize it.
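
For example, something like the following (illustrative only; the `--isq` flag placement, the `Q4K` level, and the `qwen2` architecture id should be checked against the README for your build):

./mistralrs-server --isq Q4K plain -m Qwen/Qwen1.5-72B-Chat -a qwen2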

Would be nice to support at least one strong vision-language model, e.g. https://huggingface.co/openbmb/MiniCPM-V-2 or https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5, with an option to compute the visual frontend model on the CPU. You might find it easier to ship the visual transformer part via ONNX.

Would love to see some DeepSeek-VL; this model is better than Llava and supports multiple images per prompt:
https://huggingface.co/collections/deepseek-ai/deepseek-vl-65f295948133d9cf92b706d3

Also, outside the LLM world, would love to see support for https://github.com/cvg/LightGlue :) but not sure if that's possible ...

Could you add support for GGUF-quantized Phi-3-Mini to the wishlist? Currently, this fails (built from master):

Running `./mistralrs-server gguf -m PrunaAI/Phi-3-mini-128k-instruct-GGUF-Imatrix-smashed -t microsoft/Phi-3-mini-128k-instruct -f /home/jett/Downloads/llms/Phi-3-mini-128k-instruct-q3_K_S.gguf`
2024-04-29T03:08:35.180939Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: false
2024-04-29T03:08:35.180975Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-04-29T03:08:35.180982Z  INFO mistralrs_server: Loading model `microsoft/Phi-3-mini-128k-instruct` on Cpu...
2024-04-29T03:08:35.180989Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-04-29T03:08:35.181017Z  INFO hf_hub: Token file not found "/home/jett/.cache/huggingface/token"    
2024-04-29T03:08:35.181048Z  INFO mistralrs_core::utils::tokens: Could not load token at "/home/jett/.cache/huggingface/token", using no HF token.
2024-04-29T03:08:35.181122Z  INFO hf_hub: Token file not found "/home/jett/.cache/huggingface/token"    
2024-04-29T03:08:35.181133Z  INFO mistralrs_core::utils::tokens: Could not load token at "/home/jett/.cache/huggingface/token", using no HF token.
Error: Unknown GGUF architecture `phi3`

It'll be great to see WizardLM-2 and suzume. And thanks for a great tool!

W4G1 commented

Command-R and Command-R+ from Cohere would be amazing 🙏

T5
LLaVA

@kir-gadjello

Would be nice to support at least one strong vision-language model, e.g. https://huggingface.co/openbmb/MiniCPM-V-2 or https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5, with an option to compute the visual frontend model on the CPU. You might find it easier to ship the visual transformer part via ONNX.

Supporting a vision+language or multimodal model is very high priority right now.


@chelbos

Would love to see some DeepSeek-VL; this model is better than Llava and supports multiple images per prompt:
https://huggingface.co/collections/deepseek-ai/deepseek-vl-65f295948133d9cf92b706d3

I'll add this one too.

Also, outside the LLM world, would love to see support for https://github.com/cvg/LightGlue :) but not sure if that's possible ...

I will look into it!


@jett06

Could you add support for GGUF-quantized Phi-3-Mini to the wishlist?

Yes, absolutely, I think it should be easy. In the meantime, you can use ISQ to get the same speed.
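
For example (illustrative command shape only; check the README for the exact ISQ flag and the supported quantization levels):

./mistralrs-server --isq Q4K plain -m microsoft/Phi-3-mini-128k-instruct -a phi3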


@rodion-m

It'll be great to see WizardLM-2 and suzume. And thanks for a great tool!

Thanks! I think suzume is just a fine-tuned Llama, so it can be used already. I'll add WizardLM.


@W4G1

Command-R and Command-R+ from Cohere would be amazing 🙏

Yes, I'll add those.


@yongkangzhao

T5 and LLaVA

Yes, I'll add those. T5 will be a nice smaller model.

@EricLBuehler Thanks for your reply, for adding my suggestion to the model wishlist, and for developing such an awesome project! It's very appreciated :)

ldt commented

Congrats on your great work!
+1 for vision models; Idefics2-8b or better would be awesome.

It would be nice to add some embedding models like nomic-text-embed.

Hello, first of all, I want to express my appreciation for the excellent work your team has accomplished on the mistral.rs engine. It's a great project.

I am currently developing a personal AI assistant using Rust, and I believe integrating additional features into your engine could significantly enhance its utility and appeal. Specifically, adding support for Whisper and incorporating Text-to-Speech (TTS) functionality, such as StyleTTS or similar technologies, would be incredibly beneficial. This would enable the engine to handle LLM inference, speech-to-text, and text-to-speech in a single, fast, unified system (near real-time).

Implementing these features could transform the engine into a more versatile tool for developers like myself, who are keen on building more integrated and efficient AI applications.

@jett06, I just added quantized GGUF Phi-3 support in #276! That is without LongRope support currently, but you can use a plain model with ISQ.

@EricLBuehler Woah, thank you so much! This will be lovely for us folks with less powerful computers or size constraints, you're awesome :)

@jett06, my pleasure! I just fixed a small bug (in case you saw the strange behavior), so it should be all ready to go now!

IBM's Granite series Code Models.

Granite Code Models

@NeroHin

IBM's Granite series Code Models.

Granite Code Models

The 3b and 8b variants should already be supported as they are just based on the llama architecture.

The 20b and 34b variants are based on the GPTBigCode architecture which currently isn't implemented in mistral.rs.

Hello! Any plans for adding multimodal (e.g. llava) and embedding models?

I'm working on it now: chenwanqq/candle-llava
It's not easy, dude; tons of image preprocessing and tensor concatenation.

I'm working on it now: chenwanqq/candle-llava
It's not easy, dude; tons of image preprocessing and tensor concatenation.

Yes! Would you be willing to contribute your implementation here once it's done?

I'm working on it now: chenwanqq/candle-llava
It's not easy, dude; tons of image preprocessing and tensor concatenation.

Yes! Would you be willing to contribute your implementation here once it's done?

Yes of course!

Yes of course!

Great, looking forward to that! I'm working on Idefics 2 here: #309.

Yi-34B

Google T5 would be a really cool addition.

https://huggingface.co/microsoft/Phi-3-vision-128k-instruct would be nice.

@bachp, #309 is working on vision support. It adds a lot of infrastructure for that, so when I merge it I should be able to add that model soon!

Yes of course!

Great, looking forward to that! I'm working on Idefics 2 here: #309.

I have made significant progress on the candle-llava project. However, I hope you don't mind me contributing the code to the official candle project first, and then adapting to the infrastructure changes in #309 once it has settled. In addition, I have absolutely no experience with quantization or LoRA, so it may take some time for me to learn.

For embedding models it would be awesome to support snowflake-arctic-embed 🙏

For embedding models it would be awesome to support snowflake-arctic-embed 🙏

@nicarq, sure! I'll try to do it this weekend.

I have made significant progress on the candle-llava project. However, I hope you don't mind me contributing the code to the official candle project first, and then adapting to the infrastructure changes in #309 once it has settled. In addition, I have absolutely no experience with quantization or LoRA, so it may take some time for me to learn.

@chenwanqq that's fine! Actually, after I merge #309 or #351 (I'll let you know if that's all right), I think it would be enough for you to open a PR where you add your code without LoRA, quantization, device mapping, etc. Basically, you would do (hopefully minimal) porting work, and then I can add those features later. Does that sound good?

I have made significant progress on the candle-llava project. However, I hope you don't mind me contributing the code to the official candle project first, and then adapting to the infrastructure changes in #309 once it has settled. In addition, I have absolutely no experience with quantization or LoRA, so it may take some time for me to learn.

@chenwanqq that's fine! Actually, after I merge #309 or #351 (I'll let you know if that's all right), I think it would be enough for you to open a PR where you add your code without LoRA, quantization, device mapping, etc. Basically, you would do (hopefully minimal) porting work, and then I can add those features later. Does that sound good?

Great! I'll get to work!

Hi @chenwanqq! I noticed that your Candle PR was merged - congratulations! Would you have the time to contribute your Llava model here? If you do, I can add the LoRA/quantization/device mapping features in a later PR.

I would also like to see support for models quantized with an importance matrix (imatrix), as described below:
ggerganov/llama.cpp#4861

Hi @chenwanqq! I noticed that your Candle PR was merged - congratulations! Would you have the time to contribute your Llava model here? If you do, I can add the LoRA/quantization/device mapping features in a later PR.

No problem, just waiting for #309 (I hope the vision model API can be kept as aligned as possible).

No problem, just waiting for #309 (I hope the vision model API can be kept as aligned as possible).

Would a merge of #351 also work for that? The two PRs use the same API.

No problem, just waiting for #309 (I hope the vision model API can be kept as aligned as possible).

Would a merge of #351 also work for that? The two PRs use the same API.

I think I will start this part of the work after this weekend. I spent some time the other day studying how to implement a GPU version of nonzero, but I ran into some difficulties. I think I will come back here before I figure that out.

Hi @chenwanqq I think that using the CPU-only approach for now should be fine. Please see our API:

/// Equivalent to: `torch.nonzero(x, as_tuple=False)`
///
/// This performs the operation on the CPU and as such triggers a device synchronization.
/// The output tensor is `DType::U32` and the device will be the same as the input.
pub fn nonzero<T: WithDType>(&self, x: &Tensor) -> Result<Tensor> {
    let dev = x.device();
    let x = x.to_vec2::<T>()?;
    // Collect the (row, col) index pair of every nonzero element, one small tensor per match.
    let res = x
        .par_iter()
        .enumerate()
        .flat_map(|(i, x_row)| {
            x_row
                .par_iter()
                .enumerate()
                .filter_map(|(j, x)| {
                    if *x != T::zero() {
                        Some(vec![i as u32, j as u32])
                    } else {
                        None
                    }
                })
                .collect::<Vec<_>>()
        })
        .map(|x| Tensor::from_slice(&x, (x.len(),), dev))
        .collect::<Result<Vec<_>>>()?;
    // Stack the per-element index pairs into an [n, 2] tensor.
    Tensor::stack(&res, 0)
}

Is Phi-3 Medium supported?

I get this error when I send the first message in interactive mode:

2024-06-08T18:58:31.530887Z ERROR mistralrs_core::engine: prompt step - Model failed with error: ShapeMismatchBinaryOp { lhs: [1, 16, 1280], rhs: [1, 16, 40, 128], op: "reshape" }
2024-06-08T18:58:31.531072Z ERROR mistralrs_server::interactive_mode: Got a model error: "shape mismatch in reshape, lhs: [1, 16, 1280], rhs: [1, 16, 40, 128]", response: ChatCompletionResponse { id: "0", choices: [Choice { finish_reason: "error", index: 0, message: ResponseMessage { content: "", role: "assistant" }, logprobs: None }], created: 1717873111, model: ".", system_fingerprint: "local", object: "chat.completion", usage: Usage { completion_tokens: 0, prompt_tokens: 16, total_tokens: 16, avg_tok_per_sec: 134.45378, avg_prompt_tok_per_sec: inf, avg_compl_tok_per_sec: NaN, total_time_sec: 0.119, total_prompt_time_sec: 0.0, total_completion_time_sec: 0.0 } }

Thanks

Hi @siriux, can you please raise an issue to track progress, including the command you ran and the output with RUST_BACKTRACE=1? Thank you!

Edit: The following command works on commit f257423.

cargo run --release --features cuda -- -i plain -m microsoft/Phi-3-medium-128k-instruct -a phi3

Hi @EricLBuehler, I'm running it with a local GGUF file, not using Hugging Face, so maybe that's the difference.

I've opened this issue with more details: #413

Thanks

Hi @chenwanqq I think that using the CPU-only approach for now should be fine. Please see our API (the `nonzero` implementation quoted above).

I managed to write a GPU version of nonzero: candle-nonzero.
Actually, according to my benchmarks, rayon has too much context-switching overhead and can't even compete with vanilla sequential computation.
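
For reference, a "vanilla sequential" nonzero is just a plain double loop along these lines (a sketch using candle_core, for illustration only; this is not the actual candle-nonzero code):

use candle_core::{Device, Result, Tensor, WithDType};

// Sequential CPU nonzero, for comparison with the rayon version above:
// collect the (row, col) pair of every nonzero element into one flat buffer,
// then build a single [n, 2] U32 tensor.
fn nonzero_sequential<T: WithDType>(x: &Tensor) -> Result<Tensor> {
    let dev = x.device();
    let rows = x.to_vec2::<T>()?;
    let mut idx: Vec<u32> = Vec::new();
    for (i, row) in rows.iter().enumerate() {
        for (j, v) in row.iter().enumerate() {
            if *v != T::zero() {
                idx.push(i as u32);
                idx.push(j as u32);
            }
        }
    }
    let n = idx.len() / 2;
    Tensor::from_vec(idx, (n, 2), dev)
}

fn main() -> Result<()> {
    // [[0, 1, 0],
    //  [2, 0, 3]]
    let x = Tensor::from_slice(&[0f32, 1., 0., 2., 0., 3.], (2, 3), &Device::Cpu)?;
    let idx = nonzero_sequential::<f32>(&x)?;
    println!("{idx}"); // (row, col) pairs: [[0, 1], [1, 0], [1, 2]]
    Ok(())
}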

That is great! Would it be OK if I add this into mistral.rs, or would you like to contribute it?

That is great! Would it be OK if I add this into mistral.rs, or would you like to contribute it?

I'll try to do it. 😁
Since it will introduce some CUDA code, the build process may be affected, so please check it after I open a pull request.
(❁´◑`❁)

Ok, great.

Ok, great.

You can check my #422. I hope you don't mind me modifying the API of Nonzero 🙉

Not a problem 😄

@NeroHin

IBM's Granite series Code Models.
Granite Code Models

The 3b and 8b variants should already be supported as they are just based on the llama architecture.

The 20b and 34b variants are based on the GPTBigCode architecture which currently isn't implemented in mistral.rs.

The 3b and 8b variants do not work out of the box; they rely on tied word embeddings (which I was able to get working in mistral.rs), but the BPE tokenizer breaks because there are some tokens in the vocab list that are > 255 characters.

+1 to getting support for GPTBigCode and other starcoder model variants.
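
On the tied-word-embeddings point above: "tied" just means the LM head reuses the token-embedding matrix instead of loading a separate lm_head weight. A rough candle-style sketch of the idea (toy sizes and names, not the mistral.rs or Granite code):

use candle_core::{Device, Result, Tensor};
use candle_nn::{Embedding, Linear, Module};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let (vocab_size, hidden_size) = (8, 4); // toy sizes for illustration
    // A single [vocab_size, hidden_size] weight...
    let weight = Tensor::randn(0f32, 1., (vocab_size, hidden_size), &dev)?;
    // ...used both as the input embedding table...
    let embed_tokens = Embedding::new(weight.clone(), hidden_size);
    // ...and as the output projection, so the checkpoint carries no separate lm_head weight.
    let lm_head = Linear::new(weight, None);

    let ids = Tensor::new(&[1u32, 3, 5], &dev)?;
    let hidden = embed_tokens.forward(&ids)?; // [3, hidden_size]
    let logits = lm_head.forward(&hidden)?;   // [3, vocab_size]
    println!("{:?}", logits.dims());
    Ok(())
}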

@EricLBuehler I'm still working on LLaVA. Meanwhile, given your experience with Rust and Candle, have you ever encountered any problems with memory usage? I have some confusion about it: huggingface/candle#2273 (comment)

@chenwanqq, that is great, let me know if I can help!

I replied to discussion 2272. However, I discovered that the shadowing does mean that the big tensor will not get dropped! See this playground and my comment for more details.

I'll add a clippy lint here to avoid this on our end.
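
To make the shadowing point concrete, here is a minimal standalone sketch (toy types, not candle code): rebinding a name with `let` does not drop the old value; it stays alive until the end of the enclosing scope.

struct Big(Vec<u8>);

impl Drop for Big {
    fn drop(&mut self) {
        println!("dropping {} bytes", self.0.len());
    }
}

fn main() {
    let t = Big(vec![0u8; 1_000_000]);
    // Shadowing: the new value is built from a borrow of the old one, so the
    // original 1 MB buffer is NOT dropped here...
    let t = Big(t.0.iter().map(|b| b.wrapping_add(1)).collect());
    println!("working with {} bytes", t.0.len());
    // ...both buffers stay alive until the end of this scope, where they are
    // dropped in reverse declaration order (two "dropping" lines print here).
}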