rustformers/llm

Good ideas from llama.cpp

setzer22 opened this issue · 11 comments

I've been tracking the llama.cpp repo. I'll use this issue to list any good ideas / things we should be aware of to keep up with in Rust land:

  • GPTQ quantization 👀 ggerganov/llama.cpp#9
  • Not sure how that is even possible (isn't the task I/O bound?), but people are claiming great speedups when loading the model in parallel. This should be pretty easy to implement using rayon. ggerganov/llama.cpp#85 (comment)
  • Seems there's an issue with the normalization function used. It should be RMSNorm. Would be good to keep an eye on this, and simply swap the ggml function once it's implemented on the C++ side 👀 ggerganov/llama.cpp#173 (comment)
  • It looks like dropping to F16 for the memory_k and memory_v reduces memory usage. It is not known whether this hurts quality, but we should follow the C++ side and add a flag to drop to F16 for the memory. This would also make the cached prompts added as part of #14 take half the size on disk, which is a nice bonus: ggerganov/llama.cpp#154 (review)
  • Looks like the fix from #1 just landed upstream. We should make sure to fix it here too ggerganov/llama.cpp#161
  • The tokenizer used in llama.cpp has some issues. It would be better to use sentencepiece, which is the one that was used during the original LLaMA training. There seems to be a rust crate for sentencepiece. We should check if a drop-in replacement is possible ggerganov/llama.cpp#167
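For reference on the normalization item above, here is a minimal sketch of what RMSNorm computes, written in plain Rust rather than against ggml. The function name `rms_norm`, the `weight` gain vector, and the `eps` parameter are illustrative assumptions, not the ggml or llama.cpp API. Unlike LayerNorm, RMSNorm does not subtract the mean; it only rescales by the root mean square of the input.

```rust
/// Illustrative RMSNorm over a single vector (hypothetical helper, not the ggml API):
///   y_i = x_i / sqrt(mean(x^2) + eps) * weight_i
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert!(!x.is_empty(), "input must not be empty");
    assert_eq!(x.len(), weight.len(), "x and weight must have the same length");

    // Mean of the squared activations. No mean subtraction happens here,
    // which is the difference from LayerNorm.
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;

    // Reciprocal of the root mean square, with eps for numerical stability.
    let scale = 1.0 / (mean_sq + eps).sqrt();

    // Rescale each element and apply the per-element gain.
    x.iter().zip(weight).map(|(v, w)| v * scale * w).collect()
}
```

With a gain of all ones and a small `eps`, the output has a mean square of approximately 1, which is a quick sanity check when comparing against another implementation.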

Suggest pinning this issue :>

For the tokenizer item I suggest using https://github.com/huggingface/tokenizers/

Should work out of the box once converted. When this PR lands (huggingface/transformers#21955), it should become a simple let tokenizer = Tokenizer::from_file("filename"). Cheers!

RMS norm landed, but they've reported regressions. Need to keep an eye on that.

@Narsil LlamaTokenizer needs a byte fallback option. 🥹

Good news, everyone!

huggingface/tokenizers#1183

(If this goes, I'll try to make a release soon after)

Awesome! Looking forward to it :D

A small comment on the parallel loading: it is definitely possible to improve I/O reads by parallelizing. This is much more effective on SSDs but still works on HDDs due to caching at different layers. However, this should be configurable, since performance can start to degrade beyond a certain degree of parallelism, depending on the storage medium and on things like the kernel and buffer sizes.
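To make the "configurable parallelism" point concrete, here is a rough sketch using only the standard library (scoped threads) rather than rayon. It is not how llama.cpp or this crate loads models; the function name `read_file_parallel` and the `num_threads` parameter are hypothetical. Each worker opens its own file handle, seeks to its chunk, and fills a disjoint slice of the output buffer.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;
use std::thread;

/// Reads the whole file at `path` using up to `num_threads` reader threads.
/// Hypothetical helper for illustration; the thread count is the tunable knob.
fn read_file_parallel(path: &Path, num_threads: usize) -> io::Result<Vec<u8>> {
    let len = std::fs::metadata(path)?.len() as usize;
    let mut buf = vec![0u8; len];
    if len == 0 {
        return Ok(buf);
    }

    // Treat 0 as 1, and round the chunk size up so that we get
    // at most `num_threads` chunks.
    let num_threads = num_threads.max(1);
    let chunk_len = (len + num_threads - 1) / num_threads;

    thread::scope(|s| -> io::Result<()> {
        // Spawn one reader per disjoint chunk of the output buffer.
        let handles: Vec<_> = buf
            .chunks_mut(chunk_len)
            .enumerate()
            .map(|(i, chunk)| {
                s.spawn(move || -> io::Result<()> {
                    // Each thread uses its own handle so seeks do not interfere.
                    let mut file = File::open(path)?;
                    file.seek(SeekFrom::Start((i * chunk_len) as u64))?;
                    file.read_exact(chunk)
                })
            })
            .collect();

        // Propagate the first I/O error, if any.
        for handle in handles {
            handle.join().expect("reader thread panicked")?;
        }
        Ok(())
    })?;

    Ok(buf)
}
```

Whether this beats a single sequential read depends on the storage medium and OS caching, so any speedup would need to be measured per setup with different `num_threads` values.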

@dnlmlr Do you have a benchmark to back that up? I didn't find that to be the case whenever I tried.

Memory-mapping was consistently better than reading a file (provided you need the whole file), and it doesn't require parallelism (at user level, that is; no idea how the kernel handles it).

@setzer22 Are you okay with me closing this issue and splitting it into individual issues?

Yup, sounds good 👍

This issue has been superseded by #35, #62, #78, #79 and #80.