Implementing Code Llama 7B
flavienbwk opened this issue · 15 comments
Please describe the feature you want
Code Llama, released yesterday by Meta, claims better performance than GPT-3.5 for code generation.
I saw the following project: https://huggingface.co/TabbyML/CodeLlama-7B
When is it scheduled to be released?
Thanks a lot to the TabbyML team.
I have an under-development version hosted at https://huggingface.co/TabbyML/CodeLlama-7B. However, we are still working on implementing the tokenization of stop words for line breaks.
I will keep you updated on our progress regarding this issue.
Once #371 is merged and released in the daily Docker build, TabbyML/CodeLlama-7B should work as intended.
Please note that this model is significantly larger (7B) than our current recommendation for a T4 GPU, such as SantaCoder-1B.
@wsxiaoys Would you have an idea of how much VRAM this model consumes? Thank you.
By default, Tabby operates in int8 mode with CUDA, requiring approximately 8GB of VRAM for CodeLlama-7B.
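That figure can be sanity-checked with a back-of-the-envelope estimate: roughly one byte per parameter in int8, plus some allowance for activations and the KV cache. The constants here are my own rough assumptions, not Tabby internals:

```python
def vram_estimate_gb(n_params_billion, bytes_per_param, overhead_gb=1.0):
    """Rough VRAM estimate: weights plus a flat allowance for
    activations / KV cache (the 1 GB overhead is a guess)."""
    return n_params_billion * bytes_per_param + overhead_gb

vram_estimate_gb(7, 1)  # int8:    7 * 1 + 1 = ~8 GB
vram_estimate_gb(7, 4)  # float32: 7 * 4 + 1 = ~29 GB
```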
Will there be further support for quantization, like GPTQ, to make even bigger models more usable?
I am surprised, because I've tested on my Nvidia P100 (16GB VRAM) and the container returns:
2023-08-30T14:13:44.171766Z INFO tabby_download: crates/tabby-download/src/lib.rs:66: Start downloading model `TabbyML/CodeLlama-7B`
Downloaded /data/models/TabbyML/CodeLlama-7B/ctranslate2/vocabulary.json
[00:00:00] [##################################################################] 496.94 KiB/496.94 KiB (2.92 MiB/s, 0s)
Downloaded /data/models/TabbyML/CodeLlama-7B/tabby.json
[00:00:00] [############################################################################] 143B/143B (303.65 KiB/s, 0s)
Downloaded /data/models/TabbyML/CodeLlama-7B/tokenizer.json
[00:00:00] [######################################################################] 1.76 MiB/1.76 MiB (5.33 MiB/s, 0s)
Downloaded /data/models/TabbyML/CodeLlama-7B/ctranslate2/config.json
[00:00:00] [############################################################################] 103B/103B (211.54 KiB/s, 0s)
Downloaded /data/models/TabbyML/CodeLlama-7B/ctranslate2/model.bin
[00:00:43] [##################################################################] 12.55 GiB/12.55 GiB (293.80 MiB/s, 0s)
2023-08-30T14:14:29.690394Z INFO tabby::serve: crates/tabby/src/serve/mod.rs:128: Starting server, this might takes a few minutes...
terminate called after throwing an instance of 'std::runtime_error'
what(): CUDA failed with error out of memory
With the command:
docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby serve --model TabbyML/CodeLlama-7B --device cuda
Would you have any idea?
For the code completion use case, a rough breaking threshold is around 10 billion parameters to determine whether tensor parallelism (model parallelism) is necessary for reasonable latency. Therefore, it's unlikely that we will invest significant effort in that direction.
As for FAQ use cases, since the latency requirements are considerably more relaxed in this scenario, we are very interested in exploring inference with GPT-Q.
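The latency intuition behind that ~10B threshold can be sketched: single-stream decoding is memory-bandwidth bound, since every weight must be read from VRAM once per generated token, so per-token latency is roughly model size divided by memory bandwidth. The bandwidth figures below are approximate assumptions, not measured numbers:

```python
def decode_latency_ms(n_params_billion, bytes_per_param, bandwidth_gb_s):
    """Rough lower bound on per-token decode latency: all weights are
    streamed from VRAM once per token, so latency ~= model size / bandwidth."""
    model_gb = n_params_billion * bytes_per_param
    return model_gb / bandwidth_gb_s * 1000

# CodeLlama-7B in int8 on a T4 (~320 GB/s, approximate):
decode_latency_ms(7, 1, 320)   # ~22 ms/token
# A hypothetical 30B model in int8 on the same card:
decode_latency_ms(30, 1, 320)  # ~94 ms/token
```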
I've maybe forgotten the `--compute-type` option, but it outputs this error:
# docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby serve --model TabbyML/CodeLlama-7B --device cuda --compute-type int8
2023-08-30T14:27:23.824958Z INFO tabby_download: crates/tabby-download/src/lib.rs:66: Start downloading model `TabbyML/CodeLlama-7B`
2023-08-30T14:27:23.826825Z INFO tabby::serve: crates/tabby/src/serve/mod.rs:128: Starting server, this might takes a few minutes...
terminate called after throwing an instance of 'std::invalid_argument'
what(): Requested int8 compute type, but the target device or backend do not support efficient int8 computation.
Could you share your CUDA setup? Maybe attach the output of `nvidia-smi`?
Here it is:
# nvidia-smi
Wed Aug 30 14:40:14 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:01:00.0 Off | 0 |
| N/A 36C P0 26W / 250W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Is the Pascal architecture too old?
Yes - int8 precision requires CUDA compute capability >= 7.0 (or 6.1), while the P100 has compute capability 6.0 and thus can only use float32 inference.
I added an FAQ section on the website to further elaborate on this: https://tabbyml.github.io/tabby/docs/faq
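The rule above can be expressed as a tiny check (the 6.1 special case is my reading of the answer; it is a hypothetical helper, not Tabby code):

```python
def supports_int8(major, minor):
    """Efficient int8 needs CUDA compute capability >= 7.0,
    with 6.1 as the stated exception; 6.0 (P100) falls back to float32."""
    return (major, minor) >= (7, 0) or (major, minor) == (6, 1)

supports_int8(6, 0)  # P100     -> False (float32 only)
supports_int8(8, 6)  # RTX 3070 -> True
```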
Very clear, thank you.
Confirmed working on an RTX 3070 with 6849MiB / 8192MiB of VRAM.
As far as I know, there is still something that can be done to decrease latency on larger models, with some tricks to overcome the VRAM bandwidth bottleneck and increase GPU utilization.
For example, the assisted generation trick discussed in this blog: https://huggingface.co/blog/assisted-generation
It achieved a ~2x speedup in single-stream generation. Such a speedup could be even more substantial on VRAM-bandwidth-starved gaming GPUs (the RTX cards). In this way, running a ~15B or even ~30B SOTA model with reasonable latency might be achievable.
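For context, the draft-and-verify idea behind assisted generation can be sketched with toy stand-in models (a greedy variant; `target` and `draft` here are dummy callables taking a token list and returning the next token, not real LLMs):

```python
def speculative_decode(target, draft, prompt, k=4, n_tokens=16):
    """Toy greedy speculative decoding: the cheap draft model proposes k
    tokens; the expensive target model checks them and keeps the longest
    agreeing prefix, plus one corrected token on the first mismatch."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft phase: propose k tokens autoregressively with the cheap model.
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Verify phase: the target checks each proposed token in turn.
        accepted, correction = 0, None
        for i, t in enumerate(proposal):
            expect = target(out + proposal[:i])
            if expect == t:
                accepted += 1
            else:
                correction = expect  # target's own token replaces the miss
                break
        out.extend(proposal[:accepted])
        if correction is not None:
            out.append(correction)
    return out[len(prompt):][:n_tokens]

# Toy example: both models count upward, so every draft token is accepted.
count_up = lambda ctx: (ctx[-1] + 1) % 10
speculative_decode(count_up, count_up, [0], k=4, n_tokens=5)  # -> [1, 2, 3, 4, 5]
```

The speedup comes from the verify phase: the target model can score all k draft tokens in one batched forward pass, so when the draft is usually right, you pay roughly one expensive pass per k tokens instead of per token.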
https://x.com/ggerganov/status/1694775472658198604?s=46
Might be worth prioritizing support for this, since speculative decoding (assisted generation) gives such a high performance bump...