TabbyML/tabby

Implementing Code Llama 7B

flavienbwk opened this issue Β· 15 comments

Please describe the feature you want

Code Llama, released yesterday by Meta, claims better performance than GPT-3.5 for code generation.

I saw the following project: https://huggingface.co/TabbyML/CodeLlama-7B

When is it scheduled to be released?

Thanks a lot to the TabbyML team.

I have an under-development version hosted at https://huggingface.co/TabbyML/CodeLlama-7B. However, we are still working on implementing the tokenization of stop words for line breaks.

I will keep you updated on our progress regarding this issue.

Once #371 is merged and released in the daily docker build, TabbyML/CodeLlama-7B shall work as intended.

Please note that this model (7B) is significantly larger than our current recommendation for a T4 GPU, such as SantaCoder-1B.
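
For reference, once that build is out, serving it should look much like the invocation used later in this thread; a sketch, assuming the daily docker image and the default data directory:

# Sketch: serve CodeLlama-7B from the daily build (expect roughly 8GB of VRAM in the default int8 CUDA mode).
docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby serve --model TabbyML/CodeLlama-7B --device cuda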

@wsxiaoys Would you have an idea of how much VRAM this model consumes? Thank you.

By default, Tabby operates in int8 mode with CUDA, requiring approximately 8GB of VRAM for CodeLlama-7B.
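
If you want to pin the quantization explicitly rather than rely on the default, the serve command also accepts a --compute-type flag (used later in this thread); a sketch, with a rough back-of-envelope for the memory figure:

# Back-of-envelope estimate (not an official figure): 7B params x 1 byte (int8) ~= 7 GiB of weights,
# plus activations and KV cache, which lands near the ~8GB quoted above.
# Note: efficient int8 needs a sufficiently recent GPU architecture (see the discussion below).
docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby serve --model TabbyML/CodeLlama-7B --device cuda --compute-type int8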

Will there be further support for quantisation, like GPTQ, to make even bigger models more usable?

I am surprised, because I've tested it on my Nvidia P100 (16GB VRAM) and the container returns:

2023-08-30T14:13:44.171766Z  INFO tabby_download: crates/tabby-download/src/lib.rs:66: Start downloading model `TabbyML/CodeLlama-7B`
Downloaded /data/models/TabbyML/CodeLlama-7B/ctranslate2/vocabulary.json
  [00:00:00] [##################################################################] 496.94 KiB/496.94 KiB (2.92 MiB/s, 0s)
Downloaded /data/models/TabbyML/CodeLlama-7B/tabby.json
  [00:00:00] [############################################################################] 143B/143B (303.65 KiB/s, 0s)
Downloaded /data/models/TabbyML/CodeLlama-7B/tokenizer.json
  [00:00:00] [######################################################################] 1.76 MiB/1.76 MiB (5.33 MiB/s, 0s)
Downloaded /data/models/TabbyML/CodeLlama-7B/ctranslate2/config.json
  [00:00:00] [############################################################################] 103B/103B (211.54 KiB/s, 0s)
Downloaded /data/models/TabbyML/CodeLlama-7B/ctranslate2/model.bin
  [00:00:43] [##################################################################] 12.55 GiB/12.55 GiB (293.80 MiB/s, 0s)
2023-08-30T14:14:29.690394Z  INFO tabby::serve: crates/tabby/src/serve/mod.rs:128: Starting server, this might takes a few minutes...
terminate called after throwing an instance of 'std::runtime_error'
  what():  CUDA failed with error out of memory

With the command:

docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby serve --model TabbyML/CodeLlama-7B --device cuda

Would you have any idea?

For the code completion use case, a rough threshold for whether tensor parallelism (model parallelism) becomes necessary for reasonable latency is around 10 billion parameters. Therefore, it's unlikely that we will invest significant effort in that direction.

As for FAQ use cases, since the latency requirements are considerably more relaxed in that scenario, we are very interested in exploring inference with GPTQ.

Maybe I've forgotten the --compute-type option, but it outputs this error:

# docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby serve --model TabbyML/CodeLlama-7B --device cuda --compute-type int8
2023-08-30T14:27:23.824958Z  INFO tabby_download: crates/tabby-download/src/lib.rs:66: Start downloading model `TabbyML/CodeLlama-7B`
2023-08-30T14:27:23.826825Z  INFO tabby::serve: crates/tabby/src/serve/mod.rs:128: Starting server, this might takes a few minutes...
terminate called after throwing an instance of 'std::invalid_argument'
  what():  Requested int8 compute type, but the target device or backend do not support efficient int8 computation.

Could you share your CUDA setup? Maybe attach the output of nvidia-smi?

Here it is:

# nvidia-smi
Wed Aug 30 14:40:14 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   36C    P0    26W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Is the Pascal architecture too old?

Yes - int8 precision requires CUDA compute capability >= 7.0 or 6.1, while the P100 has compute capability 6.0 and can therefore only use float32 inference. In float32, the 7B weights alone take roughly 28 GB, which is why the model runs out of memory even on a 16 GB card.
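
If you want to check this up front, recent nvidia-smi builds can report the compute capability directly; a quick sketch (the query field requires a reasonably recent driver):

# Print the GPU name and its CUDA compute capability.
nvidia-smi --query-gpu=name,compute_cap --format=csv
# A P100 reports 6.0, which is below the 6.1 / 7.0 needed for efficient int8.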

I added an FAQ section on the website to elaborate on this further: https://tabbyml.github.io/tabby/docs/faq

Very clear, thank you.

Confirmed working on RTX3070 with 6849MiB / 8192MiB of VRAM.
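
For anyone else checking headroom, that usage figure can be captured while the server is running; a small sketch using standard nvidia-smi query fields:

# Watch VRAM usage while Tabby loads and serves the model.
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv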

As far as I know, there are still things that can be done to decrease latency on larger models, with some tricks to overcome the VRAM bandwidth bottleneck and increase GPU utilization.
For example, the assisted generation trick discussed in this blog: https://huggingface.co/blog/assisted-generation
It achieved a ~2x speedup in single-stream generation. Such a speedup could be even more substantial on VRAM-bandwidth-starved gaming GPUs (those RTX cards). In this way, running a ~15B or even ~30B SOTA model with reasonable latency might be achievable 🤔

https://x.com/ggerganov/status/1694775472658198604?s=46

It might be worth prioritizing llama.cpp support / integration, since speculative decoding (assisted generation) gives such a high performance bump…
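
For context, that post is about llama.cpp's speculative-decoding example, where a small draft model proposes tokens that the large target model verifies in a single pass. A rough sketch of that kind of invocation (binary name, flags, and model paths are from memory of the example, are only placeholders, and may differ between versions):

# Rough sketch of llama.cpp's "speculative" example: -m is the large target model, -md the small draft model.
# Flag names and model paths are illustrative and may not match the current llama.cpp CLI.
./speculative -m models/codellama-34b.Q4_K_M.gguf -md models/codellama-7b.Q4_K_M.gguf --draft 16 -n 256 -p "// quicksort in C++"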