imoneoi/openchat

Single GPU vs multiple GPU (tensor parallel) suggestion for API Server


Hi there and first of all thanks for this great tool!

I was wondering if you could provide any feedback on using a single RTX 4090 24GB vs. 4x 4060 Ti 16GB.
A 4x 4060 Ti stack would roughly match the 4090's tensor core count while offering 64GB of total VRAM instead of 24GB.
I can't tell whether the 4060 Ti stack's memory bandwidth would be a bottleneck compared to a single 4090.

One last thing, will AMD GPUs be supported one day?

Thanks in advance!

If you use 4x 4060 Ti, inference might be slow given the communication overhead between the GPUs (I haven't run a benchmark, this is just based on theory). If your model fits in 24GB, you're better off using the single 4090.
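A quick way to gauge that overhead on your own machine (assuming an NVIDIA setup with the driver installed) is to print the interconnect topology; 4060 Ti cards have no NVLink, so every pair will communicate over PCIe:

# Show how each GPU pair is connected (NVLink vs. PCIe paths such as PIX/PHB/SYS)
nvidia-smi topo -m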

If you look at the backend packages, openchat uses PyTorch, vLLM, and Ray, so if you can configure the underlying libraries for AMD GPUs (ROCm) then you should be able to use openchat models with an AMD GPU. Theoretically that's how they are now supported by ollama (thanks to all the hard work done by llama.cpp to support AMD GPUs, since ollama is a wrapper around that library).
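As a quick sanity check that the underlying PyTorch install is actually the ROCm build (a sketch; torch.version.hip is set on ROCm wheels and None on CUDA ones):

# Prints the HIP version on a ROCm build and whether a GPU is visible
python -c "import torch; print(torch.version.hip, torch.cuda.is_available())"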

You can still run openchat models with llama.cpp on an AMD GPU using the following guide:

sudo apt-get update
sudo apt-get upgrade


sudo mkdir --parents --mode=0755 /etc/apt/keyrings

wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | \
    gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
   
# Kernel driver repository for jammy
sudo tee /etc/apt/sources.list.d/amdgpu.list <<'EOF'
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/5.7.1/ubuntu jammy main
EOF
# ROCm repository for jammy
sudo tee /etc/apt/sources.list.d/rocm.list <<'EOF'
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/debian jammy main
EOF
# Prefer packages from the rocm repository over system packages
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' | sudo tee /etc/apt/preferences.d/rocm-pin-600

sudo apt-get update

sudo apt-get install amdgpu-dkms

sudo apt-get install rocm-hip-libraries

sudo reboot

sudo apt-get install rocm-dev
sudo apt-get install rocm-hip-runtime-dev rocm-hip-sdk
sudo apt-get install rocm-libs
sudo apt-get install rocminfo
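
Verify the installation by running rocminfo; it should list your CPU and GPU as HSA agents (the output below is from a Ryzen 5 2600X with an RX 6750 XT):

rocminfo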


ROCk module is loaded
=====================   
HSA System Attributes   
=====================   
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                             
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                 
Agent 1                 
*******                 
  Name:                    AMD Ryzen 5 2600X Six-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 5 2600X Six-Core Processor
  Vendor Name:             CPU                               
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                             
  Node:                    0                                 
  Device Type:             CPU                               
  Cache Info:             
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3600                               
  BDFID:                   0                                 
  Internal Node ID:        0                                 
  Compute Unit:            12                                 
  SIMDs per CU:            0                                 
  Shader Engines:          0                                 
  Shader Arrs. per Eng.:   0                                 
  WatchPts on Addr. Ranges:1                                 
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED       
      Size:                    32792028(0x1f45ddc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                               
      Alloc Alignment:         4KB                               
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32792028(0x1f45ddc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                               
      Alloc Alignment:         4KB                               
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED     
      Size:                    32792028(0x1f45ddc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                               
      Alloc Alignment:         4KB                               
      Accessible by all:       TRUE                               
  ISA Info:               
*******                 
Agent 2                 
*******                 
  Name:                    gfx1031                           
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6750 XT             
  Vendor Name:             AMD                               
  Feature:                 KERNEL_DISPATCH                   
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                         
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                   
  Queue Type:              MULTI                             
  Node:                    1                                 
  Device Type:             GPU                               
  Cache Info:             
    L1:                      16(0x10) KB                       
    L2:                      3072(0xc00) KB                     
    L3:                      98304(0x18000) KB                 
  Chip ID:                 29663(0x73df)                     
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2880                               
  BDFID:                   1536                               
  Internal Node ID:        1                                 
  Compute Unit:            40                                 
  SIMDs per CU:            2                                 
  Shader Engines:          2                                 
  Shader Arrs. per Eng.:   2                                 
  WatchPts on Addr. Ranges:4                                 
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                       
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                       
    y                        1024(0x400)                       
    z                        1024(0x400)                       
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                       
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 115                               
  SDMA engine uCode::      80                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED     
      Size:                    12566528(0xbfc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                               
      Alloc Alignment:         4KB                               
      Accessible by all:       FALSE                             
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    12566528(0xbfc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                               
      Alloc Alignment:         4KB                               
      Accessible by all:       FALSE                             
    Pool 3                   
      Segment:                 GROUP                             
      Size:                    64(0x40) KB                       
      Allocatable:             FALSE                             
      Alloc Granule:           0KB                               
      Alloc Alignment:         0KB                               
      Accessible by all:       FALSE                             
  ISA Info:               
    ISA 1                   
      Name:                    amdgcn-amd-amdhsa--gfx1031         
      Machine Models:          HSA_MACHINE_MODEL_LARGE           
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                       
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                       
        y                        1024(0x400)                       
        z                        1024(0x400)                       
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***

# Add your user to the render group so the GPU can be used without root
# (log out and back in for the change to take effect)
sudo usermod -a -G render yourusername

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with ROCm/HIP support (the make flag is LLAMA_HIPBLAS;
# HIP_VISIBLE_DEVICES only matters at run time, not build time)
make clean && LLAMA_HIPBLAS=1 make -j

# The RX 6750 XT is gfx1031, which has no prebuilt ROCm kernels; override to gfx1030
export HSA_OVERRIDE_GFX_VERSION=10.3.0

# Download a GGUF model file from Hugging Face into the models directory,
# e.g. (adjust the quantization to your needs; Q4_K_M is just an example):
# https://huggingface.co/TheBloke/openchat-3.5-0106-GGUF
wget -P models https://huggingface.co/TheBloke/openchat-3.5-0106-GGUF/resolve/main/openchat-3.5-0106.Q4_K_M.gguf

# After the render group change, sudo is not needed (and would drop the exported variables);
# point -m at the GGUF file you downloaded
export HSA_OVERRIDE_GFX_VERSION=10.3.0 && export HIP_VISIBLE_DEVICES=1 && ./main -ngl 50 -c 8000 -m models/openchat-3.5-0106.Q4_K_M.gguf -p "What are large language models explain with examples and write a sample script using llama.cpp to run openchat model for inference?"

The output of the above is:

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 8000, n_batch = 512, n_predict = -1, n_keep = 0

Large Language Models

Large language models (LLMs) are artificial intelligence models that have been trained to understand and generate human-like text. They are called "large" because they typically consist of millions or even billions of parameters, which enable them to learn complex patterns and generate more accurate and coherent responses. Examples of popular LLMs include OpenAI's GPT-3, GPT-4, and Google's BERT.

Components of an LLM

  1. Input Layer: This is where the model receives the input text, which is usually tokenized and converted into numerical representations (e.g., word embeddings).
  2. Encoder: The encoder processes the input tokens and captures the contextual information in the text. For example, it might determine the relationships between words in a sentence or identify the roles of words in a phrase.
  3. Decoder: The decoder generates the output text based on the contextual information captured by the encoder. It predicts one token at a time, using the previously generated tokens as additional input.
  4. Output Layer: This is where the model produces the final output text, typically in the form of a sequence of tokens.

Sample Script using llama.cpp

Please note that llama.cpp is not a real library, but I'll provide you with a general outline to help you understand how to load and run a model for inference.

First, you'll need to include the necessary libraries and declare the required functions:

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <map>

// Declare the functions
std::map<std::string, std::string> parse_yaml(const std::string& file_path);
std::vector<std::string> tokenize(const std::string& text);
std::string generate_response(const std::vector<std::string>& tokens);

Next, you can define the main function to load the model and perform inference:

int main() {
    // Load the model parameters from a YAML file
    std::map<std::string, std::string> model_params = parse_yaml("model_params.yaml");

    // Load the OpenChat model
    // Note: You'll need to replace this with the actual code to load your model
    std::string model_path = model_params["model_path"];
    std::string model = load_model(model_path);

    // Get the input text from the user
    std::string input_text;
    std::cout << "Enter your text: ";
    std::getline(std::cin, input_text);

    // Tokenize the input text
    std::vector<std::string> tokens = tokenize(input_text);

    // Generate a response using the OpenChat model
    std::string response = generate_response(tokens);

    // Print the response
    std::cout << "Response: " << response << std::endl;

    return 0;
}

In this example, the parse_yaml, tokenize, and generate_response functions are placeholders. You'll need to replace them with the appropriate code to parse a YAML file, tokenize the input text, and generate a response using the OpenChat model.

Please note that this is just a high-level outline, and you'll need to adapt it to your specific use case and model. If you're working with a specific library or framework, refer to their documentation for detailed instructions on how to load and run a model for inference.

# If you get the following error:
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"

# Try the following:

sudo tee /etc/apt/sources.list.d/rocm.list <<'EOF'
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/debian jammy main
EOF

sudo amdgpu-install --rocmrelease=5.7.0 --usecase=rocm,hip --no-dkms

wow thanks for your detailed reply! Appreciated!

@vikrantrathore Thanks for your detailed answer! BTW, to use the provided openchat server with tensor parallelism over multiple GPUs, you can set the tensor-parallel argument, e.g.

# N is the number of tensor parallel GPUs
python -m ochat.serving.openai_api_server --model openchat/openchat-3.5-0106 --engine-use-ray --worker-use-ray --tensor-parallel-size N
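
Once the server is up, you can query the OpenAI-compatible endpoint, e.g. (assuming the default port 18888; adjust the model name to the conversation template you are serving):

curl http://localhost:18888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openchat_3.5", "messages": [{"role": "user", "content": "Hello!"}]}'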