databrickslabs/dolly

Only one GPU is working hard, while the other GPUs are idle

Valdanitooooo opened this issue · 1 comment

  • I am using EC2 p3.8xlarge:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   38C    P0    64W / 300W |  15741MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   36C    P0    63W / 300W |    725MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   39C    P0    69W / 300W |    725MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   38C    P0    65W / 300W |    725MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     32377      C   python                          15738MiB |
|    1   N/A  N/A     32377      C   python                            722MiB |
|    2   N/A  N/A     32377      C   python                            722MiB |
|    3   N/A  N/A     32377      C   python                            722MiB |
+-----------------------------------------------------------------------------+
  • Here is my code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import pipeline
from langchain.llms import HuggingFacePipeline

def custom_llm():
    tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b", padding_side="left")
    # Let Accelerate decide how to split the model across the available GPUs
    base_model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-7b", device_map="auto", torch_dtype=torch.bfloat16)
    # base_model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-7b", device_map="sequential", torch_dtype=torch.bfloat16)

    pipe = pipeline(
        "text-generation",
        model=base_model,
        tokenizer=tokenizer,
        max_length=128,
        temperature=0.05,
        pad_token_id=tokenizer.eos_token_id,
        top_p=0.95,
        repetition_penalty=1.2
    )
    # Expose the wrapped pipeline as a module-level LangChain LLM
    global local_llm
    local_llm = HuggingFacePipeline(pipeline=pipe)
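
For context, a hypothetical call site (not part of the original report) showing how the pipeline built above would be used; the prompt text is made up:

custom_llm()
print(local_llm("What instance type has four V100 GPUs?"))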

When I increase max_length to 512, I get an OOM (out of memory) error.
Any advice?

You've set device_map="auto". Look at how it has assigned the layers by printing base_model.hf_device_map. Did it assign layers to all GPUs? From your output, it looks like almost everything was loaded onto the first GPU. Try device_map="balanced", or a custom assignment. This is a Hugging Face question, not really about this model.
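
A minimal sketch of what that could look like. The device_map="balanced" choice and the max_memory caps below are assumptions, not something from this thread; adjust them to your hardware:

import torch
from transformers import AutoModelForCausalLM

# Load as before and inspect where Accelerate placed each layer
base_model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-7b", device_map="auto", torch_dtype=torch.bfloat16
)
print(base_model.hf_device_map)  # e.g. {"gpt_neox.embed_in": 0, "gpt_neox.layers.0": 0, ...}

# If most layers ended up on GPU 0, reload with an even split across all four V100s.
# max_memory caps each card below its 16 GiB so generation has headroom (assumed values).
base_model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-7b",
    device_map="balanced",
    torch_dtype=torch.bfloat16,
    max_memory={0: "14GiB", 1: "14GiB", 2: "14GiB", 3: "14GiB"},
)
print(base_model.hf_device_map)  # should now list layers spread over GPUs 0-3

You can also pass an explicit dict (module name to GPU index) as device_map if you want full control over the assignment.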