Only one GPU is working hard, while the other GPUs are idle
Valdanitooooo opened this issue · 1 comment
Valdanitooooo commented
- I am using an EC2 p3.8xlarge instance:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1B.0 Off | 0 |
| N/A 38C P0 64W / 300W | 15741MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 36C P0 63W / 300W | 725MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:00:1D.0 Off | 0 |
| N/A 39C P0 69W / 300W | 725MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 38C P0 65W / 300W | 725MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 32377 C python 15738MiB |
| 1 N/A N/A 32377 C python 722MiB |
| 2 N/A N/A 32377 C python 722MiB |
| 3 N/A N/A 32377 C python 722MiB |
+-----------------------------------------------------------------------------+
- Here is my code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import pipeline
from langchain.llms import HuggingFacePipeline

def custom_llm():
    tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b", padding_side="left")
    # Load the model sharded across the available GPUs
    base_model = AutoModelForCausalLM.from_pretrained(
        "databricks/dolly-v2-7b", device_map="auto", torch_dtype=torch.bfloat16
    )
    # base_model = AutoModelForCausalLM.from_pretrained(
    #     "databricks/dolly-v2-7b", device_map="sequential", torch_dtype=torch.bfloat16
    # )
    pipe = pipeline(
        "text-generation",
        model=base_model,
        tokenizer=tokenizer,
        max_length=128,
        temperature=0.05,
        pad_token_id=tokenizer.eos_token_id,
        top_p=0.95,
        repetition_penalty=1.2,
    )
    global local_llm
    local_llm = HuggingFacePipeline(pipeline=pipe)
When I increase max_length to 512, I get an OOM error.
Any advice?
srowen commented
You've set device_map="auto". Look at how it has assigned the layers by inspecting base_model.hf_device_map. Did it assign layers to all GPUs? From your output, it looks like almost everything was loaded onto the first GPU. Try "balanced", or a custom assignment. This is an HF question, not really about this model.
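
For reference, a minimal sketch of both suggestions, assuming the same dolly-v2-7b checkpoint on four 16 GiB V100s (the max_memory caps below are illustrative, not tuned):

import torch
from transformers import AutoModelForCausalLM

# With device_map="auto" you can first inspect where accelerate placed each layer:
#   print(base_model.hf_device_map)
# If almost everything sits on GPU 0, request a balanced placement instead and
# cap how much memory accelerate may use on each GPU (illustrative values).
base_model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-7b",
    device_map="balanced",
    torch_dtype=torch.bfloat16,
    max_memory={0: "14GiB", 1: "14GiB", 2: "14GiB", 3: "14GiB"},
)
print(base_model.hf_device_map)  # should now show layers spread across GPUs 0-3

Leaving some headroom on each GPU (14 GiB of 16 GiB here) also gives activations and the KV cache room to grow when max_length is raised, which is where the OOM at 512 tokens likely comes from.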