[Question]: Qwen inference runs out of GPU memory — how do I set up multi-GPU inference?
zhaogf01 opened this issue · 4 comments
zhaogf01 commented
Please describe your question
This is my inference code. How can I run it across multiple GPUs?
from paddlenlp.transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("qwen/qwen-7b")
model = AutoModelForCausalLM.from_pretrained("qwen/qwen-7b", dtype="float32")
input_features = tokenizer("hello", return_tensors="pd")
outputs = model.generate(**input_features, max_length=128)
tokenizer.batch_decode(outputs[0])
w5688414 commented
You can enable recompute and flash attention to reduce memory usage. For multi-GPU inference, some parameter changes are also needed; you can refer to:
Then run:
python -m paddle.distributed.launch --gpus "0,1,2,3" your_script.py
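When launched this way, each worker process needs to know its own rank and the total number of workers so it can load its shard of the model. A minimal sketch of that bookkeeping, assuming the standard environment variables (`PADDLE_TRAINER_ID`, `PADDLE_TRAINERS_NUM`) that Paddle's launcher exports; the `from_pretrained` tensor-parallel arguments shown in the comment are an assumption about the PaddleNLP version in use, so check your installed API before relying on them:

```python
import os

def tensor_parallel_settings(env=None):
    """Read this worker's shard index and the total shard count from the
    environment variables set by `paddle.distributed.launch` (assumed names;
    defaults to a single-process setup when they are absent)."""
    if env is None:
        env = os.environ
    rank = int(env.get("PADDLE_TRAINER_ID", "0"))
    degree = int(env.get("PADDLE_TRAINERS_NUM", "1"))
    return rank, degree

rank, degree = tensor_parallel_settings()

# Hypothetical usage inside your_script.py (requires a PaddleNLP build whose
# `from_pretrained` accepts tensor-parallel arguments -- verify first):
# model = AutoModelForCausalLM.from_pretrained(
#     "qwen/qwen-7b",
#     dtype="float16",                 # float16 also halves memory vs. float32
#     tensor_parallel_degree=degree,   # number of GPUs sharding the weights
#     tensor_parallel_rank=rank,       # this process's shard index
# )
```

With `--gpus "0,1,2,3"`, each of the four processes would see a different rank (0 through 3) and the same degree (4), so each loads one quarter of the weights.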
zhaogf01 commented
w5688414 commented
You can refer to the following:
PaddleNLP/llm/glm/predict_generation.py
Line 70 in 1ffa290
zhaogf01 commented
Got it, thanks.