Issues
Failed to run Domino example
#940 opened by lucifer1004 - 4
How can I change the master_port when using deepspeed for multi-GPU on single node, i.e. localhost
#936 opened by lovedoubledan - 1
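For reference, the `deepspeed` launcher accepts a `--master_port` flag; a minimal single-node invocation (the script name and config file below are placeholders) looks like:

```shell
# Run on localhost with a non-default rendezvous port to avoid clashes
# between concurrent single-node jobs (29500 is the launcher's default).
deepspeed --master_port 29501 train.py --deepspeed ds_config.json
```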
RuntimeError: CUDA error: no kernel image is available for execution on the device
#935 opened by mrpeerat - 6
No module named 'transformers.deepspeed'
#934 opened by TianyuJIAA - 0
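This import path moved in newer `transformers` releases: the DeepSpeed integration now lives under `transformers.integrations.deepspeed`. A compatibility shim, assuming `HfDeepSpeedConfig` is the symbol being imported, could look like:

```python
import importlib


def import_hf_deepspeed_config():
    """Try the new module path first, then fall back to the legacy one."""
    for mod_name in (
        "transformers.integrations.deepspeed",  # newer transformers releases
        "transformers.deepspeed",               # older releases
    ):
        try:
            return getattr(importlib.import_module(mod_name), "HfDeepSpeedConfig")
        except (ImportError, AttributeError):
            continue
    raise ImportError("HfDeepSpeedConfig not found; is transformers installed?")
```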
After completing steps 1, 2, and 3, the test replies contain only `Assistant: </s>`.
#928 opened by jianmomo - 0
How to calculate training efficiency, i.e. tokens/sec, for step-1 fine-tuning of the Llama 2 model?
#923 opened by sowmya04101998 - 1
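As a rough sketch (the function and variable names below are illustrative, not from DeepSpeed-Chat itself), tokens/sec is usually derived from batch geometry and the per-step wall time:

```python
def tokens_per_second(micro_batch_size: int, seq_len: int,
                      grad_accum_steps: int, world_size: int,
                      step_time_s: float) -> float:
    """Tokens processed per second across all GPUs for one optimizer step."""
    tokens_per_step = micro_batch_size * seq_len * grad_accum_steps * world_size
    return tokens_per_step / step_time_s


# e.g. 4 sequences of 512 tokens, 8 accumulation steps, 8 GPUs, 2 s per step
print(tokens_per_second(4, 512, 8, 8, 2.0))  # 65536.0
```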
Actor loss nan and Resizing model embedding
#922 opened by ouyanmei - 0
How to start deepspeed automatically?
#910 opened by qwerfdsadad - 3
ZeRO-3 with the hybrid engine enabled does not work for Llama 2; how can this be solved?
#864 opened by terence1023 - 1
The actor constantly generates ['</s>'] or ['<|endoftext|></s>'] after 200 steps in RLHF with hybrid engine disabled
#887 opened by mousewu - 0
Step 2 produces no output for a long time
#915 opened by asfadfaf - 2
Question about the first phase (step 1)
#909 opened by csxrzhang - 1
Single-node multi-GPU RLHF: error in step 3 when using a Qwen model as the Actor Model
#907 opened by Dakai798 - 11
run-example.sh fails with urllib3.exceptions.ProtocolError: Response ended prematurely
#896 opened by awan-10 - 0
How to compute training memory usage under different ZeRO stages
#912 opened by Arcmoon-Hu - 1
nvcc fatal : Unsupported gpu architecture 'compute_86' and nvcc fatal : Value 'c++17' is not defined for option 'std'
#911 opened by Xccanxin - 0
DeepSpeed-Chat step-1 hanging for a long time
#906 opened by lemon-little - 0
CPU OOM when inferencing Llama3-70B-Chinese-Chat
#904 opened by GORGEOUSLCX - 1
Confusion about Deepspeed Inference
#879 opened by ZekaiGalaxy - 0
cannot pickle 'Stream' object
#903 opened by teis-e - 0
Does FastGen support long-context and sequence-parallel inference?
#901 opened by AceCoder0 - 0
[Error] AutoTune: `connect to host localhost port 22: Connection refused`
#894 opened by wqw547243068 - 0
Does Zero-Inference support TP?
#892 opened by preminstrel - 1
Does DeepSpeed support fine-tuning an additional model with LoRA?
#890 opened by wanghongqu - 0
About multi-threaded attention computation on CPU in the zero-inference example
#886 opened by luckyq - 0
Suggested GPU to run the demo code of step2_reward_model_finetuning (DeepSpeed-Chat)
#885 opened by wenbozhangjs - 0
RLHF problems when using Qwen model
#861 opened by 128Ghe980 - 1
The reward value did not increase.
#883 opened by Sun-Shiqi - 0
`AttributeError: readonly attribute` while trying to run training/HelloDeepSpeed
#878 opened by htjain - 2
[BUG in Stable Diffusion inference] There's an error on CUDAGraph when using deepspeed inference. How to fix it?
#866 opened by foin6 - 3
Codellama finetune
#860 opened by nani1149 - 0
Throughput should be `num_queries/latency` as opposed to `num_clients/latency`?
#858 opened by goelayu - 1
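The distinction this issue raises can be sketched in a few lines (names are illustrative): throughput measures completed work per unit time, so the numerator should be the number of finished queries, not the number of clients:

```python
def throughput_qps(num_queries: int, elapsed_s: float) -> float:
    """Queries completed per second: finished queries over wall-clock time."""
    return num_queries / elapsed_s


# 16 clients each issuing 10 queries over 20 s of wall time:
num_clients, queries_per_client, elapsed = 16, 10, 20.0
print(throughput_qps(num_clients * queries_per_client, elapsed))  # 8.0
# Dividing num_clients by elapsed instead would report 0.8, off by 10x here.
```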
Inaccurate FLOPs results after several rounds
#855 opened by BitCalSaul - 0
How to resume Deepspeed-Chat RLHF step-3 training?
#850 opened by DespairL - 0
remove redundant code
#852 opened by ilml - 0
Question: Why not padding to the same sequence length within the batch during the sft training phase?
#849 opened by LKLKyy - 0
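For context, the alternative this question alludes to, padding each batch only to its own longest sequence (dynamic padding), can be sketched as follows (illustrative code, not from DeepSpeed-Chat):

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch only,
    rather than to a fixed global maximum length."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]


batch = [[5, 6, 7], [8, 9], [1]]
print(pad_batch(batch))  # [[5, 6, 7], [8, 9, 0], [1, 0, 0]]
```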
running gpt2-xl/test_tune.sh fails - ParquetConfig.__init__() got an unexpected keyword argument 'token'
#847 opened by ccruttjr - 3
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, remote process exited or there was a network error, NCCL version 2.18.6
#845 opened by Rainbowman0 - 1
Step3 hanging for a long time
#842 opened by Jeayea - 0