Issues
[BUG] DeepSpeed-Chat Step3 - actor model repeats generating the same token when hybrid engine enabled
#821 opened by GeekDream-x - 2
ZeRO-3 with hybrid engine enabled is not working for Llama2 — how can this be solved?
#864 opened by terence1023 - 0
CPU OOM when running inference on Llama3-70B-Chinese-Chat
#904 opened by GORGEOUSLCX - 1
Confusion about Deepspeed Inference
#879 opened by ZekaiGalaxy - 0
cannot pickle 'Stream' object
#903 opened by teis-e - 0
Does FastGen support long-context and sequence-parallel inference?
#901 opened by AceCoder0 - 8
run-example.sh fails with urllib3.exceptions.ProtocolError: Response ended prematurely
#896 opened by awan-10 - 0
[Error] AutoTune: `connect to host localhost port 22: Connection refused`
#894 opened by wqw547243068 - 0
Does Zero-Inference support TP?
#892 opened by preminstrel - 1
Does DeepSpeed support fine-tuning an extra model with LoRA?
#890 opened by wanghongqu - 0
The actor constantly generates ['</s>'] or ['<|endoftext|></s>'] after 200 steps in RLHF with hybrid engine disabled
#887 opened by mousewu - 0
About multiple-thread attention computation on CPU using zero-inference example.
#886 opened by luckyq - 0
Suggested GPU to run the demo code of step2_reward_model_finetuning (DeepSpeed-Chat)
#885 opened by wenbozhangjs - 0
RLHF problems when using Qwen model
#861 opened by 128Ghe980 - 1
The reward value did not increase.
#883 opened by Sun-Shiqi - 0
`AttributeError: readonly attribute` while trying to run training/HelloDeepSpeed
#878 opened by htjain - 0
Benchmark mii stalled and crashed
#877 opened by Albert-Zhao-2020 - 2
[BUG in Stable Diffusion inference] There's an error on CUDAGraph when using deepspeed inference. How to fix it?
#866 opened by foin6 - 3
CodeLlama fine-tuning
#860 opened by nani1149 - 0
Throughput should be `num_queries/latency` as opposed to `num_clients/latency`?
#858 opened by goelayu - 1
Inaccurate FLOP results after several rounds
#855 opened by BitCalSaul - 0
How to resume Deepspeed-Chat RLHF step-3 training?
#850 opened by DespairL - 0
remove redundant code
#852 opened by ilml - 2
Why are the shapes of the RM model's weights all 0?
#820 opened by Pattaro - 0
Question: Why not pad to the same sequence length within a batch during the SFT training phase?
#849 opened by LKLKyy - 0
running gpt2-xl/test_tune.sh fails - ParquetConfig.__init__() got an unexpected keyword argument 'token'
#847 opened by ccruttjr - 3
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, remote process exited or there was a network error, NCCL version 2.18.6
#845 opened by Rainbowman0 - 1
async_pipeline is not exposed in the library
#835 opened by yaliqin - 1
Step3 hanging for a long time
#842 opened by Jeayea - 0
Step3 PPO print error when --print_answers is enabled
#836 opened by tonylin52 - 1
Step3 uses the same amount of memory when I increase the number of GPUs
#817 opened by Little-rookie-ee - 0
Mistral and Orca Training
#832 opened by syngokhan - 1
Llama2 as actor using zero_stage3
#814 opened by George-Chia - 1
Error when running e2e_rlhf
#829 opened by Sun-9923 - 0
Does DeepSpeed-Chat support pipeline parallelism?
#825 opened by mollon650 - 0
Should it use global_rank as the condition for shared-disk?
#822 opened by sz128 - 0
DeepSpeed-VisualChat Tensor shape mismatch
#818 opened by Linjiahua - 0
Does the DeepSpeedVisualChat model have the capability to locate targets, such as generating coordinates for bounding box positions?
#816 opened by Watebear - 1
DeepSpeed-Chat Step-1 training error
#813 opened by yifan-bao