Issues
Failed to run Domino example
#940 opened by lucifer1004 - 4
How can I change the master_port when using deepspeed for multi-GPU on single node, i.e. localhost
#936 opened by lovedoubledan - 1
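For reference, the `deepspeed` launcher accepts a `--master_port` flag; a minimal single-node invocation (the script name and config file below are placeholders) looks like:

```shell
# Run on localhost with a non-default rendezvous port to avoid clashes
# between concurrent single-node jobs (29500 is the launcher's default).
deepspeed --master_port 29501 train.py --deepspeed ds_config.json
```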
RuntimeError: CUDA error: no kernel image is available for execution on the device
#935 opened by mrpeerat - 6
No module named 'transformers.deepspeed'
#934 opened by TianyuJIAA - 0
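This import path moved in newer `transformers` releases: the DeepSpeed integration now lives under `transformers.integrations.deepspeed`. A compatibility shim, assuming `HfDeepSpeedConfig` is the symbol being imported, could look like:

```python
import importlib


def import_hf_deepspeed_config():
    """Try the new module path first, then fall back to the legacy one."""
    for mod_name in (
        "transformers.integrations.deepspeed",  # newer transformers releases
        "transformers.deepspeed",               # older releases
    ):
        try:
            return getattr(importlib.import_module(mod_name), "HfDeepSpeedConfig")
        except (ImportError, AttributeError):
            continue
    raise ImportError("HfDeepSpeedConfig not found; is transformers installed?")
```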
After completing steps 1, 2, and 3, the test replies contain only `Assistant: </s>`.
#928 opened by jianmomo - 0
How to calculate training efficiency, i.e. tokens/sec, for step-1 fine-tuning of the Llama 2 model?
#923 opened by sowmya04101998 - 1
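As a rough sketch (the function and variable names below are illustrative, not from DeepSpeed-Chat itself), tokens/sec is usually derived from batch geometry and the per-step wall time:

```python
def tokens_per_second(micro_batch_size: int, seq_len: int,
                      grad_accum_steps: int, world_size: int,
                      step_time_s: float) -> float:
    """Tokens processed per second across all GPUs for one optimizer step."""
    tokens_per_step = micro_batch_size * seq_len * grad_accum_steps * world_size
    return tokens_per_step / step_time_s


# e.g. 4 sequences of 512 tokens, 8 accumulation steps, 8 GPUs, 2 s per step
print(tokens_per_second(4, 512, 8, 8, 2.0))  # 65536.0
```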
Actor loss nan and Resizing model embedding
#922 opened by ouyanmei - 0
How to start deepspeed automatically?
#910 opened by qwerfdsadad - 3
ZeRO-3 with the hybrid engine enabled does not work for Llama 2; how can this be solved?
#864 opened by terence1023 - 1
The actor constantly generates ['</s>'] or ['<|endoftext|></s>'] after 200 steps in RLHF with hybrid engine disabled
#887 opened by mousewu - 0
Step 2 produces no output for a long time
#915 opened by asfadfaf - 2
Question about the first phase (step 1)
#909 opened by csxrzhang - 1
Single-node multi-GPU RLHF: error in step 3 when using a Qwen model as the Actor Model
#907 opened by Dakai798 - 11
run-example.sh fails with urllib3.exceptions.ProtocolError: Response ended prematurely
#896 opened by awan-10 - 0
How to compute training memory usage under different ZeRO stages
#912 opened by Arcmoon-Hu - 1
nvcc fatal : Unsupported gpu architecture 'compute_86' and nvcc fatal : Value 'c++17' is not defined for option 'std'
#911 opened by Xccanxin - 0
DeepSpeed-Chat step-1 hanging for a long time
#906 opened by lemon-little - 0
CPU OOM when inferencing Llama3-70B-Chinese-Chat
#904 opened by GORGEOUSLCX - 1
Confusion about Deepspeed Inference
#879 opened by ZekaiGalaxy - 0
cannot pickle 'Stream' object
#903 opened by teis-e - 0
Does FastGen support long-context and sequence-parallel inference?
#901 opened by AceCoder0 - 0
[Error] AutoTune: `connect to host localhost port 22: Connection refused`
#894 opened by wqw547243068 - 0
Does Zero-Inference support TP?
#892 opened by preminstrel - 1
Does DeepSpeed support fine-tuning an additional model with LoRA?
#890 opened by wanghongqu - 0
About multi-threaded attention computation on CPU in the zero-inference example
#886 opened by luckyq - 0
Suggested GPU to run the demo code of step2_reward_model_finetuning (DeepSpeed-Chat)
#885 opened by wenbozhangjs - 0
RLHF problems when using Qwen model
#861 opened by 128Ghe980 - 1
The reward value did not increase.
#883 opened by Sun-Shiqi - 0
`AttributeError: readonly attribute` while trying to run training/HelloDeepSpeed
#878 opened by htjain - 2
[BUG in Stable Diffusion inference] There's an error on CUDAGraph when using deepspeed inference. How to fix it?
#866 opened by foin6 - 3
Codellama finetune
#860 opened by nani1149 - 0
Throughput should be `num_queries/latency` as opposed to `num_clients/latency`?
#858 opened by goelayu - 1
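The distinction this issue raises can be sketched in a few lines (names are illustrative): throughput measures completed work per unit time, so the numerator should be the number of finished queries, not the number of clients:

```python
def throughput_qps(num_queries: int, elapsed_s: float) -> float:
    """Queries completed per second: finished queries over wall-clock time."""
    return num_queries / elapsed_s


# 16 clients each issuing 10 queries over 20 s of wall time:
num_clients, queries_per_client, elapsed = 16, 10, 20.0
print(throughput_qps(num_clients * queries_per_client, elapsed))  # 8.0
# Dividing num_clients by elapsed instead would report 0.8, off by 10x here.
```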
Inaccurate FLOPs results after several rounds
#855 opened by BitCalSaul - 0
How to resume Deepspeed-Chat RLHF step-3 training?
#850 opened by DespairL - 0
remove redundant code
#852 opened by ilml - 0
Question: Why not padding to the same sequence length within the batch during the sft training phase?
#849 opened by LKLKyy - 0
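For context, the alternative this question alludes to, padding each batch only to its own longest sequence (dynamic padding), can be sketched as follows (illustrative code, not from DeepSpeed-Chat):

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch only,
    rather than to a fixed global maximum length."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]


batch = [[5, 6, 7], [8, 9], [1]]
print(pad_batch(batch))  # [[5, 6, 7], [8, 9, 0], [1, 0, 0]]
```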
running gpt2-xl/test_tune.sh fails - ParquetConfig.__init__() got an unexpected keyword argument 'token'
#847 opened by ccruttjr - 3
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, remote process exited or there was a network error, NCCL version 2.18.6
#845 opened by Rainbowman0 - 1
Step3 hanging for a long time
#842 opened by Jeayea - 0