Issues
[FEATURE]: LoRA/QLoRA in GeminiPlugin and TorchFSDP
#6138 opened by ericxsun - 7
[FEATURE]: Is it possible to integrate Liger-Kernel?
#6047 opened by ericxsun - 0
[BUG]: multi-node backward slowdown
#6133 opened by BurkeHulk - 14
[BUG]: Saving the Llama3.1-70B-Instruct model
#6108 opened by cingtiye - 12
[BUG]: assert grad_chunk.l2_norm is not None
#6102 opened by liangzz1991 - 0
[BUG]: Training gets stuck unexpectedly
#6095 opened by ericxsun - 3
[BUG]: Why does a duplicate PID appear on rank 0?
#6111 opened by ericxsun - 2
[BUG]: ColossalAI Inference example returns an empty result without any error
#6112 opened by GuangyaoZhang - 9
[BUG]: Got NaN during backward with ZeRO-2
#6091 opened by flymin - 3
[PROPOSAL]: FP8 with block-wise amax
#6105 opened by Edenzzzz - 0
[FEATURE]: Windows wheel needed
#6103 opened by nitinmukesh - 1
FasterMoE shadow expert implementation
#6076 opened by Guodanding - 1
[BUG]: Unable to train on H20 machine
#6079 opened by kaixinbear - 1
2024 list of available proxy ("airport") services for the ** region
#6067 opened by swhmy - 16
[DOC]: Environment installation failed
#6066 opened by eccct - 0
[CUDA] FP8 all-reduce using all-to-all and all-gather
#5996 opened by wangbluo - 0
[BUG]: Cannot use ColossalChat
#5986 opened by zawawimanja - 0
[BUG]: remove `.github/workflows/submodule.yml`
#6039 opened by BoxiangW - 0
[FEATURE]: Support Zerobubble pipeline
#6037 opened by duanjunwen - 2
How can two models be trained at the same time?
#6028 opened by wangqiang9 - 4
[BUG]: Hang on startup
#5969 opened by rob-hen - 0
support moe
#5954 opened by flybird11111 - 0
[fp8] support async communication
#5999 opened by flybird11111 - 2
[fp8] support amp
#5974 opened by ver217 - 1
llama3 pretrain TypeError: launch_from_torch() missing 1 required positional argument: 'config'
#5992 opened by wuduher - 0
[fp8] support hybrid parallel plugin
#5972 opened by wangbluo - 0
[Feature]: support FP8 communication in Gemini
#5943 opened by BurkeHulk - 0
[FEATURE]: How to skip strategy generation for a custom node in colossal-auto?
#5983 opened by robotsp - 0
llama fp8 forward/backward
#5955 opened by botbw - 3
[fp8] support low level zero
#5960 opened by ver217 - 0
qwen2 fp8 forward/backward
#5971 opened by wangbluo - 1
[DOC]: Is there an example of LoRA training for Llama3?
#5964 opened by zhurunhua - 0
[FEATURE]: Request updates for pretraining RoBERTa
#5948 opened by jiahuanluo - 1
[BUG]: A directory is created in each epoch
#5937 opened by zhurunhua - 0
[BUG]: _local_rank in DistCoordinator should be int
#5933 opened by flymin - 7
[BUG]: RuntimeError: The param bucket max size 12582912 is exceeded by tensor (size 131334144)
#5935 opened by zhurunhua - 3
[BUG]: UnboundLocalError: cannot access local variable 'default_conversation' where it is not associated with a value
#5930 opened by zhurunhua