Issues
[FEATURE]: LoRA/QLoRA in GeminiPlugin and TorchFSDP
#6138 opened by ericxsun - 7
[FEATURE]: Is it possible to integrate Liger-Kernel?
#6047 opened by ericxsun - 0
[BUG]: multi-node backward slowdown
#6133 opened by BurkeHulk - 14
[BUG]: Saving the Llama3.1-70B-Instruct model
#6108 opened by cingtiye - 12
[BUG]: assert grad_chunk.l2_norm is not None
#6102 opened by liangzz1991 - 0
[BUG]: Training gets stuck unexpectedly
#6095 opened by ericxsun - 3
[BUG]: Why does a duplicate PID appear on rank 0?
#6111 opened by ericxsun - 2
[BUG]: ColossalAI Inference example returns an empty result without any error
#6112 opened by GuangyaoZhang - 9
[BUG]: Got NaN during backward with ZeRO-2
#6091 opened by flymin - 3
[PROPOSAL]: FP8 with block-wise amax
#6105 opened by Edenzzzz - 0
[FEATURE]: Windows wheel needed
#6103 opened by nitinmukesh - 1
FasterMoE shadow expert implementation
#6076 opened by Guodanding - 1
[BUG]: Unable to train on H20 machine
#6079 opened by kaixinbear - 1
2024 list of available proxy ("airport") services for the ** region
#6067 opened by swhmy - 16
[DOC]: Environment installation failed
#6066 opened by eccct - 0
[CUDA] FP8 all-reduce using all-to-all and all-gather
#5996 opened by wangbluo - 0
[BUG]: Cannot use ColossalChat
#5986 opened by zawawimanja - 0
[BUG]: remove `.github/workflows/submodule.yml`
#6039 opened by BoxiangW - 0
[FEATURE]: Support Zerobubble pipeline
#6037 opened by duanjunwen - 2
How can two models be trained at the same time?
#6028 opened by wangqiang9 - 4
[BUG]: Hang on startup
#5969 opened by rob-hen - 0
support moe
#5954 opened by flybird11111 - 0
[fp8] support async communication
#5999 opened by flybird11111 - 2
[fp8] support amp
#5974 opened by ver217 - 1
llama3 pretrain TypeError: launch_from_torch() missing 1 required positional argument: 'config'
#5992 opened by wuduher - 0
[fp8] support hybrid parallel plugin
#5972 opened by wangbluo - 0
[Feature]: support FP8 communication in Gemini
#5943 opened by BurkeHulk - 0
[FEATURE]: How to skip strategy generation for a custom node in colossal-auto?
#5983 opened by robotsp - 0
llama fp8 forward/backward
#5955 opened by botbw - 3
[fp8] support low level zero
#5960 opened by ver217 - 0
qwen2 fp8 forward/backward
#5971 opened by wangbluo - 1
[DOC]: Is there an example of LoRA training for Llama3?
#5964 opened by zhurunhua - 0
[FEATURE]: Request updates for pretraining RoBERTa
#5948 opened by jiahuanluo - 1
[BUG]: A directory is created in each epoch
#5937 opened by zhurunhua - 0
[BUG]: _local_rank in DistCoordinator should be int
#5933 opened by flymin - 7
[BUG]: RuntimeError: The param bucket max size 12582912 is exceeded by tensor (size 131334144)
#5935 opened by zhurunhua - 3
[BUG]: UnboundLocalError: cannot access local variable 'default_conversation' where it is not associated with a value
#5930 opened by zhurunhua