wangbluo
Previously worked at ByteDance and Temu. Master's degree from NUS, bachelor's degree from SYSU. Focused on parallel training for LLMs.
colossalai · Singapore
Pinned Repositories
ColossalAI
Making large AI models cheaper, faster and more accessible
pytorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
BandWidth_Test
Tests the GPU bandwidth of collective operations such as all-reduce, all-gather, broadcast, and all-to-all on single-node multi-GPU (2, 4, 8 cards) and multi-node multi-GPU (16 cards) setups, using only PyTorch and Python built-in packages.
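A minimal sketch of what such a benchmark can look like, assuming torch.distributed with the NCCL backend and a torchrun launch; the tensor size, iteration count, and ring bus-bandwidth formula are illustrative choices, not necessarily those of the repository.

```python
# Hedged sketch, not the repository's code: time all-reduce and report bus bandwidth.
# Launch (assumed): torchrun --nproc_per_node=8 bench.py
import time
import torch
import torch.distributed as dist

def benchmark_all_reduce(num_elems: int = 256 * 1024 * 1024, iters: int = 20) -> None:
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    x = torch.randn(num_elems, dtype=torch.float32, device="cuda")

    # Warm-up so the first NCCL call does not skew the timing.
    for _ in range(5):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    # Ring all-reduce moves 2 * (n - 1) / n of the buffer per GPU ("bus bandwidth").
    size_bytes = num_elems * 4
    bus_bw = size_bytes * 2 * (world_size - 1) / world_size / elapsed / 1e9
    if rank == 0:
        print(f"all_reduce {size_bytes / 1e6:.0f} MB: {elapsed * 1e3:.2f} ms, {bus_bw:.1f} GB/s")
    dist.destroy_process_group()

if __name__ == "__main__":
    benchmark_all_reduce()
```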
Finetune_llama2
A LLaMA fine-tuning script built from scratch using PyTorch and the Transformers API. It supports four optional features: gradient checkpointing, mixed precision, data parallelism, and tensor parallelism, without using the ColossalAI/Megatron/DeepSpeed frameworks (their code may be consulted as a reference).
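A minimal sketch of two of the listed features, gradient checkpointing and fp16 mixed precision, on a Hugging Face causal LM; the checkpoint name, learning rate, and toy batch are placeholders, and this is not the repository's actual script.

```python
# Hedged sketch: gradient checkpointing + mixed precision for a causal LM fine-tune.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
model.gradient_checkpointing_enable()    # trade recomputation for activation memory
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scaler = torch.cuda.amp.GradScaler()     # loss scaling for fp16 mixed precision

batch = tokenizer("Hello, parallel training!", return_tensors="pt").to("cuda")
labels = batch["input_ids"].clone()

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(**batch, labels=labels).loss

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```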
Finetune_llama2_Megatron
Megatron-style tensor-parallel (TP) training.
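A forward-pass-only sketch of the Megatron-style idea, assuming a column-parallel linear whose output stays sharded feeding a row-parallel linear whose partial sums are all-reduced; a full implementation also needs the matching backward-pass communication (Megatron's f/g operators), which is omitted here.

```python
# Hedged sketch of Megatron-style tensor parallelism (forward pass only).
import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    """Each rank holds a slice of the output dimension; outputs stay sharded.
    Note: a complete version must all-reduce the input gradient in backward."""
    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        assert out_features % world_size == 0
        self.linear = nn.Linear(in_features, out_features // world_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)  # shape [..., out_features / world_size]

class RowParallelLinear(nn.Module):
    """Each rank holds a slice of the input dimension; partial sums are all-reduced."""
    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        assert in_features % world_size == 0
        self.linear = nn.Linear(in_features // world_size, out_features, bias=False)

    def forward(self, x_shard: torch.Tensor) -> torch.Tensor:
        partial = self.linear(x_shard)
        dist.all_reduce(partial)  # sum the partial results across TP ranks
        return partial
```

Pairing the two this way means an MLP block (column-parallel up-projection, row-parallel down-projection) needs only one all-reduce in the forward pass.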
Pytorch-profile
Uses the PyTorch profiler API to analyze detailed training information, such as memory usage (heaps), call stacks, and time consumption.
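A minimal usage sketch of the torch.profiler API on a toy model, showing memory tracking, Python stack recording, and a step schedule; the model, step count, and output directory are illustrative, and this is not the repository's code.

```python
# Hedged sketch: profile a few training steps with CPU/CUDA timing, memory, and stacks.
import torch
from torch.profiler import profile, schedule, ProfilerActivity, tensorboard_trace_handler

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    profile_memory=True,   # track allocations ("heaps")
    with_stack=True,       # record Python call stacks
) as prof:
    for _ in range(5):
        x = torch.randn(64, 1024, device="cuda")
        loss = model(x).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()        # advance the profiler schedule each iteration

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```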
wangbluo's Repositories
wangbluo/Finetune_llama2
A LLaMA fine-tuning script built from scratch using PyTorch and the Transformers API. It supports four optional features: gradient checkpointing, mixed precision, data parallelism, and tensor parallelism, without using the ColossalAI/Megatron/DeepSpeed frameworks (their code may be consulted as a reference).
wangbluo/Finetune_llama2_Megatron
Megatron-style tensor-parallel (TP) training.
wangbluo/BandWidth_Test
Tests the GPU bandwidth of collective operations such as all-reduce, all-gather, broadcast, and all-to-all on single-node multi-GPU (2, 4, 8 cards) and multi-node multi-GPU (16 cards) setups, using only PyTorch and Python built-in packages.
wangbluo/ColossalAI
Making large AI models cheaper, faster and more accessible
wangbluo/Pytorch-profile
Uses the PyTorch profiler API to analyze detailed training information, such as memory usage (heaps), call stacks, and time consumption.