Pinned issues
Issues
## :pill: CI failures summary and remediations
#1295 opened by LiLu2312 - 1
Build error on cpp/custom-dataset
#1255 opened by dribllerrad - 0
mnist freezes on test with ROCM
#1292 opened by jlo62 - 1
Does torchrun + FSDP create multiple copies of the same dataset and model?
#1289 opened by tsengalb99 - 10
[DOC] Update mnist.py example
#1270 opened by orion160 - 2
Multinode.py example fails
#1279 opened by rohan-mehta-1024 - 0
Larger image size for DCGAN code with Celeba dataset
#1278 opened by mahmoodn - 1
About the problem of multi-node running stuck
#1185 opened by AntyRia - 1
FSDP T5 Example not working
#1210 opened by YooSungHyun - 1
SequenceParallel sharding seems wrong
#1271 opened by marib00 - 2
resume train
#1194 opened by hefangnan - 1
reference of weight initialization for llama2 model
#1264 opened by SeunghyunSEO - 0
`local_rank` or `rank` for multi-node FSDP
#1263 opened by Emerald01 - 4
vision-transformer problem report
#1184 opened by ChenDaiwei-99 - 0
multi-node Tensor Parallel
#1257 opened by PieterZanders - 10
Drawbacks of making the C++ API look like Python
#1253 opened by dannypike - 1
RuntimeError: HIP error when running ResNet-50 on PRO W7900 with PyTorch
#1249 opened by liangyong928 - 0
`word_language_model` Different masking operation in two official tutorials
#1170 opened by ShengYun-Peng - 1
RuntimeError in Partialconv-master
#1241 opened by shaSaaliha - 0
Pytorch is insufficiently opinionated
#1242 opened - 1
Segmentation fault (core dumped) at `model(images)` for examples/imagenet/main.py
#1238 opened by MaoZiming - 1
Long training time for ResNet50 on ImageNet-1k
#1236 opened by iamsh4shank - 0
Testing a C++ case with MPI failed.
#1235 opened by alamj - 3
SSL Error When downloading dataset
#1233 opened by junpuf - 10
If I am training on a SINGLE GPU, should this "--dist-backend 'gloo'" argument be added to the command?
#1229 opened by HassanBinHaroon - 0
word_language_model/data.py - remove '<eos>'
#1228 opened by drtonyr - 0
word_language_model/data.py - two areas of redundant code
#1227 opened by drtonyr - 1
The doc build deployment has been failing since jan
#1218 opened by lancerts - 0
Daily CI failed
#1211 opened by github-actions - 0
RL Examples had bugs on current gym version
#1213 opened by sanggusti - 0
add examples/siamese_network with triplet loss example
#1208 opened by pax7 - 1
no any output when I try ddp
#1154 opened by YihuaXuCn - 0
word_language_model example throws UnicodeEncodeError
#1202 opened by miebster - 0
multi-node DDP
#1200 opened by Tabatabaei1999 - 0
Can not launch DDP training using distributed/ddp-tutorial-series/multigpu.py
#1199 opened by 480284856 - 0
add scaler.unscale_(optimizer) before clip_grad_norm_
#1196 opened by nickyi1990 - 0
Build failing on C++ 20 - M2 MacOS
#1195 opened by sjoptra - 0
main.py: TensorBoard in case of Multi-processing Distributed Data Parallel Training
#1190 opened by jecampagne - 0
Daily CI failed
#1183 opened by github-actions - 0
Add `save_model` arg to `mnist_hogwild` example
#1188 opened by pranavvp16 - 1
Argument parser does not recognise mps
#1182 opened by rociorey - 1
compile errors in cpp code
#1180 opened by AntonyM55 - 0
How to load Transformer model once using FSDP
#1179 opened by ToddMorrill - 0
Daily CI failed
#1171 opened by github-actions