Pinned issues
Deprecation Notice: ray_lightning to be Replaced with New LightningTrainer in Ray 2.4
#258 opened by woshiyyya - 0
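The pinned notice above points users at the Ray AIR replacement. Below is a minimal before/after sketch, assuming the ray_lightning 0.3 `RayStrategy` API and the `LightningTrainer`/`LightningConfigBuilder` API introduced in Ray 2.4; `MyLightningModule` and `train_loader` are hypothetical placeholders.

```python
# Before: ray_lightning (deprecated) plugs into the PTL Trainer as a strategy.
import pytorch_lightning as pl
from ray_lightning import RayStrategy  # named RayPlugin before ray_lightning 0.3

trainer = pl.Trainer(
    max_epochs=10,
    strategy=RayStrategy(num_workers=4, use_gpu=True),
)
trainer.fit(MyLightningModule(), train_loader)  # hypothetical module and loader

# After: Ray >= 2.4 ships LightningTrainer in Ray AIR instead.
from ray.air.config import ScalingConfig
from ray.train.lightning import LightningConfigBuilder, LightningTrainer

lightning_config = (
    LightningConfigBuilder()
    .module(cls=MyLightningModule)               # hypothetical module
    .trainer(max_epochs=10)
    .fit_params(train_dataloaders=train_loader)  # hypothetical loader
    .build()
)
LightningTrainer(
    lightning_config=lightning_config,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
).fit()
```

Issues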
Trials did not complete error
#257 opened by Bk073 - 2
Trials hang when using a scheduler
#253 opened by dcfidalgo - 0 (see the Tune integration sketch after this list)
What happens with custom samplers?
#252 opened by AugustoPeres - 0 (see the sampler sketch after this list)
RuntimeError: Error(s) in loading state_dict: Unexpected key(s) when recovering results from main process during Trainer.fit()
#246 opened by davzaman - 0
TPU support?
#245 opened by platers - 0
[Question] Is it necessary to adapt report and Checkpointing to the newly introduced session and Checkpoint API of Ray AIR?
#243 opened by MarkusSpanring - 1
Error when using WandbLogger
#205 opened by KwanWaiChung - 1
Rank Zero Deprecation
#230 opened by lcaquot94 - 1
Cannot checkpoint and log
#228 opened by lcaquot94 - 0
Ray lightning opens a new mlflow run
#225 opened by AugustoPeres - 2
TuneReportCheckpointCallback error
#219 opened by jakubMitura14 - 0
population based scheduler error
#220 opened by jakubMitura14 - 5
no training starts although the status flag shows running
#216 opened by jakubMitura14 - 1
Teardown after trainer.fit() takes exceptionally long when using RayStrategy with large models
#207 opened by MarkusSpanring - 0
Deterministic mode is not set on remote worker when `Trainer` is set to `deterministic`
#213 opened by MarkusSpanring - 4
Question: Why use ray_lightning instead of pytorch_lightning for multi-node training?
#212 opened by saryazdi - 1
Worker nodes don't start for ray-lightning & AWS
#210 opened by toru34 - 0
adding the version in `__init__`
#191 opened by JiahaoYao - 6
Distributed training performance slowdown when resuming from a checkpoint.
#184 opened by subhashbylaiah - 0
`ray_horovod` leaks GPU memory on `cuda:0`
#181 opened by JiahaoYao - 2
`ray_horovod` multi pid process in the `run`
#182 opened by JiahaoYao - 0
`ray_ddp` issue of `Leaking Caffe2 thread-pool after fork. (function pthreadpool)`
#180 opened by JiahaoYao - 3
`ray_ddp` gpu issue
#179 opened by JiahaoYao - 1
`ray_ddp` global and local rank
#175 opened by JiahaoYao - 1
tune test: do we need to count the head node CPU?
#178 opened by JiahaoYao - 2
`ray_ddp` showing no GPU usage
#177 opened by JiahaoYao - 1
`ray_ddp` the progress bar is broken
#176 opened by JiahaoYao - 10
ray ddp fails with 2 gpu workers
#174 opened by JiahaoYao - 0
`shard-ddp` test of system exit
#173 opened by JiahaoYao - 1
warning in the ci test (change the deprecated api)
#172 opened by JiahaoYao - 0
does torch remove the checkpoint when `is_global_zero` is not set? (multi-worker setting)
#171 opened by JiahaoYao - 1
logging is changed in the new version of pytorch lightning
#170 opened by JiahaoYao - 0
change the `checkpoint_callback=True`
#169 opened by JiahaoYao - 3
warning from the horovod trainer
#168 opened by JiahaoYao - 0
horovod lightning integration missing the log dir
#167 opened by JiahaoYao - 1
horovod installation issue
#165 opened by JiahaoYao - 1
trainer is not consistent during `ray_ddp`
#160 opened by JiahaoYao - 0
Using LightningCLI to parse plugin options from the config file fails when using the RayPlugin.
#151 opened by subhashbylaiah - 0
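Several of the Tune-related reports above (#253, #219, #220) touch the same integration surface, so a minimal sketch of how ray_lightning wires into Ray Tune may help orient readers. It assumes the documented `ray_lightning.tune` helpers; `MyLightningModule`, the `val_loss` metric, and the search space are hypothetical.

```python
import pytorch_lightning as pl
from ray import tune
from ray_lightning import RayStrategy
from ray_lightning.tune import TuneReportCheckpointCallback, get_tune_resources

def train_fn(config):
    model = MyLightningModule(lr=config["lr"])  # hypothetical module
    trainer = pl.Trainer(
        max_epochs=4,
        strategy=RayStrategy(num_workers=2, use_gpu=False),
        callbacks=[
            # Report `val_loss` to Tune and write a checkpoint in one step.
            TuneReportCheckpointCallback(
                metrics={"val_loss": "val_loss"}, on="validation_end"
            )
        ],
    )
    trainer.fit(model)

analysis = tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    num_samples=4,
    metric="val_loss",
    mode="min",
    # Each trial spawns its own Ray workers, so Tune must reserve
    # resources for them up front rather than for the trial alone.
    resources_per_trial=get_tune_resources(num_workers=2, use_gpu=False),
)
```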
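For the custom-sampler question (#252): as with plain DDP, Lightning swaps a `DistributedSampler` into the dataloaders unless sampler replacement is disabled. Here is a sketch of the usual Lightning-side workaround, assuming PyTorch Lightning < 2.0 (where the flag is named `replace_sampler_ddp`); the sampler, `weights`, `dataset`, and `MyLightningModule` are illustrative placeholders, and with replacement disabled the user becomes responsible for sharding data across workers.

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader, WeightedRandomSampler
from ray_lightning import RayStrategy

# Illustrative custom sampler; `weights` and `dataset` are placeholders.
sampler = WeightedRandomSampler(weights=weights, num_samples=len(dataset))
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

trainer = pl.Trainer(
    strategy=RayStrategy(num_workers=2),
    # Keep the custom sampler instead of letting Lightning replace it
    # with a DistributedSampler; sharding is now the user's job.
    replace_sampler_ddp=False,
)
trainer.fit(MyLightningModule(), loader)  # hypothetical module
```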