Lightning-AI/pytorch-lightning
Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
Python, Apache-2.0
Issues
Stream outputs from Trainer.predict()
#20334 opened by Turakar - 3
Mid-epoch resume causes a single unwanted validation step (which is not a sanity check)
#20288 opened by Youyoun - 3
Validation is incorrectly run on resume
#20277 opened by PiotrDabkowski - 0
Incorrect URI Prefix Stripping in MLflowLogger
#20279 opened by awindmann - 0
Model Checkpointing + FSDP causes Cuda OOM
#20312 opened by profPlum - 2
Import error on shutdown/KeyboardInterrupt if run from Jupyter Lab notebook cell
#20317 opened by asigalov61 - 1
Can't automatically resume a job, ckpt_path="hpc" throws ValueError from the start
#20347 opened by F-Barto - 6
Everything prints fine, but the loss doesn't descend
#20344 opened by 2catycm - 2
Gradient accumulation calculation may be incorrect
#20350 opened by tyler-rt - 0
Add support for S3 as a storage option for profiling results
#20348 opened by kimminw00 - 0
tensorboard step and self.global_step do not correspond under accumulate_grad
#20346 opened by wuzhiyue111 - 2
DDP and BackboneFinetuning: model weights get out of sync when unfreezing layers for training
#20340 opened by ksikka - 0
Improve how argument passing via CLI and config file is handled with regard to argument linking
#20341 opened by MrWhatZitToYaa - 3
Unreadable font color theme of YAML files
#20335 opened by MrWhatZitToYaa - 0
PyTorchProfiler: not showing CPU memory used even with `profile_memory=True`
#20339 opened by Jack12xl - 0
restore_training_state before on_fit_start?
#20338 opened by lampuiho - 1
`Trainer`'s `.init_module()` context does not initialize model on target device
#20307 opened by jin-zhe - 0
Add a Chinese version of README
#20332 opened by nocoding03 - 1
DeepSpeed Strategy doesn't set num_checkpoints while using activation partitions
#20329 opened by Gforky - 3
RuntimeError when running basic GAN model (from tutorial at lightning.ai) with DDP
#20328 opened by pranavrao-qure - 1
Bad practice in GAN example
#20331 opened by MrWhatZitToYaa - 1
Support A Variable Number of Batches
#20330 opened by e-yi - 0
Add list to torch.Tensor injection in yaml config
#20324 opened by fguiotte - 0
best-k-metrics in ModelCheckpoint
#20321 opened by gonzachiar - 2
Unable to serialize WandbLogger
#20315 opened by cwallenwein - 5
Inconsistent memory usage compared to Hugging Face Trainer when using DeepSpeed
#20299 opened by mickeysun0104 - 0
`hparams` not loaded when loading checkpoint via LightningCLI
#20310 opened by YouRik - 3
Split `reload_dataloaders_every_n_epochs` into separate parameters for train, val and test dataloaders
#20309 opened by windring - 1
`NeptuneCallback` produces lots of `X-coordinates (step) must be strictly increasing` errors
#20281 opened by iirekm - 1
NCCL backend fails during multi-node, multi-GPU training
#20306 opened by raketenolli - 0
The example that shows "The LightningModule also has access to the Hyperparameters" is not correct
#20303 opened by XinleiRen - 2
Problem in multi-gpu training
#20264 opened by xizaoqu - 0
Fabric does not sync gradients?
#20293 opened by RuABraun - 2
WandbLogger will cause error on TPU v3-8
#20278 opened by buoyancy99 - 1
Lightning places model inputs and model on different devices
#20276 opened by Kamichanw - 0
MLFlow logger returns None when MLFlow server is used
#20273 opened by lilruwu - 1
Custom batch sampler fails to re-instantiate in `_dataloader_init_kwargs_resolve_sampler`
#20272 opened by Kamichanw - 0
rich progress bar shows v_num as 0.000
#20268 opened by npuichigo - 0
`_update_dataloader` improperly copies state of subclassed dataloader with attribute names that differ from `__init__` parameters.
#20265 opened by spenceforce