Lightning-AI/pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.

Python · Apache-2.0

Issues

Stream outputs from Trainer.predict()
#20334 opened 3 months ago by Turakar
1
_atomic_save with transaction causes "Invalid cross-device link" error
#20270 opened 4 months ago by RichardChe
3
Mid-epoch resume causes a single unwanted validation step (which is not a sanity check)
#20288 opened 3 months ago by Youyoun
2
Validation is incorrectly run on resume
#20277 opened 3 months ago by PiotrDabkowski
3
Incorrect URI Prefix Stripping in MLflowLogger
#20279 opened a month ago by awindmann
0
Custom PyTorch BatchSampler does not work well with PyTorch Lightning
#20326 opened a month ago by dadwadw233
0
SLURM resubmission crashes because of multiprocessing error
#20280 opened 3 months ago by antonzub99
2
`LightningCLI` doesn't fail when `config.yaml` contains invalid arguments
#20337 opened 2 months ago by adosar
4
save_hyperparameters no longer respects linked arguments.
#20311 opened 3 months ago by Erotemic
4
Model Checkpointing + FSDP causes Cuda OOM
#20312 opened 3 months ago by profPlum
1
Import error on shutdown/KeyboardInterrupt if run from Jupyter Lab notebook cell
#20317 opened 3 months ago by asigalov61
2
Can't automatically resume a job, ckpt_path="hpc" throws ValueError from the start
#20347 opened 2 months ago by F-Barto
1
Everything prints fine, but the loss doesn't decrease
#20344 opened 2 months ago by 2catycm
6
LearningRateFinder creates errors for schedulers in `val` stage
#20355 opened 2 months ago by DeanLa
2
Gradient accumulation calculation may be incorrect
#20350 opened 2 months ago by tyler-rt
0
Add support for S3 as a storage option for profiling results
#20348 opened 2 months ago by kimminw00
0
tensorboard step and self.global_step do not correspond under accumulate_grad
#20346 opened 2 months ago by wuzhiyue111
0
DDP and BackboneFinetuning: model weights get out of sync when unfreezing layers for training
#20340 opened 2 months ago by ksikka
2
Improve how argument passing via CLI and config file is handled with regard to argument linking
#20341 opened 2 months ago by MrWhatZitToYaa
0
Unreadable font color theme of YAML files
#20335 opened 3 months ago by MrWhatZitToYaa
3
PyTorchProfiler: not showing CPU memory used even with `profile_memory=True`
#20339 opened 2 months ago by Jack12xl
0
restore_training_state before on_fit_start?
#20338 opened 2 months ago by lampuiho
0
`Trainer`'s `.init_module()` context does not initialize model on target device
#20307 opened 3 months ago by jin-zhe
1
Add a Chinese version of README
#20332 opened 3 months ago by nocoding03
0
DeepSpeed Strategy doesn't set num_checkpoints while using activation partitions
#20329 opened 3 months ago by Gforky
1
RuntimeError when running basic GAN model (from tutorial at lightning.ai) with DDP
#20328 opened 3 months ago by pranavrao-qure
3
`strict = False` does not work when the checkpoint is distributed
#20274 opened 3 months ago by NathanGodey
1
Bad practice in GAN example
#20331 opened 3 months ago by MrWhatZitToYaa
1
Support A Variable Number of Batches
#20330 opened 3 months ago by e-yi
1
Add list to torch.Tensor injection in yaml config
#20324 opened 3 months ago by fguiotte
0
best-k-metrics in ModelCheckpoint
#20321 opened 3 months ago by gonzachiar
0
Unable to serialize WandbLogger
#20315 opened 3 months ago by cwallenwein
2
Inconsistent memory usage compared to Hugging Face Trainer when using DeepSpeed
#20299 opened 3 months ago by mickeysun0104
5
`hparams` not loaded when loading checkpoint via LightningCLI
#20310 opened 3 months ago by YouRik
0
Split `reload_dataloaders_every_n_epochs` into separate parameters for train, val and test dataloaders
#20309 opened 3 months ago by windring
3
`NeptuneCallback` produces lots of `X-coordinates (step) must be strictly increasing` errors
#20281 opened 3 months ago by iirekm
1
The problem shows: version incompatibility from v1.3.x to v2.4
#20308 opened 3 months ago by sunhan3787
1
NCCL backend fails during multi-node, multi-GPU training
#20306 opened 3 months ago by raketenolli
0
The example that shows "The LightningModule also has access to the Hyperparameters" is not correct
#20303 opened 3 months ago by XinleiRen
0
Problem in multi-gpu training
#20264 opened 4 months ago by xizaoqu
2
RichProgressBar: refresh_rate doesn't affect metric_component
#20300 opened 3 months ago by marios1861
0
Error encountered while using multiple optimizers inside a loop.
#20296 opened 3 months ago by RAraghavarora
0
Fabric does not sync gradients?
#20293 opened 3 months ago by RuABraun
0
Saving a checkpoint every n epochs does not work as expected
#20282 opened 3 months ago by olly-writes-code
2
WandbLogger will cause error on TPU v3-8
#20278 opened 3 months ago by buoyancy99
0
Lightning places model inputs and model on different devices
#20276 opened 3 months ago by Kamichanw
1
MLFlow logger returns None when MLFlow server is used
#20273 opened 4 months ago by lilruwu
0
Custom batch sampler fails to re-instantiate in `_dataloader_init_kwargs_resolve_sampler`
#20272 opened 4 months ago by Kamichanw
1
rich progress bar shows v_num as 0.000
#20268 opened 4 months ago by npuichigo
0
`_update_dataloader` improperly copies state of subclassed dataloader with attribute names that differ from `__init__` parameters.
#20265 opened 4 months ago by spenceforce
0