awslabs/gluonts

PyTorch Lightning logs are not synchronised when using distributed training

admivsn opened this issue · 1 comment

Description

As described in the PyTorch Lightning documentation, epoch-level logs need to be synchronised across devices by passing `sync_dist=True` to `self.log`.

For example in DeepAR, I think the logging calls should pass this extra parameter when running distributed training.

I notice that when training on multi-GPU SageMaker instances I don't see a performance uplift compared to a single-GPU instance. I also get a warning from PyTorch Lightning:

It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
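To make the suggestion concrete, here is a minimal stand-alone sketch of the proposed change. The class and method bodies are hypothetical, not the actual gluonts source: a stub `log` method records its arguments so the presence of the flag can be checked, and the mean over per-rank values mimics what `sync_dist=True` would accomplish at epoch end in a real distributed run.

```python
# Illustrative sketch only: `DeepARLikeModule` stands in for a GluonTS
# Lightning module (e.g. DeepAR's); names and bodies are hypothetical.

class FakeLoggerMixin:
    """Minimal stub of LightningModule.log that just records its arguments."""

    def __init__(self):
        self.logged = {}

    def log(self, name, value, **kwargs):
        self.logged[name] = (value, kwargs)


class DeepARLikeModule(FakeLoggerMixin):
    def training_step(self, loss):
        # Proposed change: sync_dist=True, so that epoch-level aggregation
        # averages the metric across all devices instead of reporting only
        # the local rank's value.
        self.log("train_loss", loss, on_epoch=True, prog_bar=True, sync_dist=True)
        return loss


# What sync_dist=True effectively does at epoch end: reduce (here, mean)
# the metric across ranks. Values below are hypothetical, one per GPU.
per_rank_epoch_loss = [0.76, 0.74, 0.75, 0.77]
synced = sum(per_rank_epoch_loss) / len(per_rank_epoch_loss)

module = DeepARLikeModule()
module.training_step(0.75536)
assert module.logged["train_loss"][1]["sync_dist"] is True
print(f"synced train_loss: {synced:.5f}")
```

In an actual fix the flag would go into the `self.log(...)` calls inside the GluonTS Lightning modules; the stub here only demonstrates the intended call signature.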

To Reproduce

It's difficult to reproduce as I'm running a SageMaker Training job.

SM_CURRENT_INSTANCE_TYPE=ml.g4dn.12xlarge
SM_NUM_CPUS=48
SM_NUM_GPUS=4

You can see from the setup that there are 4 GPUs, which are detected by PyTorch Lightning, as the logs show:

2024-04-17 20:37:48 Starting - Starting the training job...
2024-04-17 20:38:05 Starting - Preparing the instances for training......
2024-04-17 20:39:10 Downloading - Downloading input data...
2024-04-17 20:39:29 Downloading - Downloading the training image............
2024-04-17 20:41:50 Training - Training image download completed. Training in progress........2024-04-17 20:42:45,440 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)

...

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:67: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/configuration_validator.py:74: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Missing logger folder: /opt/ml/code/lightning_logs
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Missing logger folder: /opt/ml/code/lightning_logs
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
Missing logger folder: /opt/ml/code/lightning_logs
Missing logger folder: /opt/ml/code/lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
  | Name  | Type        | Params | In sizes                                                        | Out sizes  
----------------------------------------------------------------------------------------------------------------------
0 | model | DeepARModel | 25.9 K | [[1, 1], [1, 1], [1, 1102, 4], [1, 1102], [1, 1102], [1, 1, 4]] | [1, 100, 1]
----------------------------------------------------------------------------------------------------------------------
25.9 K    Trainable params
0         Non-trainable params
25.9 K    Total params
0.104     Total estimated model params size (MB)
/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:433: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
Epoch 0, global step 50: 'train_loss' reached 0.75536 (best 0.75536), saving model to '/opt/ml/code/lightning_logs/version_0/checkpoints/epoch=0-step=50.ckpt' as top 1
Epoch 1, global step 100: 'train_loss' reached 0.72144 (best 0.72144), saving model to '/opt/ml/code/lightning_logs/version_0/checkpoints/epoch=1-step=100.ckpt' as top 1

Error message or code output

The particular warning of interest is:

/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:433: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.

Environment

  • Operating system:
  • Python version: 3.10
  • GluonTS version: 0.14.4
  • PyTorch version: 2.2.1
  • PyTorch Lightning version: 2.1.4


Updated this with some more info. Originally I thought it only occurred when using validation data, but upon investigation it seems to be a wider issue.