k2-fsa/icefall

Different Training Loss with Single Node (8 GPUs) vs. Two Nodes (4 GPUs Each)

dohe0342 opened this issue

Description:

I am seeing a discrepancy in training loss between two GPU configurations when training the Zipformer model: a single node with 8 GPUs versus two nodes with 4 GPUs each (world size 8 in both cases). With identical hyperparameters and seed, the loss curves diverge noticeably from early in epoch 1.

Details:
• Single Node Configuration:

export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
./zipformer/train.py \
  --world-size 8 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-ctc-rnnt \
  --causal 0 \
  --use-transducer 1 \
  --use-ctc 1 \
  --ctc-loss-scale 0.2 \
  --full-libri 1 \
  --max-duration 2000
2024-05-17 01:40:34,940 INFO [train.py:1102] (0/8) Training started
2024-05-17 01:40:34,947 INFO [train.py:1112] (0/8) Device: cuda:0
2024-05-17 01:40:34,952 INFO [train.py:1124] (0/8) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'ff1d435a8d3c4eaa15828a84a7240678a70539a7', 'k2-git-date': 'Fri Feb 23 01:48:38 2024', 'lhotse-version': '1.22.0.dev+git.d8ed1bb.clean', 'torch-version': '2.2.1', 'torch-cuda-available': True, 'torch-cuda-version': '12.1', 'python-version': '3.1', 'icefall-git-branch': 'main', 'icefall-git-sha1': '1101307-clean', 'icefall-git-date': 'Fri May 17 01:39:56 2024', 'icefall-path': '/workspace/icefall_kt', 'k2-path': '/opt/conda/lib/python3.10/site-packages/k2/__init__.py', 'lhotse-path': '/opt/conda/lib/python3.10/site-packages/lhotse/__init__.py', 'hostname': 'k-atc15', 'IP address': '10.10.0.15'}, 'world_size': 8, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 40, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/zipformer_single-node_bs2000'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': True, 'full_libri': True, 'mini_libri': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 2000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 9, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}
2024-05-17 01:40:34,952 INFO [train.py:1126] (0/8) About to create model
2024-05-17 01:40:35,764 INFO [train.py:1130] (0/8) Number of model parameters: 65805511
2024-05-17 01:40:36,649 INFO [train.py:1145] (0/8) Using DDP
2024-05-17 01:40:39,090 INFO [asr_datamodule.py:436] (0/8) About to get the shuffled train-clean-100,             train-clean-360 and train-other-500 cuts
2024-05-17 01:40:39,091 INFO [asr_datamodule.py:232] (0/8) Enable MUSAN
2024-05-17 01:40:39,092 INFO [asr_datamodule.py:233] (0/8) About to get Musan cuts
2024-05-17 01:40:41,394 INFO [asr_datamodule.py:257] (0/8) Enable SpecAugment
2024-05-17 01:40:41,394 INFO [asr_datamodule.py:258] (0/8) Time warp factor: 80
2024-05-17 01:40:41,394 INFO [asr_datamodule.py:268] (0/8) Num frame mask: 10
2024-05-17 01:40:41,394 INFO [asr_datamodule.py:281] (0/8) About to create train dataset
2024-05-17 01:40:41,394 INFO [asr_datamodule.py:308] (0/8) Using DynamicBucketingSampler.
2024-05-17 01:40:42,645 INFO [asr_datamodule.py:325] (0/8) About to create train dataloader
2024-05-17 01:40:42,646 INFO [asr_datamodule.py:453] (0/8) About to get dev-clean cuts
2024-05-17 01:40:42,647 INFO [asr_datamodule.py:460] (0/8) About to get dev-other cuts
2024-05-17 01:40:42,648 INFO [asr_datamodule.py:356] (0/8) About to create dev dataset
2024-05-17 01:40:42,919 INFO [asr_datamodule.py:373] (0/8) About to create dev dataloader
2024-05-17 01:40:42,919 INFO [train.py:1349] (0/8) Sanity check -- see if any of the batches in epoch 1 would cause OOM.
2024-05-17 01:45:02,415 INFO [train.py:1377] (0/8) Maximum memory allocated so far is 28221MB
2024-05-17 01:45:04,445 INFO [train.py:1377] (0/8) Maximum memory allocated so far is 28221MB
2024-05-17 01:45:06,699 INFO [train.py:1377] (0/8) Maximum memory allocated so far is 28221MB
2024-05-17 01:45:08,975 INFO [train.py:1377] (0/8) Maximum memory allocated so far is 28221MB
2024-05-17 01:45:11,624 INFO [train.py:1377] (0/8) Maximum memory allocated so far is 28221MB
2024-05-17 01:45:14,296 INFO [train.py:1377] (0/8) Maximum memory allocated so far is 28221MB
2024-05-17 01:46:28,187 INFO [train.py:1034] (0/8) Epoch 1, batch 0, loss[loss=8.493, simple_loss=6.845, pruned_loss=6.741, ctc_loss=4.865, over 49332.00 frames. ], tot_loss[loss=8.493, simple_loss=6.845, pruned_loss=6.741, ctc_loss=4.865, over 49332.00 frames. ], batch size: 115, lr: 2.25e-02, grad_scale: 1.0
2024-05-17 01:46:28,189 INFO [train.py:1057] (0/8) Computing validation loss
2024-05-17 01:46:36,929 INFO [train.py:1066] (0/8) Epoch 1, validation: loss=8.6, simple_loss=6.902, pruned_loss=6.767, ctc_loss=5.098, over 944034.00 frames. 
2024-05-17 01:46:36,930 INFO [train.py:1067] (0/8) Maximum memory allocated so far is 30678MB
2024-05-17 01:47:11,579 WARNING [optim.py:487] (0/8) Clipping_scale=2.0, grad-norm quartiles 9.180e+03 1.089e+04 1.176e+04 1.475e+04 1.540e+04, threshold=4.705e+04, percent-clipped=0.0
2024-05-17 01:47:27,666 WARNING [optim.py:487] (0/8) Clipping_scale=2.0, grad-norm quartiles 1.545e+03 6.219e+03 1.089e+04 1.382e+04 1.593e+04, threshold=4.356e+04, percent-clipped=0.0
2024-05-17 01:48:09,466 WARNING [optim.py:487] (0/8) Clipping_scale=2.0, grad-norm quartiles 1.433e+03 2.173e+03 3.548e+03 1.089e+04 1.593e+04, threshold=1.419e+04, percent-clipped=0.0
2024-05-17 01:48:24,230 INFO [train.py:1034] (0/8) Epoch 1, batch 50, loss[loss=1.621, simple_loss=1.207, pruned_loss=1.377, ctc_loss=1.301, over 49372.00 frames. ], tot_loss[loss=3.721, simple_loss=3.037, pruned_loss=2.626, ctc_loss=2.074, over 2196903.96 frames. ], batch size: 138, lr: 2.48e-02, grad_scale: 0.125
2024-05-17 01:50:01,749 INFO [train.py:1034] (0/8) Epoch 1, batch 100, loss[loss=1.469, simple_loss=1.053, pruned_loss=1.319, ctc_loss=1.255, over 49337.00 frames. ], tot_loss[loss=2.498, simple_loss=1.971, pruned_loss=1.9, ctc_loss=1.602, over 3840175.08 frames. ], batch size: 148, lr: 2.70e-02, grad_scale: 0.25
2024-05-17 01:50:08,608 WARNING [optim.py:487] (0/8) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 2.555e+02 6.058e+02 2.486e+03 1.593e+04, threshold=1.212e+03, percent-clipped=0.0
2024-05-17 01:51:37,388 INFO [train.py:1034] (0/8) Epoch 1, batch 150, loss[loss=1.186, simple_loss=0.8216, pruned_loss=1.026, ctc_loss=1.117, over 49302.00 frames. ], tot_loss[loss=2, simple_loss=1.532, pruned_loss=1.585, ctc_loss=1.424, over 5119584.39 frames. ], batch size: 127, lr: 2.93e-02, grad_scale: 0.25
2024-05-17 01:53:16,825 INFO [train.py:1034] (0/8) Epoch 1, batch 200, loss[loss=1.175, simple_loss=0.8008, pruned_loss=0.9635, ctc_loss=1.153, over 49357.00 frames. ], tot_loss[loss=1.726, simple_loss=1.288, pruned_loss=1.386, ctc_loss=1.336, over 6132449.37 frames. ], batch size: 129, lr: 3.15e-02, grad_scale: 0.5
2024-05-17 01:53:23,316 WARNING [optim.py:487] (0/8) Clipping_scale=2.0, grad-norm quartiles 8.904e+01 1.659e+02 2.309e+02 3.177e+02 6.286e+02, threshold=4.619e+02, percent-clipped=0.0
2024-05-17 01:54:56,246 INFO [train.py:1034] (0/8) Epoch 1, batch 250, loss[loss=1.14, simple_loss=0.7688, pruned_loss=0.9052, ctc_loss=1.135, over 49466.00 frames. ], tot_loss[loss=1.553, simple_loss=1.135, pruned_loss=1.246, ctc_loss=1.28, over 6899154.91 frames. ], batch size: 131, lr: 3.38e-02, grad_scale: 0.5
2024-05-17 01:55:26,371 INFO [scaling.py:1119] (0/8) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=4.082e-01
• Two Nodes Configuration: same command-line options as the single-node run, launched across two hosts (a launch sketch is included after the log below)
• GPUs: 4 per node, 8 GPUs total (world size 8)
• Training Loss:
2024-05-20 06:46:48,829 INFO [train_multinode.py:1128] (0/8) Training started
2024-05-20 06:46:48,839 INFO [train_multinode.py:1139] (0/8) Device: cuda:0, rank: 0, local_rank: 0
2024-05-20 06:46:48,842 INFO [train_multinode.py:1152] (0/8) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'ff1d435a8d3c4eaa15828a84a7240678a70539a7', 'k2-git-date': 'Fri Feb 23 01:48:38 2024', 'lhotse-version': '1.22.0.dev+git.d8ed1bb.clean', 'torch-version': '2.2.1', 'torch-cuda-available': True, 'torch-cuda-version': '12.1', 'python-version': '3.1', 'icefall-git-branch': 'main', 'icefall-git-sha1': 'af9a696-clean', 'icefall-git-date': 'Mon May 20 06:46:37 2024', 'icefall-path': '/workspace/icefall_kt', 'k2-path': '/opt/conda/lib/python3.10/site-packages/k2/__init__.py', 'lhotse-path': '/opt/conda/lib/python3.10/site-packages/lhotse/__init__.py', 'hostname': 'k-atc13', 'IP address': '10.10.0.13'}, 'world_size': 8, 'use_multi_node': True, 'master_port': 12356, 'tensorboard': True, 'num_epochs': 40, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/zipformer_multi2_lr0.045'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': True, 'full_libri': True, 'mini_libri': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 2000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 9, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}
2024-05-20 06:46:48,842 INFO [train_multinode.py:1154] (0/8) About to create model
2024-05-20 06:46:49,525 INFO [train_multinode.py:1158] (0/8) Number of model parameters: 65805511
2024-05-20 06:46:50,879 INFO [train_multinode.py:1173] (0/8) Using DDP:0
2024-05-20 06:46:52,616 INFO [asr_datamodule.py:449] (0/8) About to get the shuffled train-clean-100,             train-clean-360 and train-other-500 cuts
2024-05-20 06:46:52,617 INFO [asr_datamodule.py:234] (0/8) Enable MUSAN
2024-05-20 06:46:52,618 INFO [asr_datamodule.py:235] (0/8) About to get Musan cuts
2024-05-20 06:46:54,953 INFO [asr_datamodule.py:259] (0/8) Enable SpecAugment
2024-05-20 06:46:54,953 INFO [asr_datamodule.py:260] (0/8) Time warp factor: 80
2024-05-20 06:46:54,953 INFO [asr_datamodule.py:270] (0/8) Num frame mask: 10
2024-05-20 06:46:54,954 INFO [asr_datamodule.py:283] (0/8) About to create train dataset
2024-05-20 06:46:54,954 INFO [asr_datamodule.py:310] (0/8) Using DynamicBucketingSampler.
2024-05-20 06:46:56,227 INFO [asr_datamodule.py:331] (0/8) About to create train dataloader
2024-05-20 06:46:56,228 INFO [asr_datamodule.py:466] (0/8) About to get dev-clean cuts
2024-05-20 06:46:56,229 INFO [asr_datamodule.py:473] (0/8) About to get dev-other cuts
2024-05-20 06:46:56,230 INFO [asr_datamodule.py:367] (0/8) About to create dev dataset
2024-05-20 06:46:56,501 INFO [asr_datamodule.py:386] (0/8) About to create dev dataloader
2024-05-20 06:47:39,210 INFO [train_multinode.py:1050] (0/8) Epoch 1, batch 0, loss[loss=8.493, simple_loss=6.845, pruned_loss=6.741, ctc_loss=4.865, over 49332.00 frames. ], tot_loss[loss=8.493, simple_loss=6.845, pruned_loss=6.741, ctc_loss=4.865, over 49332.00 frames. ], batch size: 115, lr: 2.25e-02, grad_scale: 1.0
2024-05-20 06:47:39,211 INFO [train_multinode.py:1073] (0/8) Computing validation loss
2024-05-20 06:47:42,665 INFO [train_multinode.py:1082] (0/8) Epoch 1, validation: loss=8.6, simple_loss=6.902, pruned_loss=6.767, ctc_loss=5.098, over 944034.00 frames. 
2024-05-20 06:47:42,666 INFO [train_multinode.py:1083] (0/8) Maximum memory allocated so far is 27713MB
2024-05-20 06:48:06,659 WARNING [optim.py:487] (0/8) Clipping_scale=2.0, grad-norm quartiles 9.248e+03 1.090e+04 1.162e+04 1.493e+04 1.599e+04, threshold=4.648e+04, percent-clipped=0.0
2024-05-20 06:48:24,428 WARNING [optim.py:487] (0/8) Clipping_scale=2.0, grad-norm quartiles 5.404e+03 1.090e+04 1.571e+04 4.438e+04 4.201e+05, threshold=6.285e+04, percent-clipped=15.0
2024-05-20 06:49:06,044 WARNING [optim.py:487] (0/8) Clipping_scale=2.0, grad-norm quartiles 3.034e+03 5.404e+03 1.162e+04 3.348e+04 4.201e+05, threshold=4.648e+04, percent-clipped=5.0
2024-05-20 06:49:11,054 INFO [train_multinode.py:1050] (0/8) Epoch 1, batch 50, loss[loss=1.772, simple_loss=1.32, pruned_loss=1.443, ctc_loss=1.458, over 49372.00 frames. ], tot_loss[loss=3.947, simple_loss=3.193, pruned_loss=2.685, ctc_loss=2.397, over 2196903.96 frames. ], batch size: 138, lr: 2.48e-02, grad_scale: 0.015625
2024-05-20 06:50:24,483 INFO [scaling.py:1119] (0/8) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-05-20 06:50:40,084 INFO [train_multinode.py:1050] (0/8) Epoch 1, batch 100, loss[loss=1.536, simple_loss=1.102, pruned_loss=1.36, ctc_loss=1.319, over 49337.00 frames. ], tot_loss[loss=2.633, simple_loss=2.061, pruned_loss=1.94, ctc_loss=1.807, over 3840175.08 frames. ], batch size: 148, lr: 2.70e-02, grad_scale: 0.03125
2024-05-20 06:50:51,309 WARNING [optim.py:487] (0/8) Clipping_scale=2.0, grad-norm quartiles 2.249e+02 5.737e+02 3.127e+03 1.571e+04 4.321e+05, threshold=6.254e+03, percent-clipped=5.0
2024-05-20 06:52:09,509 INFO [train_multinode.py:1050] (0/8) Epoch 1, batch 150, loss[loss=1.207, simple_loss=0.8368, pruned_loss=1.059, ctc_loss=1.119, over 49302.00 frames. ], tot_loss[loss=2.094, simple_loss=1.596, pruned_loss=1.626, ctc_loss=1.552, over 5119584.39 frames. ], batch size: 127, lr: 2.93e-02, grad_scale: 0.03125
2024-05-20 06:53:46,033 INFO [train_multinode.py:1050] (0/8) Epoch 1, batch 200, loss[loss=1.186, simple_loss=0.8088, pruned_loss=0.9797, ctc_loss=1.154, over 49357.00 frames. ], tot_loss[loss=1.792, simple_loss=1.334, pruned_loss=1.42, ctc_loss=1.42, over 6132449.37 frames. ], batch size: 129, lr: 3.15e-02, grad_scale: 0.0625
2024-05-20 06:54:06,999 WARNING [optim.py:487] (0/8) Clipping_scale=2.0, grad-norm quartiles 8.927e+01 1.526e+02 2.079e+02 3.394e+02 3.408e+03, threshold=4.158e+02, percent-clipped=0.0
2024-05-20 06:55:27,049 INFO [train_multinode.py:1050] (0/8) Epoch 1, batch 250, loss[loss=1.147, simple_loss=0.7717, pruned_loss=0.9151, ctc_loss=1.145, over 49466.00 frames. ], tot_loss[loss=1.601, simple_loss=1.168, pruned_loss=1.273, ctc_loss=1.339, over 6899154.91 frames. ], batch size: 131, lr: 3.38e-02, grad_scale: 0.0625
2024-05-20 06:56:59,127 INFO [train_multinode.py:1050] (0/8) Epoch 1, batch 300, loss[loss=1.261, simple_loss=0.8356, pruned_loss=1.003, ctc_loss=1.263, over 49494.00 frames. ], tot_loss[loss=1.478, simple_loss=1.057, pruned_loss=1.175, ctc_loss=1.29, over 7513477.20 frames. ], batch size: 151, lr: 3.60e-02, grad_scale: 0.125
2024-05-20 06:57:09,926 WARNING [optim.py:487] (0/8) Clipping_scale=2.0, grad-norm quartiles 1.013e+02 2.217e+02 4.333e+02 7.160e+02 1.237e+03, threshold=8.666e+02, percent-clipped=49.0
2024-05-20 06:58:29,737 INFO [train_multinode.py:1050] (0/8) Epoch 1, batch 350, loss[loss=1.119, simple_loss=0.7369, pruned_loss=0.8712, ctc_loss=1.108, over 49448.00 frames. ], tot_loss[loss=1.394, simple_loss=0.9808, pruned_loss=1.103, ctc_loss=1.257, over 7978506.85 frames. ], batch size: 124, lr: 3.83e-02, grad_scale: 0.125
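
For reference, the two-node run uses a single process group of world size 8 spanning both hosts. The sketch below shows roughly how such a launch looks with torchrun; the host address, port, script name, and exp-dir are taken from the log above, but the exact invocation is illustrative rather than a verbatim copy of my job script (train_multinode.py is a lightly modified copy of zipformer/train.py with multi-node rank handling):

# On node 0 (master, 10.10.0.13); node 1 runs the same command with --node_rank 1
torchrun \
  --nnodes 2 \
  --nproc_per_node 4 \
  --node_rank 0 \
  --master_addr 10.10.0.13 \
  --master_port 12356 \
  ./zipformer/train_multinode.py \
  --world-size 8 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/zipformer_multi2_lr0.045 \
  --causal 0 \
  --use-transducer 1 \
  --use-ctc 1 \
  --ctc-loss-scale 0.2 \
  --full-libri 1 \
  --max-duration 2000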

Observations:
• The training loss converges differently between the two configurations: at epoch 1, batch 50, tot_loss is 3.721 on the single node vs. 3.947 on two nodes, and the gap persists through batch 250 (1.553 vs. 1.601).
• The two-node run also clips gradients far more aggressively (percent-clipped reaches 49.0, vs. 0.0 throughout the single-node log), keeps a lower grad_scale (0.015625 vs. 0.125 at batch 50), and its WithLoss diagnostic for encoder.encoders.4.encoder.layers.1.self_attn_weights reports loss-sum=0.000e+00.
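
To help rule out an inter-node communication problem, I can run a minimal NCCL all-reduce sanity check under the same two-node torchrun layout. This is a standalone script, not part of icefall; every rank should print the same reduced value regardless of how ranks are distributed over the nodes:

# allreduce_check.py -- minimal NCCL sanity check, launched with the same
# torchrun arguments as the training job.
import os
import torch
import torch.distributed as dist

def main():
    # torchrun provides RANK, WORLD_SIZE and LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each rank contributes (rank + 1); after SUM all-reduce, every rank must
    # hold 1 + 2 + ... + world_size, independent of the node layout.
    x = torch.full((1024,), float(rank + 1), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    expected = world_size * (world_size + 1) // 2
    print(f"rank {rank}/{world_size}: got {x[0].item()}, expected {expected}", flush=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()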

Expected Behavior:
The training loss should be essentially identical (up to minor numerical noise) across GPU configurations with the same total world size, assuming all other hyperparameters and settings are kept constant; with world size 8, seed 42, and max-duration 2000 in both runs, I would not expect the curves to diverge this much.
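
Since both runs use world size 8, DDP's gradient averaging and the per-step amount of data should match, so one remaining suspect is how the data is sharded across ranks in the two setups. A diagnostic I can add on my side (a sketch only; it assumes the names train_dl and rank from run() in zipformer/train.py, and that return_cuts=True so each supervision carries its cut) is to log the cut IDs of the first few batches on every rank and diff the logs between the two runs:

# Sketch of a per-rank data check, inserted right after train_dl is created
# in run(); not part of the recipe.
import logging

for batch_idx, batch in enumerate(train_dl):
    cut_ids = [cut.id for cut in batch["supervisions"]["cut"]]
    logging.info(
        f"rank {rank}: batch {batch_idx}, {len(cut_ids)} cuts, "
        f"first ids: {cut_ids[:3]}"
    )
    if batch_idx == 2:
        break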

Environment:
• PyTorch Version: 2.2.1
• CUDA Version: 12.1

Would appreciate any insights or suggestions on why this discrepancy might be occurring and how to resolve it.

Thank you!