thu-coai/DA-Transformer

training config

Closed this issue · 4 comments

huiwy commented

To replicate and build upon your results, I need a clear picture of the training configuration used in the experiments. Is examples/DA-Transformer/wmt14_ende.sh the config used to get the results in your paper? With that config, I found it impossible to finish 300,000 updates within 16 hours on 8×A100 GPUs.

Can you estimate the training time on your device?

By the way, if you have A100 GPUs with 80GB memory, training will be a little faster if you set max_tokens=8192 and update_freq=1.
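For reference, here is a minimal sketch of how those two settings typically map onto fairseq-train flags. The data path is a placeholder and the remaining flags are omitted; the actual values live in examples/DA-Transformer/wmt14_ende.sh:

```bash
# Sketch only: effective batch size = max_tokens × update_freq × num_gpus,
# so raising max_tokens while dropping update_freq to 1 can keep the
# effective batch size the same while removing a gradient-accumulation
# step per update.
max_tokens=8192    # per-GPU token budget; assumes 80GB A100s
update_freq=1      # gradient accumulation steps

fairseq-train data-bin/wmt14_ende \
    --max-tokens ${max_tokens} \
    --update-freq ${update_freq}
    # ...plus the architecture/optimizer flags from wmt14_ende.sh
```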

huiwy commented

It takes around 1 second per update, so the total training time would be about 83 hours.

According to our paper, the training process takes around 32 hours using 16xV100 GPUs. I have tried running the code on my 8xA100 server, and it typically completes 2 to 3 updates per second. At that rate, 300k updates would take approximately 30 to 40 hours.

I'm not sure why your server is running slow. To investigate, you can check GPU utilization (the GPU-Util column in nvidia-smi) to determine whether the slowdown comes from GPU computation or from other factors such as the CPU or disk I/O.
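For example, with standard monitoring tools (nothing here is specific to DA-Transformer):

```bash
# Watch per-GPU utilization while training runs; consistently low
# GPU-Util suggests the bottleneck is the CPU data pipeline or disk I/O
# rather than GPU computation.
watch -n 1 nvidia-smi

# Or log utilization and memory once per second for later inspection:
nvidia-smi dmon -s um -d 1

# Check whether slow disk reads are starving the data loader (sysstat):
iostat -x 2
```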

Please ensure the following:

  1. Make sure your server has sufficient CPU memory available. Check if there are any other programs consuming excessive CPU resources. Also, ensure that the disk I/O is fast. If you're reading the dataset from a remote disk, such as a cloud server, it could significantly slow down the process.
  2. Confirm that you are using ls_glat_decomposed_base (which indicates that lightseq is enabled) and that there are no "--torch-xxx" flags in your scripts (which would disable the custom CUDA operations); a quick way to check both is sketched after this list.
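The check could look like the following; the script path is the one mentioned above, and the grep patterns are only a rough sanity check:

```bash
SCRIPT=examples/DA-Transformer/wmt14_ende.sh

# The architecture name should carry the "ls_" prefix, which selects
# the lightseq-accelerated model variant:
grep -n "ls_glat_decomposed" ${SCRIPT}

# Any "--torch-*" flag switches back to plain PyTorch implementations
# and disables the custom CUDA operations:
grep -n -- "--torch-" ${SCRIPT}
```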
huiwy commented

I didn't use lightseq. It is much faster after switching to lightseq.