google-research/smore

Evaluation gets stuck

Juanhui28 opened this issue · 8 comments

Hi,

It seems there is still a chance for the evaluation to get stuck. When we run train_shallow_wikikgv2.sh, it gets stuck in the evaluation after 4799999 steps. When we stop it with a keyboard interrupt, we get the following message:

[Screenshot 2022-10-13, 10:20:48 PM]

When we run train_concat_wikikgv2.sh, it gets stuck at the very first evaluation. When we stop it with a keyboard interrupt, it shows error messages similar to those from train_shallow_wikikgv2.sh.
[Screenshot 2022-10-13, 10:23:14 PM]

Could you please help to check? Any help is appreciated!
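
For anyone reproducing this hang, one way to see where the process is blocked without relying on a keyboard interrupt is Python's standard `faulthandler` module. The sketch below is not part of smore; the helper names `arm_hang_watchdog` and `disarm_hang_watchdog` are hypothetical, and it assumes you can edit the training entry point to call them around the evaluation step.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s, out=sys.stderr):
    """Dump the stack of every thread to `out` if not disarmed within timeout_s.

    Hypothetical helper, not a smore API. `out` must be a real file object
    (it needs a file descriptor). The process keeps running after the dump.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=out, exit=False)


def disarm_hang_watchdog():
    """Cancel a pending dump, e.g. once evaluation has finished normally."""
    faulthandler.cancel_dump_traceback_later()


# Usage sketch (assumed call site, adjust to the real evaluation loop):
#   arm_hang_watchdog(600)      # expect evaluation to finish within 10 minutes
#   run_evaluation()            # hypothetical name for the evaluation call
#   disarm_hang_watchdog()
```

If evaluation uses several worker processes, the watchdog would presumably need to be armed inside each worker, since it only reports on the process that called it.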

hyren commented

Hi, can you try running with a single GPU?

Hi,

We tried a single GPU with both train_shallow_wikikgv2.sh and train_concat_wikikgv2.sh, and both got stuck in the evaluation. Thanks.

hyren commented

Just to make sure, have you pulled the latest change? What is the script you are running? We will look into this and reproduce.

Hi, yes we have already pulled the latest change. We are running train_shallow_wikikgv2.sh and train_concat_wikikgv2.sh in the training/vec_scripts folder. Thanks!

Hi there,
I'm not sure whether the GPU is compatible with the async op. Could you please try adding the --train_async_rw=False flag?

Hi,
Thank you for the follow-up. We added this flag to the script. We found that there is still a chance for the training to get stuck with multiple GPUs, but it runs fine with a single GPU.
Thank you!

Really sorry for the back-and-forth! I guess it is mostly due to a compatibility issue with the customized kernel.
Would you mind sharing the versions of your CUDA, PyTorch, and Python?

Hi, the information is listed as follows:
CUDA: 11.6
pytorch: 1.12.1
python: 3.9
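
The versions listed above can also be collected programmatically, which makes reports like this easier to compare. This is a minimal sketch; `collect_versions` is a hypothetical helper name, and it falls back gracefully if PyTorch is not importable. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the system toolkit used to compile a custom kernel.

```python
import platform


def collect_versions():
    """Return Python, PyTorch, and CUDA version strings as a dict."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        # PyTorch is not available in this environment.
        info["pytorch"] = "not installed"
        info["cuda"] = "unknown"
        return info
    info["pytorch"] = torch.__version__
    # CUDA version PyTorch was built with; None on CPU-only builds.
    info["cuda"] = str(torch.version.cuda)
    # Number of GPUs visible to this process (0 if CUDA is unavailable).
    visible = torch.cuda.device_count() if torch.cuda.is_available() else 0
    info["visible_gpus"] = str(visible)
    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Running `nvcc --version` alongside this shows the toolkit version, which is the one that matters when a kernel is compiled locally.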