Train for 500 nodes

Question

souravsanyal06 opened this issue 4 years ago · 2 comments

I tried to train for 500 nodes, but the GPU runs out of memory (CUDA out of memory). Did you try with 500 nodes? If yes, did you do anything to solve this?

Answer 1 · 2021-01-15T04:16:44.000Z

Hi @souravsanyal06, thank you for your interest! You are correct, I also ran into OOM for large graphs. If you want to use this codebase for 500-node training, I suggest reducing the number of GNN layers as well as the batch size. However, I believe this will make it very hard to achieve good performance.

Another option is to use gradient accumulation with small batches but overall large virtual batches, which the codebase already provides.
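As a rough illustration of the gradient accumulation idea, here is a minimal PyTorch sketch. It is not the codebase's actual implementation: the helper `accumulate_gradients`, the tiny `nn.Linear` model, and the random data are hypothetical stand-ins chosen only to show the mechanics. Each small batch is run through `backward()` separately, so only one micro-batch of activations is held in memory at a time, and a single optimizer step is taken for the whole virtual batch.

```python
import torch
from torch import nn


def accumulate_gradients(model, loss_fn, inputs, targets, micro_batch_size):
    """Run backward over micro-batches, leaving the summed gradients in .grad."""
    num_samples = inputs.shape[0]
    for start in range(0, num_samples, micro_batch_size):
        x = inputs[start:start + micro_batch_size]
        y = targets[start:start + micro_batch_size]
        # Weight each micro-batch by its share of the virtual batch so the
        # accumulated gradient matches the mean loss over all samples.
        weight = x.shape[0] / num_samples
        loss = loss_fn(model(x), y) * weight
        loss.backward()  # gradients are added to .grad, not overwritten


torch.manual_seed(0)
model = nn.Linear(8, 1)  # hypothetical stand-in for the GNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
inputs, targets = torch.randn(32, 8), torch.randn(32, 1)  # assumed toy data

optimizer.zero_grad()
accumulate_gradients(model, loss_fn, inputs, targets, micro_batch_size=4)
optimizer.step()  # one update for the whole virtual batch of 32
```

For a plain model like this one, the accumulated gradient equals the gradient of a single batch of 32. Layers such as batch normalization compute their statistics per micro-batch, so for them the equivalence is only approximate.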

Finally, I would like to point you to our latest work on exactly this topic: how to learn on very large-scale TSPs. Here is the paper: https://arxiv.org/pdf/2006.07054.pdf; and the associated codebase: https://github.com/chaitjo/learning-tsp.

The new repo allows you to train on larger graphs than before, so do check it out depending on your use case.

Hope this helps :)

Answer 2 · 2021-01-15T06:22:31.000Z

Thanks a lot! This is very helpful.