ShengcaiLiao/TransMatcher

Out-Of-Memory regardless of setting batch_size

haithanhp opened this issue · 10 comments

I ran your code in main.py of TransMatcher, but the out-of-memory issue is still there regardless of the batch_size setting. Do you know how to modify the code to make it more memory-efficient? The line where the issue occurs:

score = einsum('q t d, k s d -> q k s t', query, key) * self.score_embed.sigmoid()
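For context, this einsum materializes a score tensor of shape (q, k, s, t), so peak memory scales with the product of the probe and gallery batch sizes and the squared number of spatial positions, independently of the feature dimension d. Below is a minimal sketch of one way to cap that peak by chunking over the gallery (k) dimension; the function, its `reduce_fn` hook, and the chunk size are illustrative assumptions, not part of the TransMatcher code:

```python
import torch
from torch import einsum

def chunked_scores(query, key, score_embed, reduce_fn, chunk_size=16):
    """Sketch: compute 'q t d, k s d -> q k s t' scores in chunks over the
    k (gallery) dimension, so only a (q, chunk, s, t) slice exists at a time.
    `reduce_fn` stands in for whatever the model does with the raw scores
    (e.g. pooling over s and t) and must keep the gallery dimension (dim 1)."""
    gate = score_embed.sigmoid()  # assumed broadcastable to (q, k, s, t)
    outputs = []
    for start in range(0, key.shape[0], chunk_size):
        k_chunk = key[start:start + chunk_size]                   # (chunk, s, d)
        part = einsum('q t d, k s d -> q k s t', query, k_chunk)  # (q, chunk, s, t)
        outputs.append(reduce_fn(part * gate))
    return torch.cat(outputs, dim=1)
```

Whether this helps in practice depends on how the rest of the model consumes the raw scores; the point is only that the largest single allocation then scales with `chunk_size` rather than the full gallery batch.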

Which GPU card did you use? I was using a V100 with 32 GB of memory without issues, but I did not check cards with less memory. You can try lowering the batch size, --num_trans_layers, --dim_feedforward, and --neck.

I used 8 Tesla V100 GPUs with 16 GB each, but even when I set batch_size to 8, it still runs out of memory.

I used only one V100, with a batch size of 64. It seems you should be able to run it with a batch size of 8 on one of your GPUs.

How much memory does your V100 have? Is it 32 GB? I only have 16 GB, and the error says that 9.0 GB is required.

Yes, as I said, 32 GB.

I see. So we are unable to run the code with less memory, even with multiple GPUs and a small batch size. But it is strange that an efficient model cannot be trained even with a batch size of 1. Could you show me where to update the code so it runs with lower memory? Thanks.

I'm not sure. I still find it strange that you cannot run the code with a batch size of 8 on one V100 (16 GB), since I am able to run it with a batch size of 64 on one V100 (32 GB). Did you try a single GPU?

For further suggestions, please see below:

Which GPU card did you use? I was using a V100 with 32 GB of memory without issues, but I did not check cards with less memory. You can try lowering the batch size, --num_trans_layers, --dim_feedforward, and --neck.

I have encountered this issue as well, and it seems you need to make test_batch smaller, since the default test batch size is 256.
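That would explain the number reported above: the (q, k, s, t) score tensor at test time scales with the probe and gallery batch sizes and the squared number of spatial positions. A rough back-of-the-envelope check, assuming 256 probe and 256 gallery images per test batch, a 24 x 8 feature map (192 positions, an assumed size, not read from the repo), and float32:

```python
# Rough size of the (q, k, s, t) score tensor at test time.
# Assumptions: test batch q = k = 256, feature map 24 x 8 -> s = t = 192,
# float32 (4 bytes per element). Illustrative only.
q = k = 256
s = t = 24 * 8
print(q * k * s * t * 4 / 2**30)  # ~9.0 GiB for this single tensor
```

Under those assumptions this one allocation is already about 9 GiB, which lines up with the "9.0 GB is required" message quoted earlier and would not shrink no matter how small the training batch size is set.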

Thanks. Yes, that could be the reason. I thought @haithanhp had encountered this during training.

Thanks for your help. I used a server with larger GPU memory, so the issue is no longer there.