lizekang/ITDD

What is your version of PyTorch?

torch 1.0.0
Is there anything wrong with other versions?

torch 1.0.1
torchtext 0.2.3
I have preprocessed the data successfully, but when I train the model I get this error:
[screenshot of the error]

You can try torchtext 0.4.0.
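
For anyone hitting the same mismatch, a requirements-style pin of the combination reported to work in this thread (torch 1.0.1 from the messages above, torchtext 0.4.0 as suggested here):

```
torch==1.0.1
torchtext==0.4.0
```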

Thanks, that worked! When I train, I set batch_size to 1 and knl_seq_length_trunc to 200, but I still get a RuntimeError: CUDA error: out of memory (Tried to allocate 97.75 MiB (GPU 0; 10.91 GiB total capacity; 1.64 GiB already allocated; 67.38 MiB free; 125.27 MiB cached)). Is this a matter of the code, or of GPU size?

You can set batch_size to 1024 and accum_count to 32 if you use one GPU. It works on a single RTX 2080 Ti with 11 GB of memory.
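
For reference, a minimal sketch of how those keys might look in the training config; the key names appear in this thread, but the exact file layout is an assumption:

```yaml
batch_type: tokens   # batch_size is counted in tokens, not sentences (see below)
batch_size: 1024     # tokens per forward pass
accum_count: 32      # gradient accumulation: 1024 * 32 = 32768 tokens per update
```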

I also get OOM, and setting batch_size to 1024 and accum_count to 32 doesn't help. My GPU is a single RTX 2080 Ti. Is there any other way to avoid it without hurting model quality?

The OOM occurs every time the validation set is read; training itself does not OOM.

Can you try batch_size 512 and accum_count 64?
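
That keeps the effective batch per update unchanged (512 × 64 = 32,768 tokens, same as 1024 × 32) while halving the peak memory of each forward pass; as a sketch, under the same assumed config layout as above:

```yaml
batch_size: 512    # half the tokens per forward pass (lower peak memory)
accum_count: 64    # double the accumulation, so the effective batch is unchanged
```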

When training, I was stuck at this point for a very long time, and I don't know why...
[screenshot of the training log]

What's your config?

batch_size 32
accum_count 16
data: cmu_movie

How about report_every and valid_steps?

100 and 1000, respectively.
All other configuration is the same as the original.

Please don't set batch_size too small. Note that the batch_type in the config file is tokens (not sents).

I have updated the configs and added valid_batch_size: 8. The default valid_batch_size is 32; that's why there was an OOM during validation.
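
A sketch of the validation-related keys under the same assumed layout; valid_batch_size: 8 is quoted verbatim above, and the report_every / valid_steps values are the ones mentioned in this thread:

```yaml
valid_batch_size: 8   # default is 32, which caused the OOM during validation
report_every: 100     # log training stats every 100 steps
valid_steps: 1000     # run validation every 1000 steps
```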

Thanks for the reply. I no longer get the OOM error, but I'm still stuck here with this warning:
[screenshot of the warning]

You can try the newly updated config. The warning can be ignored.

Thanks. It was training very slowly and I didn't see any logs, which made me wonder whether it was stuck... one step takes maybe a few minutes.
My training data has around 80,000 examples. If I set batch_size to 4096, it should take no more than 20 steps to run one epoch. I see the default number of training steps is 20,000, so I'm wondering how many epochs it will take before the model converges.

The batch_type in the config file is tokens (not sents).
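
Since batch_size counts tokens, steps per epoch depend on the average example length, not just the example count. A rough worked estimate, assuming (purely for illustration) an average length of about 50 tokens per example:

$$
\text{steps per epoch} \approx \frac{N_{\text{examples}} \times \bar{\ell}}{\text{batch\_size}} = \frac{80{,}000 \times 50}{4096} \approx 977,
$$

so under that assumption the default 20,000 steps would be roughly 20 epochs, not 1,000.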