lizekang/ITDD

What is your version of PyTorch?

torch 1.0.0
Is there anything wrong with other versions?

torch 1.0.1
torchtext 0.2.3
I have preprocessed the data successfully, but when I train the model I get this error:
[screenshot of the error]

You can try torchtext 0.4.0.
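
For anyone hitting the same mismatch, a requirements-style pin of the combination reported to work in this thread (torch 1.0.1 from the messages above, torchtext 0.4.0 as suggested here):

```
torch==1.0.1
torchtext==0.4.0
```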

Thanks, that worked! When I train, I set batch_size to 1 and knl_seq_length_trunc to 200, but I still get a RuntimeError: CUDA error: out of memory (Tried to allocate 97.75 MiB (GPU 0; 10.91 GiB total capacity; 1.64 GiB already allocated; 67.38 MiB free; 125.27 MiB cached)). Is this a matter of the code, or of GPU size?

You can set batch_size to 1024 and accum_count to 32 if you use one GPU. It works on a single RTX 2080 Ti with 11 GB of memory.
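
For reference, a minimal sketch of how those keys might look in the training config; the key names appear in this thread, but the exact file layout is an assumption:

```yaml
batch_type: tokens   # batch_size is counted in tokens, not sentences (see below)
batch_size: 1024     # tokens per forward pass
accum_count: 32      # gradient accumulation: 1024 * 32 = 32768 tokens per update
```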

I also get OOM, and setting batch_size to 1024 and accum_count to 32 doesn't help. My GPU is a single RTX 2080 Ti. Is there any other way to avoid it without hurting model quality?

The OOM occurs every time the validation set is read; training itself does not OOM.

Can you try batch_size 512 and accum_count 64?
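
That keeps the effective batch per update unchanged (512 × 64 = 32,768 tokens, same as 1024 × 32) while halving the peak memory of each forward pass; as a sketch, under the same assumed config layout as above:

```yaml
batch_size: 512    # half the tokens per forward pass (lower peak memory)
accum_count: 64    # double the accumulation, so the effective batch is unchanged
```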

When training, I was stuck at this point for a very long time, and I don't know why...
[screenshot of the training log]

What's your config?

batch_size 32
accum_count 16
data: cmu_movie

How about report_every and valid_steps?

100 and 1000, respectively.
All other configuration is the same as the original.

Please don't set batch_size too small. Note that the batch_type in the config file is tokens (not sents).

I have updated the configs and added valid_batch_size: 8. The default valid_batch_size is 32; that's why there was an OOM during validation.
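
A sketch of the validation-related keys under the same assumed layout; valid_batch_size: 8 is quoted verbatim above, and the report_every / valid_steps values are the ones mentioned in this thread:

```yaml
valid_batch_size: 8   # default is 32, which caused the OOM during validation
report_every: 100     # log training stats every 100 steps
valid_steps: 1000     # run validation every 1000 steps
```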

Thanks for the reply. I no longer get the OOM error, but I'm still stuck here with this warning:
[screenshot of the warning]

You can try the newly updated config. The warning can be ignored.

Thanks. It was training very slowly and I didn't see any logs, which made me wonder whether it was stuck... one step takes maybe a few minutes.
My training data has around 80,000 examples. If I set batch_size to 4096, it should take no more than 20 steps to run one epoch. I see the default number of training steps is 20,000, so I'm wondering how many epochs it will take before the model converges.

The batch_type in the config file is tokens (not sents).
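
Since batch_size counts tokens, steps per epoch depend on the average example length, not just the example count. A rough worked estimate, assuming (purely for illustration) an average length of about 50 tokens per example:

$$
\text{steps per epoch} \approx \frac{N_{\text{examples}} \times \bar{\ell}}{\text{batch\_size}} = \frac{80{,}000 \times 50}{4096} \approx 977,
$$

so under that assumption the default 20,000 steps would be roughly 20 epochs, not 1,000.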