google-research/text-to-text-transfer-transformer

Training T5-3B on a server with 8× A100 (40 GB) GPUs fails with an OOM "resources exhausted" error

flyingwaters opened this issue · 2 comments

Why? I used a mesh shape of model:8,batch:1 with a batch size of 8 and sequence lengths {inputs: 1024, targets: 512}, and I still hit an OOM.
But I have seen PyTorch train T5-3B on 8× V100 (32 GB). Is this because Mesh TensorFlow is less memory-efficient than DeepSpeed?
I would like to understand the cause. I want to use this T5 implementation because TensorFlow is easy to deploy. Why can DeepSpeed train larger models than Mesh TensorFlow?
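For reference, here is a minimal sketch of roughly the configuration described above, expressed with the library's `t5.models.MtfModel` wrapper. The mesh shape, GPU device list, paths, and task name are assumptions about how the 8× A100 setup might be mapped onto Mesh TensorFlow, not a verified or tested recipe; the exact constructor arguments may differ between t5 versions.

```python
# Sketch only: a possible 8-way model-parallel setup for T5-3B on 8 local GPUs
# using the t5 library's Mesh TensorFlow wrapper. Values below (mesh shape,
# device list, paths, task name) are assumptions, not a tested configuration.
import t5

model = t5.models.MtfModel(
    model_dir="/path/to/model_dir",        # hypothetical output directory
    tpu=None,                              # no TPU: run on local devices
    mesh_shape="model:8,batch:1",          # split the model across all 8 GPUs
    mesh_devices=["gpu:%d" % i for i in range(8)],
    batch_size=8,
    sequence_length={"inputs": 1024, "targets": 512},
)

# model.train("my_task_name", steps=10000)  # hypothetical task/mixture name
```

With this layout all eight GPUs hold shards of the parameters, so per-device memory is dominated by activations; shorter sequence lengths or a smaller batch size are the usual levers if it still OOMs.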

@flyingwaters Hello! Could I ask how you managed to get this T5 model training at all? I have been wanting to train a pretrained model myself, even a small one with few parameters and little data, but as far as I can tell this project always uses Cloud TPUs. Could I ask you for some pointers?

Hello, sorry to bother you. Could I ask how this is supposed to be used? It seems completely different from the version on Hugging Face.