google-research/text-to-text-transfer-transformer

Training T5-3B on a server with 8× A100 (40 GB) GPUs fails with an OOM "resources exhausted" error

flyingwaters opened this issue · 2 comments

Why? I used a mesh shape of model:8,batch:1 with a batch size of 8 and sequence lengths {inputs: 1024, targets: 512}, and I still hit an OOM.
But I have seen PyTorch train T5-3B on 8× V100 (32 GB). Is this because Mesh TensorFlow is less memory-efficient than DeepSpeed?
I would like to understand the cause. I want to use this T5 implementation because TensorFlow is easy to deploy. Why can DeepSpeed train larger models than Mesh TensorFlow?
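For reference, here is a minimal sketch of roughly the configuration described above, expressed with the library's `t5.models.MtfModel` wrapper. The mesh shape, GPU device list, paths, and task name are assumptions about how the 8× A100 setup might be mapped onto Mesh TensorFlow, not a verified or tested recipe; the exact constructor arguments may differ between t5 versions.

```python
# Sketch only: a possible 8-way model-parallel setup for T5-3B on 8 local GPUs
# using the t5 library's Mesh TensorFlow wrapper. Values below (mesh shape,
# device list, paths, task name) are assumptions, not a tested configuration.
import t5

model = t5.models.MtfModel(
    model_dir="/path/to/model_dir",        # hypothetical output directory
    tpu=None,                              # no TPU: run on local devices
    mesh_shape="model:8,batch:1",          # split the model across all 8 GPUs
    mesh_devices=["gpu:%d" % i for i in range(8)],
    batch_size=8,
    sequence_length={"inputs": 1024, "targets": 512},
)

# model.train("my_task_name", steps=10000)  # hypothetical task/mixture name
```

With this layout all eight GPUs hold shards of the parameters, so per-device memory is dominated by activations; shorter sequence lengths or a smaller batch size are the usual levers if it still OOMs.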

@flyingwaters Hello! Could I ask how you managed to get this T5 model training at all? I have been wanting to train a pretrained model myself, even a small one with few parameters and little data, but as far as I can tell this project always uses Cloud TPUs. Could I ask you for some pointers?

Hello, sorry to bother you. Could I ask how this is supposed to be used? It seems completely different from the version on Hugging Face.