Any benchmark results?
What GPUs are you using? What's the maximum batch_size for 11/12/16/32 GB GPUs (e.g.: GTX 1080 Ti, Titan X, V100) and the corresponding performance? Thanks.
- What GPUs are you using?
  With a batch_size of 24 per GPU and TensorFlow 1.13.1, I successfully trained a classifier based on bert-large-uncased on 2 x Tesla P40 (about 95% of GPU memory used) for the QQP dataset, and the predictions look fine.
- What's the maximum batch_size for 11/12/16/32 GB GPUs (e.g.: GTX 1080 Ti, Titan X, V100)?
  tf.distribute.MirroredStrategy is used for multi-GPU training in this project; it mirrors variables across multiple devices and machines. I think the maximum batch_size for each GPU is almost the same as in the original BERT repo, so the global batch_size depends on how many GPUs there are.
- Corresponding performance
  With the same hyperparameters, training speed scales almost linearly with the number of GPUs.
Thanks. Does 24 indicate the global batch size? i.e.: 12 samples of length 128 on each Tesla P40 card (24 GB memory)?
24 is the batch size for each GPU. Global batch size is 24 * 2 = 48.
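A minimal sketch of the arithmetic above, assuming each GPU (replica) processes its own batch in every step. The function names and the dataset size of 100000 examples are hypothetical and only for illustration; they are not taken from this repo.

```python
import math


def global_batch_size(per_gpu_batch_size, num_gpus):
    """Number of samples consumed per training step across all GPUs."""
    return per_gpu_batch_size * num_gpus


def steps_per_epoch(num_examples, per_gpu_batch_size, num_gpus):
    """Steps needed to see every example once (last step may be partial)."""
    return math.ceil(num_examples / global_batch_size(per_gpu_batch_size, num_gpus))


# The setup described in this thread: 24 samples per GPU on 2 GPUs.
print(global_batch_size(24, 2))

# Hypothetical dataset of 100000 examples: doubling the GPUs roughly
# halves the number of steps per epoch.
print(steps_per_epoch(100000, 24, 1))
print(steps_per_epoch(100000, 24, 2))
```

Fewer steps per epoch only translates into faster training if the time per step stays roughly constant as GPUs are added.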
Oops, this is awesome. Your batch size is twice as large as that reported in BERT readme (after scaling according to memory size). How did this happen? By using fp16 precision?
No fp16, but I'm planning to support it.
total memory usage = model memory usage + batch_size × memory per sample. So memory usage is not proportional to batch_size: the model itself takes a large fixed share, because BERT is so large. On the other hand, maybe there are some performance optimizations between TensorFlow 1.11 and 1.13.1.
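To make the formula above concrete, here is a small sketch that solves it for the largest batch_size that fits. The function name and all memory figures (8000 MB fixed cost, 500 MB per sample) are made-up assumptions for illustration, not measurements of BERT.

```python
def max_batch_size(total_mem_mb, fixed_mem_mb, per_sample_mem_mb):
    """Largest batch_size with fixed + batch_size * per_sample <= total.

    Returns 0 when the fixed (model) cost alone does not fit.
    """
    available = total_mem_mb - fixed_mem_mb
    if available <= 0:
        return 0
    return available // per_sample_mem_mb


# Hypothetical numbers: 8000 MB fixed cost, 500 MB per sample.
FIXED_MB = 8000
PER_SAMPLE_MB = 500

for total_mb in (24000, 12000, 11000, 8000):
    print(total_mb, max_batch_size(total_mb, FIXED_MB, PER_SAMPLE_MB))
```

Under these assumed numbers, halving the memory from 24000 MB to 12000 MB cuts the maximum batch size from 32 to 8 rather than to 16, because the fixed cost does not shrink with the card. This is why scaling a batch size "according to memory size" can be misleading for a large model.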
Thanks for the clarification. A nice project and I'm playing with it.
Feel free to open a new issue if you encounter any problems while you are playing.
fp16 is now available on the fp16 branch, feel free to use it. I will update the README and merge it into master later.
@soloice
Thanks for your quick development. Currently I don't have access to a Volta architecture GPU, so I guess the fp16 performance would be much slower than its fp32 counterpart? e.g.: for GTX 1080 Ti and Tesla P40 (both of them are of Pascal architecture), the fp16 performance is 1:64 (1/64 of fp32 FLOPS).
I'm facing some strange issues with BERT-Large on my 11 GB GTX 1080 Ti.
Code | model size | maximum_sequence_length | max_batch_size_per_GPU | remark |
---|---|---|---|---|
original bert | base | 128 | 36 | Pretty good. Even larger than 32 on a 12 GB Titan X as mentioned in the BERT repo |
original bert | large | 64 | 2 | Very poor. Titan X could hold 12 such samples! |
this repo | large | 64 | 0 | Oops. This is a disaster. |
I guess the poor performance is not due to this repo but the original BERT repo. Have you encountered any strange issue with BERT-Large like this?
Thanks for your quick development. Currently I don't have access to a Volta architecture GPU, so I guess the fp16 performance would be much slower than its fp32 counterpart? e.g.: for GTX 1080 Ti and Tesla P40 (both of them are of Pascal architecture), the fp16 performance is 1:64 (1/64 of fp32 FLOPS).
I only did a simple test for FP16, and the training speed did not drop significantly (probably I made a mistake). I will do more detailed and comprehensive testing later.
I guess the poor performance is not due to this repo but the original BERT repo. Have you encountered any strange issue with BERT-Large like this?
That's weird. Did OOM occur when training the LARGE model with this repo? I don't recommend trying bert-large-uncased on GPUs with less than 16 GB of memory, because multi-GPU training brings very little benefit in this case. Training your classifier with the bert-base-uncased model may be a better option if you can tolerate a 1% reduction in eval_accuracy.
REF: google-research/bert#4 (comment)
Did OOM occur when training LARGE model with this repo?
Yes, even a batch size of 1 leads to OOM. I'd better play with BERT-Base. Thanks for your quick reply.