thuhcsi/SECap

GPU requirements for training and inference

Closed this issue · 1 comments

Hi, it would be very helpful if GPU-related info could be added to the documentation, so that we know whether we have enough VRAM for training or inference. Thanks!

We trained and ran inference on 40G V100s, using 16-mixed precision for training and float32 for inference. During training, LLaMA was not loaded with float32 parameters, which allowed us to reach a batch size of 16. For inference, we process one audio clip at a time; running the entire test set of 600 sentences took 10 minutes on eight 40G V100s. If you're working with a different GPU, you may want to experiment with the batch size and try different inference strategies.
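Since the underlying question is whether a given card has enough VRAM, a quick back-of-envelope check on weight memory can help before launching a run. This is only a sketch: the 7B parameter count below is an assumption for illustration (the repo may use a different LLaMA size), and real usage adds activations, gradients, optimizer state, and the KV cache on top of the weights.

```python
# Rough VRAM estimate for model weights alone (illustrative only;
# ignores activations, gradients, optimizer state, and KV cache).
def weights_gb(n_params: int, bytes_per_param: int) -> float:
    """Gigabytes needed just to hold the parameters."""
    return n_params * bytes_per_param / 1e9

# Assuming a 7B-parameter LLaMA for illustration:
print(weights_gb(7_000_000_000, 2))  # float16: 14.0 GB
print(weights_gb(7_000_000_000, 4))  # float32: 28.0 GB
```

This is why loading LLaMA in half precision matters on a 40G card: the float32 weights alone would consume most of the VRAM before any training state is allocated.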