DachengLi1/LongChat

About the learning rate

Opened this issue · 1 comment

From the script provided, I think LongChat is full-parameter SFT rather than LoRA, but the effective total batch size is just 1 (batch_size * gradient_accum * num_gpus).

But the original Vicuna FastChat training is also full-parameter SFT and uses an effective batch size of 128, so why is the learning rate different? Which setting should be adopted if I only have 2x 80GB GPUs?
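
For reference, here is a minimal sketch of the effective-batch-size arithmetic and the common linear LR-scaling heuristic behind this question. The concrete per-device batch sizes, GPU counts, and the 2e-5 reference learning rate are illustrative assumptions, not values read from the repo:

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int) -> int:
    """Total number of examples contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus

# Hypothetical numbers: a Vicuna-style run vs. a minimal single-GPU run.
vicuna_bs = effective_batch_size(per_device_batch_size=2,
                                 gradient_accumulation_steps=16,
                                 num_gpus=4)    # -> 128
small_bs = effective_batch_size(per_device_batch_size=1,
                                gradient_accumulation_steps=1,
                                num_gpus=1)     # -> 1

# Linear-scaling rule of thumb (not something the repo prescribes):
# lr_new = lr_ref * (bs_new / bs_ref)
lr_ref = 2e-5                                   # assumed reference LR
lr_scaled = lr_ref * small_bs / vicuna_bs
print(vicuna_bs, small_bs, lr_scaled)
```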

@lucasjinreal I think either is fine - you can go with the largest batch size your GPUs support, either with or without gradient accumulation.
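
A minimal sketch of doing that on 2x 80GB GPUs via gradient accumulation, assuming HF Trainer-style arguments (as FastChat-style training scripts use); the per-device batch size, target effective batch size, and 2e-5 learning rate are assumptions, not repo defaults:

```python
from transformers import TrainingArguments

num_gpus = 2
per_device_bs = 4            # whatever fits in memory for your sequence length
target_effective_bs = 128    # match the Vicuna-style total if you keep its LR

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=per_device_bs,
    # Accumulate gradients until the effective batch size reaches the target.
    gradient_accumulation_steps=target_effective_bs // (per_device_bs * num_gpus),
    learning_rate=2e-5,      # keep the reference LR if the effective batch matches
)
print(args.gradient_accumulation_steps)  # 16 in this example
```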