DachengLi1/LongChat

About the learning rate

Opened this issue · 1 comment

From the script provided, I think LongChat is full-parameter SFT rather than LoRA, but the effective total batch size is just 1 (batch_size * gradient_accum * num_gpus).

But the original Vicuna FastChat training is also full-parameter SFT and uses an effective batch size of 128, so why is the learning rate different? Which setting should be adopted if I only have 2x 80GB GPUs?
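
For reference, here is a minimal sketch of the effective-batch-size arithmetic and the common linear LR-scaling heuristic behind this question. The concrete per-device batch sizes, GPU counts, and the 2e-5 reference learning rate are illustrative assumptions, not values read from the repo:

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int) -> int:
    """Total number of examples contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus

# Hypothetical numbers: a Vicuna-style run vs. a minimal single-GPU run.
vicuna_bs = effective_batch_size(per_device_batch_size=2,
                                 gradient_accumulation_steps=16,
                                 num_gpus=4)    # -> 128
small_bs = effective_batch_size(per_device_batch_size=1,
                                gradient_accumulation_steps=1,
                                num_gpus=1)     # -> 1

# Linear-scaling rule of thumb (not something the repo prescribes):
# lr_new = lr_ref * (bs_new / bs_ref)
lr_ref = 2e-5                                   # assumed reference LR
lr_scaled = lr_ref * small_bs / vicuna_bs
print(vicuna_bs, small_bs, lr_scaled)
```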

@lucasjinreal I think either is fine - you can go with the largest batch size your GPUs support, either with or without gradient accumulation.
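
A minimal sketch of doing that on 2x 80GB GPUs via gradient accumulation, assuming HF Trainer-style arguments (as FastChat-style training scripts use); the per-device batch size, target effective batch size, and 2e-5 learning rate are assumptions, not repo defaults:

```python
from transformers import TrainingArguments

num_gpus = 2
per_device_bs = 4            # whatever fits in memory for your sequence length
target_effective_bs = 128    # match the Vicuna-style total if you keep its LR

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=per_device_bs,
    # Accumulate gradients until the effective batch size reaches the target.
    gradient_accumulation_steps=target_effective_bs // (per_device_bs * num_gpus),
    learning_rate=2e-5,      # keep the reference LR if the effective batch matches
)
print(args.gradient_accumulation_steps)  # 16 in this example
```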