Discussing LoraHub: Exploration, Implementation, and Potential Improvements
ChuxiJ opened this issue · 1 comment
ChuxiJ commented
LoraHub is a really great idea, similar to a few ideas I thought of yesterday.
- Unlike MoE, which trains many domain experts, it trains multiple LoRA modules on top of a single large base model.
- During inference, a router mechanism selects which LoRA weights to combine (see the sketch after this list), so only one base model is needed for deployment. And, as with chained/tree-style inference, running inference several times can achieve better performance.
- The LoRA training setup (parameters and data) can be scaled up more aggressively. For example, take a 65B base model and train 8 separate 1B LoRA modules, each on high-quality data from a different domain. Has anyone compared whether this performs better or worse than MoE?
- It is not yet very clear from the paper which base models were chosen, what the training hyperparameters were, how the LoRA modules were merged for inference, and many other details. I am waiting for the code to be published to learn more.
- How to design the router mechanism cleverly is also worth researching and discussing. Are there any related materials you would recommend?
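To make the combination idea concrete, here is a rough sketch of how I imagine the merging could work: each LoRA module contributes a low-rank update B·A to the frozen base weight, scaled by a per-module coefficient, so only one copy of the base model is needed at deployment. The function name, shapes, and numbers below are just my illustration, not the paper's actual implementation.

```python
import numpy as np

def merge_loras(base_weight, lora_pairs, mix_weights):
    """Fold several LoRA modules into a single base weight matrix.

    base_weight : (d_out, d_in) frozen base-model weight.
    lora_pairs  : list of (B, A) low-rank factors, B: (d_out, r), A: (r, d_in).
    mix_weights : one scalar coefficient per LoRA module.
    """
    merged = base_weight.copy()
    for (B, A), w in zip(lora_pairs, mix_weights):
        merged += w * (B @ A)  # each module adds a weighted low-rank update
    return merged

# Toy example: a 16x16 layer with two rank-4 LoRA modules.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
loras = [(rng.normal(size=(16, 4)), rng.normal(size=(4, 16))) for _ in range(2)]
W_merged = merge_loras(W, loras, mix_weights=[0.7, 0.3])
```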
SivilTaram commented
Thanks for your questions @ChuxiJ, I'll answer them here:
- Yes, this is exactly what LoraHub aims to implement. We also discuss the relationship between LoraHub and MoE in the related work section.
- I'm not sure the scalar weights can really serve as a router, since the router mechanism in MoE involves many learned router weights; here each module only gets a single mixing coefficient (see the sketch at the end of this list).
- 😂 It's a little expensive for our lab to train that kind of model, but it's worth trying.
- We clearly state that Flan-t5-large is the base model. You can check out the first section and the experimental results for details; all of these details are already in the paper.
- The gradient-free method used in LoraHub may be a great solution (a rough sketch follows below).
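To make the last point a bit more concrete, here is a minimal sketch of what gradient-free search over the per-module mixing weights could look like. This is only an illustration: the plain random search, the weight range, and the toy loss function are placeholders, not the actual LoraHub implementation; in practice any black-box optimizer could be plugged in, and each candidate weight vector would be scored on held-out examples of the target task.

```python
import numpy as np

def random_search_weights(lora_pairs, eval_loss, n_trials=200, seed=0):
    """Gradient-free search for the per-module mixing weights.

    lora_pairs : list of (B, A) LoRA factors to combine (only the count is used here).
    eval_loss  : callable mapping a weight vector to a scalar validation loss
                 (placeholder for a real few-shot evaluation).
    Returns the best weight vector found by plain random search; a real
    system could swap in any black-box optimizer instead.
    """
    rng = np.random.default_rng(seed)
    k = len(lora_pairs)
    best_w, best_loss = np.zeros(k), float("inf")
    for _ in range(n_trials):
        w = rng.uniform(-1.5, 1.5, size=k)   # allow negative weights as well
        loss = eval_loss(w)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss

# Toy usage: pretend the "validation loss" prefers weights close to [0.8, 0.2, 0.0].
target = np.array([0.8, 0.2, 0.0])
fake_loss = lambda w: float(np.sum((w - target) ** 2))
w_star, l_star = random_search_weights([None] * 3, fake_loss)
print(w_star.round(2), round(l_star, 4))
```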