databricks/megablocks

About the Multi-node Script

XingyuXie opened this issue · 4 comments

Thanks for the excellent work.

I'm trying to fine-tune the Mixtral 8x7B model based on this codebase. It would be very convenient if you could share a launch script for the multi-node case. By the way, how can I load the 8x7B weights? There doesn't seem to be a weight conversion script.

Hi! MegaBlocks only defines the MoE layers so you'll have to use another framework for the remainder of the model. We use Megatron-LM for our experiments but I am not sure they support Mixtral 8x7B yet. You could also try HuggingFace, which I suspect would be easier than Megatron-LM.
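For what it's worth, the HuggingFace route can be as simple as the sketch below, assuming the transformers library (4.36 or later, which added Mixtral support) and the public mistralai/Mixtral-8x7B-v0.1 checkpoint on the Hub; the prompt at the end is just a smoke test:

```python
# Minimal sketch: loading Mixtral 8x7B weights through HuggingFace transformers.
# Assumes transformers >= 4.36 (which added Mixtral support) plus accelerate
# installed for device_map="auto" sharding; the checkpoint id is the public Hub one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~47B total parameters, so half precision helps
    device_map="auto",           # shard the experts across available GPUs
)

# Quick check that the weights actually loaded.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```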

Thanks a lot for the quick response. I prefer the Megatron framework and am more familiar with it.

Would it be possible to share an example of multi-node training with Megatron-LM for a 7B MoE model? It's fine even if some of the important hyperparameters are omitted.

The multi-node example is important for our group. We will definitely cite and acknowledge you in our research.

We have some training scripts under exp/dmoe that you should be able to adapt pretty easily! Just change the distributed arguments to set up for the number of nodes you want to run on.
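For reference, a minimal sketch of the multi-node initialization such scripts ultimately rely on, assuming a torchrun-style launcher that exports the standard PyTorch distributed environment variables (RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT); the node and GPU counts in the comments are hypothetical:

```python
# Minimal multi-node setup sketch: each process reads its identity from the
# environment variables set by the launcher and joins one NCCL process group.
import os
import torch
import torch.distributed as dist

def init_multi_node():
    # With, say, 2 nodes x 8 GPUs, the launcher starts 16 processes:
    # WORLD_SIZE=16 and RANK runs from 0 to 15 across both nodes.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    torch.cuda.set_device(local_rank)  # one GPU per process
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    return rank, world_size, local_rank

if __name__ == "__main__":
    rank, world_size, local_rank = init_multi_node()
    if rank == 0:
        print(f"initialized {world_size} processes across all nodes")
    dist.destroy_process_group()
```

In practice the distributed arguments live on the launcher command line, e.g. running torchrun --nnodes=2 --nproc_per_node=8 --node_rank=<0 or 1> --master_addr=<node 0's IP> --master_port=29500 on every node, so adapting the exp/dmoe scripts for more nodes mostly means updating those flags.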

I will try! Thanks a lot!