/llama3-8x8b-MoE

Copy the MLP of llama3 eight times to serve as 8 experts, create a router with random initialization, and add a load-balancing loss to construct an 8x8B MoE model based on llama3.
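Below is a minimal sketch of the construction described above, assuming a generic decoder layer whose feed-forward sub-module is `mlp`. The class name `MoELayer`, the `top_k` routing, and the Switch-style auxiliary loss are illustrative assumptions, not this repo's actual API.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Upcycle a dense MLP into an 8-expert MoE layer (illustrative sketch)."""

    def __init__(self, mlp: nn.Module, hidden_size: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # 8 experts, each a copy of the original dense llama3 MLP
        self.experts = nn.ModuleList(copy.deepcopy(mlp) for _ in range(num_experts))
        # Router with random initialization (nn.Linear weights are random by default)
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k
        self.num_experts = num_experts
        self.aux_loss = torch.tensor(0.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden)
        b, s, h = x.shape
        flat = x.reshape(-1, h)                        # (tokens, hidden)
        logits = self.router(flat)                     # (tokens, experts)
        probs = logits.softmax(dim=-1)
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)  # renormalize gates

        out = torch.zeros_like(flat)
        for e in range(self.num_experts):
            token_idx, slot = (topk_i == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            expert_out = self.experts[e](flat[token_idx])
            out.index_add_(0, token_idx,
                           expert_out * topk_p[token_idx, slot].unsqueeze(-1))

        # Load-balancing loss (Switch Transformer style):
        # num_experts * sum_e (fraction of tokens dispatched to e) * (mean router prob for e)
        with torch.no_grad():
            dispatch = F.one_hot(topk_i[:, 0], self.num_experts).float().mean(0)
        self.aux_loss = self.num_experts * torch.sum(dispatch * probs.mean(0))
        return out.reshape(b, s, h)
```

In training, the auxiliary losses collected from each layer would be summed and added to the language-modeling loss with a small coefficient (e.g. `loss = lm_loss + 0.01 * sum(layer.aux_loss for layer in moe_layers)`), which pushes the router toward an even token distribution across the 8 experts.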

Primary Language: Python
