A simple, tested PyTorch implementation of Llama 3 without the fairscale dependency.
It is not only simpler but also roughly 25% faster than the original implementation from Meta: https://github.com/meta-llama/llama3
If you want to understand the transformer architecture, I recommend reading my vanilla transformer implementation first, since I reuse some of that code here.
I will soon add an explanation of RoFormer (rotary position embeddings).
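In the meantime, here is a minimal sketch of the rotary position embeddings (RoPE) from the RoFormer paper, which Llama 3 applies to the query and key vectors; the function names mirror common conventions and are illustrative, not taken from this repo:

```python
# Minimal RoPE sketch: rotate each adjacent pair of head dimensions by a
# position-dependent angle, implemented via complex multiplication.
import torch

def precompute_freqs_cis(head_dim: int, seq_len: int, theta: float = 10000.0) -> torch.Tensor:
    # One rotation frequency per pair of dimensions.
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, freqs)                      # (seq_len, head_dim // 2)
    return torch.polar(torch.ones_like(freqs), freqs)  # unit complex numbers e^{i*m*theta_j}

def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, n_heads, head_dim). View pairs of dims as complex numbers,
    # multiply by the precomputed unit phasors, and convert back.
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    freqs_cis = freqs_cis[None, :, None, :]            # broadcast over batch and heads
    return torch.view_as_real(x_complex * freqs_cis).flatten(-2).type_as(x)

q = torch.randn(1, 8, 4, 16)                           # (batch, seq, heads, head_dim)
freqs_cis = precompute_freqs_cis(head_dim=16, seq_len=8)
q_rot = apply_rotary_emb(q, freqs_cis)
print(q_rot.shape)  # torch.Size([1, 8, 4, 16])
```

Because each pair of dimensions is multiplied by a unit-magnitude complex number, the rotation changes directions but preserves vector norms, which is why RoPE encodes position without distorting attention score magnitudes.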