Integration of llama3.1 fixes
Feelas opened this issue · 17 comments
Quick question: when is an update to an optimum-habana version which includes huggingface/optimum-habana#1154 (fix for rope_scaling @ llama3.1 family) planned?
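For context, the Llama 3.1 configs carry a `rope_scaling` entry of type `llama3`. The sketch below shows how that frequency scaling is commonly computed; it is not the code from huggingface/optimum-habana#1154. The function name is hypothetical, and the default values are assumptions taken from the published Llama 3.1 8B config, so check them against the config of the model you actually load.

```python
import math


def llama3_scaled_inv_freq(
    head_dim=128,                 # assumed: per-head dimension
    base=500000.0,                # assumed: rope_theta
    factor=8.0,                   # assumed: rope_scaling["factor"]
    low_freq_factor=1.0,          # assumed: rope_scaling["low_freq_factor"]
    high_freq_factor=4.0,         # assumed: rope_scaling["high_freq_factor"]
    original_max_position_embeddings=8192,
):
    """Return the list of RoPE inverse frequencies after llama3-style scaling."""
    low_freq_wavelen = original_max_position_embeddings / low_freq_factor
    high_freq_wavelen = original_max_position_embeddings / high_freq_factor

    scaled = []
    for i in range(0, head_dim, 2):
        inv_freq = 1.0 / (base ** (i / head_dim))
        wavelen = 2 * math.pi / inv_freq
        if wavelen < high_freq_wavelen:
            # High-frequency components are left untouched.
            scaled.append(inv_freq)
        elif wavelen > low_freq_wavelen:
            # Low-frequency components are divided by the scaling factor.
            scaled.append(inv_freq / factor)
        else:
            # In between, interpolate smoothly between scaled and unscaled.
            smooth = (original_max_position_embeddings / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1 - smooth) * inv_freq / factor + smooth * inv_freq)
    return scaled


if __name__ == "__main__":
    freqs = llama3_scaled_inv_freq()
    print(len(freqs), freqs[0], freqs[-1])
```

A library version that only understands the older two-key `rope_scaling` format (`type` and `factor`) cannot interpret these extra keys, which is why Llama 3.1 needs updated dependencies.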
I will look into this today as well. See also HabanaAI/vllm-fork#140.
I staged some changes in my local repo. Once the PR for optimum is finished, I will update my fork and open a PR to update the dependencies.
I created a fork and was able to get Llama 3.1 8B Instruct working. It reports that some of the token IDs are wrong, but inference appears to work correctly; see e.g.
#199 (comment)
https://github.com/endomorphosis/tgi-gaudi/tree/endomorphosis/llama3.1_tokenizer
However, for the time being I have not yet gotten Llama 3.1 405B fp8 working.
Hi,
I'm trying to run llama3.1_8b with your repo https://github.com/endomorphosis/tgi-gaudi/tree/endomorphosis/llama3.1_tokenizer, but I ran into issues when building the Docker image: the packages seem to conflict. Could you please take a look? Thanks a lot!
docker build -t tgi_gaudi_llama3.1 .
This branch is for debugging. There was a push today (huggingface/optimum-habana#1163 (comment)), so I will build a new Docker container based on the new push.
Thanks a lot! May I ask when the new TGI Docker container will be ready? I would like to try that one directly.
I have fixed the dependencies and built the docker container
Great! Where can I find the ready Docker container? Is there a link on Docker Hub? Thanks a lot!
I just pushed it to endomorphosis/tgi_gaudi, as per your request.
Note:
There is not yet a formal release of huggingface/optimum or huggingface/optimum_habana, so it installs the Python dependencies from git.
I have not yet fixed the quantization bug in huggingface/optimum_habana (a JSON configuration key mismatch). I have also not validated whether I can quantize Llama 3.1 405B on a single node using parameter offloading, and I do not have multiple Gaudi machines to quantize Llama 3.1 405B for Habana. For now, the Llama 3.1 405B fp8 Hugging Face repository will load its weights as bf16.
Please ask the OPEA team whether they can assist me with the quantization effort, so that I can then try to add speculative decoding with Llama 3.1 8B as the draft model.
Thanks a lot for the Docker container. I will download it and check.
For the quantization, the OPEA team members responsible for this part are @changwangss @thuang6 @kevinintel; maybe you can consult them.
I haven't tested it in a while. I gave up on trying to get Llama 405B running on a single node because of the dependency problems that come with using any method of quantization, but I assume that any half-precision model should work.
@regisss do you think that having #222 integrated should also fix this one? I don't have time to test it now, but it would be good to close this if it should work now. Thanks.
I think it should work but I have not tried it yet. @tthakkal Have you already tried to run Llama 3.1?
We tested Llama3.1-8B and Llama3.1-70B bf16 and fp8
Llama3.1-8B on 1 card
https://github.com/huggingface/tgi-gaudi?tab=readme-ov-file#llama31-8b-on-1-card
https://github.com/huggingface/tgi-gaudi?tab=readme-ov-file#llama31-8b-on-1-card-1
Llama3.1-70B on 8 cards
https://github.com/huggingface/tgi-gaudi?tab=readme-ov-file#llama31-70b-8-cards
https://github.com/huggingface/tgi-gaudi?tab=readme-ov-file#llama31-70b-on-8-cards
You shouldn't need 8 cards; two cards are sufficient.
That could work; I just haven't tested it.
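As a rough check on the two-card suggestion, here is a back-of-the-envelope estimate of how many cards are needed just to hold the weights. It is a sketch under stated assumptions, not a sizing tool: the helper names are hypothetical, 96 GB of device memory per card is an assumption (it matches Gaudi2; the thread does not say which generation is used), the parameter counts are the nominal model sizes, and KV cache, activations, and runtime overhead are ignored, so real deployments need headroom beyond this lower bound.

```python
import math


def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_cards_for_weights(num_params, bytes_per_param, card_memory_gb=96):
    """Lower bound on cards needed to hold the weights.

    card_memory_gb=96 is an assumption (Gaudi2-sized device memory).
    KV cache, activations, and runtime overhead are NOT included.
    """
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / card_memory_gb)


if __name__ == "__main__":
    # bf16 uses 2 bytes per parameter, fp8 uses 1 byte per parameter.
    for name, params in [("Llama3.1-8B", 8e9), ("Llama3.1-70B", 70e9), ("Llama3.1-405B", 405e9)]:
        for dtype, width in [("bf16", 2), ("fp8", 1)]:
            print(name, dtype, weight_memory_gb(params, width), min_cards_for_weights(params, width))
```

Under these assumptions the 70B bf16 weights fit on two cards, while the 405B bf16 weights alone exceed the combined memory of eight cards, which is consistent with the interest in fp8 quantization for the 405B model earlier in the thread.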