huggingface/tgi-gaudi

Integration of llama3.1 fixes

Feelas opened this issue · 17 comments

Quick question: when is an update to an optimum-habana version that includes huggingface/optimum-habana#1154 (the rope_scaling fix for the Llama 3.1 family) planned?
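For context, the Llama 3.1 checkpoints ship a new-style rope_scaling block in config.json that older transformers/optimum-habana releases reject at load time. A minimal sketch of the block and a quick parse check, assuming the public 8B Instruct repo:

```bash
# Llama 3.1's config.json carries roughly this (values from the public checkpoints):
#   "rope_scaling": {
#     "factor": 8.0,
#     "low_freq_factor": 1.0,
#     "high_freq_factor": 4.0,
#     "original_max_position_embeddings": 8192,
#     "rope_type": "llama3"
#   }
# An unpatched stack fails here with a rope_scaling validation error;
# a patched one should just print the dict:
python -c "from transformers import AutoConfig; \
print(AutoConfig.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct').rope_scaling)"
```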

I will look into this today as well; see also HabanaAI/vllm-fork#140.

I staged some changes on my local repo; once the optimum PR is merged, I will update my fork and open a PR to update the dependencies.

I created a fork and was able to get Llama 3.1 8B Instruct working. It reports that some of the token IDs are wrong, but inference appears to work correctly, see e.g.

#199 (comment)
https://github.com/endomorphosis/tgi-gaudi/tree/endomorphosis/llama3.1_tokenizer

However, for the time being I have not yet gotten Llama 3.1 405B FP8 working.
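As a sanity check that generation is healthy despite the token-ID warnings, the server can be queried directly; a minimal sketch, assuming the default TGI port mapping (8080 on the host):

```bash
# TGI's /generate endpoint returns the generated text as JSON:
curl -s http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is Gaudi2?", "parameters": {"max_new_tokens": 32}}'
```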

Hi,
I'm trying to run Llama 3.1 8B with your repo https://github.com/endomorphosis/tgi-gaudi/tree/endomorphosis/llama3.1_tokenizer, but I ran into some issues when building the Docker image; it seems the packages conflict. Could you please take a look? Thanks a lot!

docker build -t tgi_gaudi_llama3.1 .
[screenshot of the docker build error showing the package conflict]
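Since the screenshot isn't reproduced here, one way to surface the actual conflict is to rebuild without cached layers, or to run the pip resolution step directly; a sketch, assuming the usual TGI layout with the server requirements under server/:

```bash
# Rebuild from scratch so the full resolver error is printed:
docker build --no-cache -t tgi_gaudi_llama3.1 .

# Or resolve the Python dependencies outside Docker to read the conflict report:
pip install -r server/requirements.txt
```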

This branch is for debugging. There was a push today (huggingface/optimum-habana#1163 (comment)); I will make a new Docker container based on the new push.

Thanks a lot! Can I ask when the new TGI Docker container will be ready? I'd like to try that one directly.

I have fixed the dependencies and built the Docker container.

Great!! Where can I find the ready Docker container? Is there a link on Docker Hub? Thanks a lot!

I just pushed it to endomorphosis/tgi_gaudi as per your request
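Assuming the default tag, pulling it should look like:

```bash
docker pull endomorphosis/tgi_gaudi:latest
```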

Note:
There is not yet a formal release of huggingface/optimum and huggingface/optimum-habana, so the Python dependencies are installed from git.
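In practice that means the two packages are pulled straight from GitHub rather than PyPI, roughly along these lines (the exact commits pinned in the Dockerfile may differ):

```bash
pip install git+https://github.com/huggingface/optimum.git
pip install git+https://github.com/huggingface/optimum-habana.git
```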

I have not yet fixed the quantization bug in huggingface/optimum-habana (a JSON configuration key mismatch). I have not yet validated whether I can quantize Llama 3.1 405B on a single node using parameter offloading, and I do not have multiple Gaudi machines to quantize Llama 405B for Habana. For now, the Llama 3.1 405B FP8 Hugging Face repository will load its weights as BF16.
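For reference, FP8 serving in tgi-gaudi is driven by a quantization config passed through the environment; a rough sketch of the shape of that step (config path and image tag assumed, and this is exactly the part that is still unverified for 405B):

```bash
# Point the server at a measurement/quantization config produced beforehand
# with the Habana quantization toolkit (file path assumed):
docker run ... \
  -e QUANT_CONFIG=./quantization_config/maxabs_quant.json \
  ghcr.io/huggingface/tgi-gaudi:2.0.4 \
  --model-id meta-llama/Meta-Llama-3.1-405B-Instruct-FP8
```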

Please inquire with the OPEA team whether they can assist me with the quantization effort, so that I can then try to add speculative decoding with Llama 3.1 8B as the draft model.

Thanks a lot for your Docker container; I will download it and take a look.
For quantization, the OPEA team members @changwangss @thuang6 @kevinintel are responsible for this part; maybe you can consult them.

@regisss do you think that having #222 integrated should also fix this one? I don't have time to test it now, but it would be good to close this if it works, thanks.

I haven't tested it in a while. I gave up on trying to get Llama 405B running on a single node because of the dependency problems that come with any method of quantization, but I assume any half-precision models should work.

I think it should work but I have not tried it yet. @tthakkal Have you already tried to run Llama 3.1?

We tested Llama3.1-8B and Llama3.1-70B, both BF16 and FP8:

Llama3.1-8B on 1 card
https://github.com/huggingface/tgi-gaudi?tab=readme-ov-file#llama31-8b-on-1-card
https://github.com/huggingface/tgi-gaudi?tab=readme-ov-file#llama31-8b-on-1-card-1

Llama3.1-70B on 8 cards
https://github.com/huggingface/tgi-gaudi?tab=readme-ov-file#llama31-70b-8-cards
https://github.com/huggingface/tgi-gaudi?tab=readme-ov-file#llama31-70b-on-8-cards
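For reference, the 1-card example in the README boils down to a launch command roughly like this (image tag and token placeholder assumed; see the links above for the exact version):

```bash
model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data   # cache model weights between runs

docker run -p 8080:80 -v $volume:/data --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e HF_TOKEN=<your HF access token> \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice --ipc=host \
  ghcr.io/huggingface/tgi-gaudi:2.0.4 \
  --model-id $model \
  --max-input-tokens 1024 --max-total-tokens 2048
```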

You shouldn't need 8 cards; two cards are sufficient.
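A 2-card variant would just shard the model with the standard launcher flags; untested here, as noted below:

```bash
docker run -p 8080:80 -v $volume:/data --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e HF_TOKEN=<your HF access token> \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice --ipc=host \
  ghcr.io/huggingface/tgi-gaudi:2.0.4 \
  --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
  --sharded true --num-shard 2
```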

That could work; we just haven't tested it.