OpenMOSS/CoLLiE

Support for LLaMA-2 70B with Grouped-Query Attention

Opened this issue · 18 comments

Due to the Grouped-Query Attention introduced in LLaMA-2 70B (see the related llama issue), it cannot be loaded with the CoLLiE implementation of LLaMA. I hope LLaMA-2 70B can be supported in CoLLiE. Thanks.

Traceback (most recent call last):
  File "/nvme1/gptdata/share1/projects/collie/examples/download.py", line 49, in <module>
    model = LlamaForCausalLM.from_pretrained(model_name, config=config)
  File "/nvme1/gptdata/share1/app/mambaforge/envs/collie/lib/python3.9/site-packages/collie/models/base.py", line 306, in from_pretrained
    state_dict = cls.load_parallel_state_dict(
  File "/nvme1/gptdata/share1/app/mambaforge/envs/collie/lib/python3.9/site-packages/collie/models/llama/model.py", line 414, in load_parallel_state_dict
    part_state_dict[key] = rearrange(
RuntimeError: shape '[8192, 8192]' is invalid for input of size 8388608
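
For reference, the size in the error matches the GQA layout of the 70B checkpoint. A minimal sketch of the arithmetic, using the published LLaMA-2 70B dimensions (variable names are illustrative):

hidden_size = 8192            # LLaMA-2 70B hidden dimension
num_attention_heads = 64
num_key_value_heads = 8       # Grouped-Query Attention: only 8 K/V heads
head_dim = hidden_size // num_attention_heads   # 128

# Q projection weight is still [hidden_size, hidden_size]:
q_numel = hidden_size * hidden_size                      # 67_108_864
# K/V projection weights shrink to [num_key_value_heads * head_dim, hidden_size]:
kv_numel = num_key_value_heads * head_dim * hidden_size  # 8_388_608

# A loader that views every attention weight as [hidden_size, hidden_size]
# therefore fails on wk/wv with exactly this error:
# RuntimeError: shape '[8192, 8192]' is invalid for input of size 8388608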

I got the same error for LLaMA-2 70B.

@kaiwang13 Could you please share how you resolved the issue?

Just uninstall the old version and install the latest one from source code.

I just cloned the repo and installed it from the main branch. But I'm still facing the error. Do I need to install it from any specific branch?

Remove https://github.com/OpenLMLab/collie/blob/c9cc0055a52b96d156450b5734a0a1d0dbde4562/collie/models/llama/model.py#L425C1-L432C64
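
For anyone hitting this, here is a rough sketch of how a GQA-aware version could look, assuming those lines perform the usual rotary-embedding permutation of the q/k projection weights (the function and variable names below are illustrative, not CoLLiE's actual code):

import torch

def permute_for_rotary(w: torch.Tensor, n_heads: int) -> torch.Tensor:
    # Interleave the two rotary halves of each head. This works for both
    # q_proj (n_heads = num_attention_heads) and, under GQA,
    # k_proj (n_heads = num_key_value_heads), because it only assumes
    # w.shape[0] == n_heads * head_dim rather than hidden_size.
    dim1, dim2 = w.shape
    return (
        w.view(n_heads, dim1 // n_heads // 2, 2, dim2)
        .transpose(1, 2)
        .reshape(dim1, dim2)
    )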

This resolved the shape error. But now it seems to be offloading to CPU memory, and the process gets killed because CPU memory is exhausted. I have 450 GB of CPU memory and 4x A100 80 GB. I'm using the LOMO optimizer. Is this expected for LOMO?

What I did does not actually solve the problem. The pretrained state dict cannot be loaded for training without that code.

Okay, so is there any suggestion for how to solve the problem?

Do you mean that the error occurs when loading the pretrained state dict? Could you please share the error log?
Sorry for the late reply.

I was testing this with the main branch. While loading the state dict, CPU memory usage reaches around 550 GB. Yesterday I tried it on a larger instance with 900 GB of CPU memory, and it hit the same shape error at the start of training that @kaiwang13 mentioned. I don't have access to that machine right now to share the log.

However, yesterday I also tested the dev branch. On the dev branch, CPU memory usage was only around 150 GB, but I was getting OOM while saving the checkpoint after the first epoch. See issue #98 about this.

Let me know if this info is enough for you to proceed further.

Thanks for the information! The latest LLaMA-2 support has not been merged into the main branch yet, so errors on the main branch are expected.
We will test the checkpoint-saving process later. Does the OOM occur when saving the checkpoint with ZeRO-3 and LOMO? And is it GPU OOM or CPU OOM?

Yes, I was using ZeRO-3 and LOMO, and I was getting GPU OOM while saving.

Thanks a lot! We will try to fix it

@dittops @x54-729 Additionally, I tried training LLaMA-1 33B with a sequence length of 2048 and a batch size of 1 using AdamW with ZeRO-3 on 8x A100 80 GB. The training process went fine, but I encountered OOM when attempting to save the model.

We've found that the OOM problem is caused by the parameter-gathering process in DeepSpeed's API, and we plan to fix it by gathering parameters one by one.
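
The general idea, as a rough sketch built on DeepSpeed's deepspeed.zero.GatheredParameters context manager (illustrative only, not CoLLiE's actual fix):

import torch
import deepspeed

def gather_state_dict_param_by_param(model: torch.nn.Module) -> dict:
    state_dict = {}
    for name, param in model.named_parameters():
        # Gather only this one ZeRO-3-partitioned parameter, copy the full
        # tensor to CPU, then release the gathered GPU memory on exit.
        with deepspeed.zero.GatheredParameters([param], modifier_rank=None):
            state_dict[name] = param.detach().cpu().clone()
    return state_dict

Gathering one parameter at a time keeps the peak extra GPU memory at roughly one full parameter instead of the whole model.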

@x54-729 Please let me know if you have pushed any updates on this to the dev branch. I can try it out.

Hi, the bug is fixed in the dev branch; you can give it a try.

FYI: 82869ee ac6eed4

I have tested the code. I was able to train and save the model.

I was testing by training on a small dataset that contains the model's identity (in English). But at inference time, the model started generating Chinese instead of English when producing identity-related text.

I was using 70B + LOMO + Stage 3 + transformers 4.32.1.

I have tried encoding and decoding the training data with the tokenizer, and that looks fine. Any thoughts on what could be the issue here?
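
For reference, this is roughly the round-trip check I ran (the model path and sample text here are placeholders, not the actual training data):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
sample = "My name is ... and I was created by ..."
ids = tokenizer(sample, add_special_tokens=False)["input_ids"]
print(tokenizer.decode(ids))  # reproduces the English sample exactly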