Support for LLaMA-2
ayyyq opened this issue · 15 comments
Hi, nice work! I would like to know which parts of the code you have modified in transformers-4.28.1, and how I can add support for LLaMA-2?
I am wondering that too.
A non-exhaustive list of the changes:
- `dola_greedy_decode` (which replaces `sample()`)
- `forward`
Hi,
Sorry for the late reply. The changed files are the ones mentioned above by @garyfanhku.
We are currently working to include DoLa in the latest Hugging Face transformers package. Please stay tuned!
I have merged DoLa decoding into a new version (4.39.0.dev0) of the transformers package.
Install it here: https://github.com/voidism/transformers-dola
Follow the instructions here for decoding: https://github.com/voidism/transformers-dola/blob/main/docs/source/en/generation_strategies.md#dola-decoding
This should support LLaMA-2 as well as newer models such as Mistral and Gemma.
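For reference, here is a minimal sketch of the usage pattern that document describes, assuming the fork above is installed; the model name is only a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any supported causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("On what date was the Declaration of Independence officially signed?",
                   return_tensors="pt").to("cuda")

# Passing dola_layers switches generate() to DoLa decoding: "high" contrasts the final
# layer with the higher part of the layers, "low" with the lower part, and an explicit
# list such as [6, 8, 10] selects specific candidate premature layers.
output = model.generate(**inputs, do_sample=False, max_new_tokens=50, dola_layers="high")
print(tokenizer.batch_decode(output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```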
Hi @voidism Thank you for the pull request and the plan to support Llama 2. I have set things up and tried the default model provided, and it works as expected. However, when I switch to Llama-2 70B (in a multi-GPU setting), I run into this error:
```
src/transformers/generation/utils.py", line 2066, in dola_decoding
    softmax_mature_layer[None, :, :] + softmax_premature_layers
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:7!
```
Given that the model's layers are split across GPUs (which is an expected use case), would you have a suggestion on what I could do to fix this? Thank you
Hi @naveenjafer
It's weird, because I didn't have this error for LLaMA-1 65B. I think you can simply move `softmax_mature_layer` to the same device as `softmax_premature_layers`. I will try to fix this issue later.
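For anyone hitting the same error, here is a hypothetical sketch of the fix suggested above; the tensor names come from the traceback, but the helper itself is just for illustration, not the actual patch:

```python
import torch

def contrast_distributions(softmax_mature_layer: torch.Tensor,
                           softmax_premature_layers: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper, not the actual patch. Under multi-GPU model parallelism the
    # early-exit (premature) distributions and the final-layer (mature) distribution can
    # end up on different devices; moving the mature distribution over avoids the
    # "Expected all tensors to be on the same device" RuntimeError above.
    softmax_mature_layer = softmax_mature_layer.to(softmax_premature_layers.device)
    # Broadcasted sum from the traceback, now with both tensors on one device.
    return softmax_mature_layer[None, :, :] + softmax_premature_layers
```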
Hey @voidism Thank you for getting back! I assume the Llama 1 model would have had its layers split across GPUs too, given how similar the memory requirements are. I will look into this later today, thank you!
Hi @naveenjafer
Yes! For the experiments in my paper, I ran the LLaMA-1 65B model on 8 V100 GPUs and it worked well. Not sure if this issue is due to some difference between LLaMA-1 and LLaMA-2.
Hi, I have tried installing the transformers package above and following the example in https://github.com/voidism/transformers-dola/blob/main/docs/source/en/generation_strategies.md#dola-decoding for mistralai/Mistral-7B-v0.1, but I don't see any difference in output between greedy decoding and setting `dola_layers="high"` at all.
Hi @wj210
Can you also try `dola_layers="low"` for me? I haven't tested Mistral models intensively, but different models may have different properties across their layers, so maybe `dola_layers="high"` does not contrast that much in Mistral for this example. You can also try something like `dola_layers=[6, 8, 10]` and see whether the output changes.
I will examine whether there are any issues with Mistral models if none of the `dola_layers` settings work! Just let me know!
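A small sketch of how one might try all of these settings in one pass; the model name and candidate layer list below are just examples taken from this thread:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # example model discussed below
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("On what date was the Declaration of Independence officially signed?",
                   return_tensors="pt").to("cuda")

# Compare plain greedy decoding against several DoLa layer settings.
for dola_layers in (None, "high", "low", [6, 8, 10]):  # None = no DoLa
    kwargs = {} if dola_layers is None else {"dola_layers": dola_layers}
    out = model.generate(**inputs, do_sample=False, max_new_tokens=50, **kwargs)
    text = tokenizer.batch_decode(out[:, inputs.input_ids.shape[-1]:],
                                  skip_special_tokens=True)[0]
    print(f"dola_layers={dola_layers!r}:\n{text}\n")
```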
Hi,
here's the code I tried:
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
device = 'cuda'
model.to(device)
set_seed(42)
text = "On what date was the Declaration of Independence officially signed?"
inputs = tokenizer(text, return_tensors="pt").to(device)
# Vanilla greddy decoding
vanilla_output = model.generate(**inputs, do_sample=False, max_new_tokens=50)
vanilla_output = tokenizer.batch_decode(vanilla_output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print (vanilla_output)
# DoLa decoding with contrasting higher part of layers (layers 16,18,...,30)
dola_high_output = model.generate(**inputs, do_sample=False, max_new_tokens=50, dola_layers=[6,8,10])
dola_high_output = tokenizer.batch_decode(dola_high_output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print (dola_high_output)
and I got ["\n\nThe Declaration of Independence was officially signed on August 2, 1776. However, it's important to note that not all the delegates signed it on that date. The signing of the Declaration of Independ"] for both.
I tried "high", "low", and different explicit layer lists, and they all yield the same result.
I even tried TinyLlama, a 1.1B variant of LLaMA (TinyLlama/TinyLlama-1.1B-Chat-v1.0), and still see no change.
Also, as a side note: after installing from https://github.com/voidism/transformers-dola my transformers version is 4.40.0.dev0. Could the difference in package version be the cause?
OK, it seems that using `dola_layers` above layer 18 (i.e., two layers above the middle layer) produces different generations. Surprisingly, adding layer 16, which is equivalent to setting the layers to 'high', yields the same result as greedy decoding. Using lower layers does not work either. Is there any reason for this behavior?
Hi @wj210
Thanks for testing this! As different models have different distributions of knowledge stored in their layers, it is reasonable to adjust the selected layer range for new models.
Also, this example of "Declaration of Independence" is picked from TruthfulQA, which contains mainly short-sentence answers with dense factual knowledge. In my experiment, TruthfulQA tends to require contrasting with higher parts of the layers to get improvements. However, for most of the other tasks with longer responses for reasoning, e.g. GSM8K and StrategyQA, contrasting with lower parts of the layers would help more.
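As a rough illustration of that guidance, here is a hypothetical heuristic that just paraphrases the paragraph above; the task names and layer choices are not from the repository:

```python
def pick_dola_layers(task: str) -> str:
    # Hypothetical heuristic based on the observations above: short factual QA such as
    # TruthfulQA tends to benefit from contrasting higher layers, while longer reasoning
    # tasks such as GSM8K or StrategyQA tend to benefit from contrasting lower layers.
    task = task.lower()
    if task in {"gsm8k", "strategyqa"}:
        return "low"
    return "high"  # e.g. truthfulqa; worth sweeping per model in any case

# e.g. model.generate(**inputs, do_sample=False, dola_layers=pick_dola_layers("gsm8k"))
```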
It seems that the original code from the main branch works with LLaMA-2? I am running LLaMA-2 on transformers-4.28.1 from the main branch.
Hey, I am trying to run the code on Mistral, but it isn't supported in transformers 4.28.1. Is there any way to use this codebase with Mistral, in particular the TruthfulQA evaluation (tfqa_mc_eval.py)?