AkariAsai/self-rag

Cannot approach the performance of the uploaded self-rag ckpt when finetuning meta/Llama-2 myself

HazekiahWon opened this issue · 5 comments

Thanks for your inspiring work @AkariAsai .

I ran the script_finetune_7b.sh script myself (using meta-llama/Llama-2-7b-hf and your provided generator data), which I expected to produce a ckpt matching the uploaded one in performance, since both the base ckpt and the training data are the same.

However, the resulting model shows a significant performance gap with respect to the uploaded Self-RAG checkpoint. For example, on TriviaQA my ckpt reaches only 0.503 accuracy, compared with 0.679 for the uploaded ckpt.

I do notice some differences between my final checkpoint directory and the uploaded checkpoint directory:

  1. My reproduced ckpt is saved as *.safetensors, whereas the uploaded ckpt uses *.bin.
  2. I run into the same issue as mentioned in #21: checkpointing stores both a single model.safetensors (which is missing the embedding parameters) and the sharded checkpoints (see the figure below), so #21 suggested removing model.safetensors. I guess you did not encounter this issue.
[screenshot: checkpoint directory listing showing both model.safetensors and the sharded checkpoint files]
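Not part of the original report, but for anyone debugging the same mixed save format, here is a minimal sketch that lists which weight files the trainer actually wrote and whether `model.embed_tokens.weight` ended up in `model.safetensors` or only in the sharded files. The directory name `output/selfrag_7b_repro` is a placeholder for your own output path.

```python
import glob
import os

from safetensors import safe_open  # pip install safetensors

# Hypothetical path to the reproduced checkpoint directory.
ckpt_dir = "output/selfrag_7b_repro"

# List every weight file the trainer wrote (single-file and sharded saves).
weight_files = sorted(
    glob.glob(os.path.join(ckpt_dir, "*.safetensors"))
    + glob.glob(os.path.join(ckpt_dir, "*.bin"))
)
print("weight files:", [os.path.basename(p) for p in weight_files])

# For each safetensors file, check whether the input embedding tensor is present
# and what shape it has (Self-RAG expands the Llama-2 vocabulary with reflection
# tokens, so the row count should be larger than the default 32000).
for path in weight_files:
    if not path.endswith(".safetensors"):
        continue
    with safe_open(path, framework="pt") as f:
        keys = list(f.keys())
        embed_keys = [k for k in keys if "embed_tokens" in k]
        print(os.path.basename(path), "->", len(keys), "tensors")
        for k in embed_keys:
            print("  ", k, tuple(f.get_tensor(k).shape))
```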

I was wondering whether whatever is behind these differences might also explain the performance gap. Do you have any ideas about this?
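One possible way to narrow this down (not from the thread, just a suggestion): diff the reproduced weights against the released ones parameter by parameter. The sketch below assumes the released checkpoint is `selfrag/selfrag_llama2_7b` on the Hugging Face Hub (check the repo README for the exact id) and uses a made-up local path for the reproduction; loading both 7B models in fp16 on CPU needs roughly 28 GB of RAM.

```python
import torch
from transformers import AutoModelForCausalLM

released_id = "selfrag/selfrag_llama2_7b"  # assumed Hub id of the uploaded ckpt
repro_dir = "output/selfrag_7b_repro"      # hypothetical local output directory

released = AutoModelForCausalLM.from_pretrained(released_id, torch_dtype=torch.float16)
repro = AutoModelForCausalLM.from_pretrained(repro_dir, torch_dtype=torch.float16)

repro_params = dict(repro.named_parameters())

# Weights will never match exactly across independent training runs, but a broken
# save (e.g. missing or un-resized embeddings) shows up immediately as a shape
# mismatch or an outsized gap on embed_tokens / lm_head.
diffs = []
for name, p_released in released.named_parameters():
    p_repro = repro_params.get(name)
    if p_repro is None or p_repro.shape != p_released.shape:
        print(f"mismatch: {name} {tuple(p_released.shape)} vs "
              f"{None if p_repro is None else tuple(p_repro.shape)}")
        continue
    diffs.append((name, (p_released - p_repro).abs().max().item()))

# Show the ten parameters that diverge the most.
for name, d in sorted(diffs, key=lambda x: -x[1])[:10]:
    print(f"{name}: max abs diff = {d:.4f}")
```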

It seems I have the same question; I would appreciate any possible solution! @HazekiahWon @AkariAsai

Hello, I encountered the same problem. I wonder how you load the model. I trained a model and got the same set of files, including model.safetensors. When I try to load the model for evaluation by running run_short_form.py, the error below occurs. I deleted model.safetensors as #21 suggested, but that did not solve the problem. @HazekiahWon @Jack-ZC8 @AkariAsai

File "run_short_form.py", line 302, in main
model = LLM(model=gpt, download_dir=args.download_dir,
File "/home/xxx/anaconda3/envs/rag/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 105, in init
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/home/xxx/anaconda3/envs/rag/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 250, in from_engine_args
engine = cls(*engine_configs,
File "/home/xxx/anaconda3/envs/rag/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 110, in init
self._init_workers(distributed_init_method)
File "/home/xxx/anaconda3/envs/rag/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 146, in _init_workers
self._run_workers(
File "/home/xxx/anaconda3/envs/rag/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 755, in _run_workers
self._run_workers_in_batch(workers, method, *args, **kwargs))
File "/home/xxx/anaconda3/envs/rag/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 729, in _run_workers_in_batch
output = executor(*args, **kwargs)
File "/home/xxx/anaconda3/envs/rag/lib/python3.8/site-packages/vllm/worker/worker.py", line 79, in load_model
self.model_runner.load_model()
File "/home/xxx/anaconda3/envs/rag/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 57, in load_model
self.model = get_model(self.model_config)
File "/home/xxx/anaconda3/envs/rag/lib/python3.8/site-packages/vllm/model_executor/model_loader.py", line 72, in get_model
model.load_weights(model_config.model, model_config.download_dir,
File "/home/xxx/anaconda3/envs/rag/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 340, in load_weights
weight_loader(param, loaded_weight)
File "/home/xxx/anaconda3/envs/rag/lib/python3.8/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 80, in weight_loader
assert loaded_weight.shape[parallel_dim] == self.num_embeddings
AssertionError
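For context on the assertion in the traceback: vLLM's embedding weight loader checks that the loaded `embed_tokens` weight has as many rows as it expects from the model config, so the error suggests the weight file it picked up does not match config.vocab_size (Self-RAG adds reflection special tokens on top of Llama-2's vocabulary). Below is a minimal sketch, with a placeholder checkpoint path, for running the same comparison outside vLLM to see which file is the offender before starting the engine.

```python
import os

from safetensors import safe_open
from transformers import AutoConfig, AutoTokenizer

# Hypothetical path to the fine-tuned checkpoint passed to run_short_form.py.
ckpt_dir = "output/selfrag_7b_repro"

config = AutoConfig.from_pretrained(ckpt_dir)
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
print("config.vocab_size:", config.vocab_size, "| len(tokenizer):", len(tokenizer))

# Print the embedding row count in every safetensors file; a file whose
# embed_tokens shape disagrees with config.vocab_size is the one that trips
# the `loaded_weight.shape[parallel_dim] == self.num_embeddings` assert.
for fname in sorted(os.listdir(ckpt_dir)):
    if not fname.endswith(".safetensors"):
        continue
    with safe_open(os.path.join(ckpt_dir, fname), framework="pt") as f:
        for key in f.keys():
            if key.endswith("embed_tokens.weight"):
                print(fname, key, tuple(f.get_tensor(key).shape))
```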

I encountered the problem (assert loaded_weight.shape[parallel_dim] == self.num_embeddings), but I solved it by deleting the file model.safetensors ...

Thanks, I solved this problem the same way!

@HazekiahWon Hello!
Recently, I also trained selfrag-7b based on llama2 and ran into the same problem as you: using the training data and scripts provided by Self-RAG, I fine-tuned llama2-7b to obtain selfrag-7b-myversion. When I evaluated selfrag-7b-myversion, its metrics were not as good as those of the officially released selfrag-7b. I saw that you reported 0.503 accuracy for your fine-tuned model on the TQA dataset. Could you also share your model's results on the PopQA, ARC-Challenge, and PubHealth datasets? Thank you very much, and I look forward to your reply.