zjunlp/EasyEdit

Question Regarding the Experimental Setup for the Reported Results


Hi!

This is an outstanding piece of work—not only as a user-friendly, integrated tool but also as a comprehensive analysis of Knowledge Editing for LLMs!

While reading your paper, "A Comprehensive Study of Knowledge Editing for Large Language Models," I found myself a bit unclear about the detailed experimental setup. For example, when reviewing the result tables, I was curious about specifics such as the number of edits, method-specific parameters (e.g., layers, v_num_grad_steps in ROME), and the selection of edit data (although the source is noted, was the data chosen randomly, and were the results averaged across multiple tests?).

Having a complete description of the experimental settings that correspond to the reported results in your paper would be incredibly helpful—not only for me but also for future readers and researchers looking to build on this work.

Looking forward to your response! :)


Thank you very much for your interest in EasyEdit! Our parameters are the default hyperparameters in EasyEdit. We did not modify them; everything was run with the hyperparameters specified by each original method.
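For reference, loading and using the defaults looks roughly like this (a minimal sketch based on the repository README; the YAML filename depends on your checkout, and the example strings are illustrative):

```python
from easyeditor import BaseEditor, ROMEHyperParams

# Load the unmodified default hyperparameters shipped with EasyEdit; the
# YAML carries the method-specific values (e.g., `layers`, `v_num_grad_steps`).
hparams = ROMEHyperParams.from_hparams('./hparams/ROME/llama-7b.yaml')
editor = BaseEditor.from_hparams(hparams)

# A single illustrative edit:
metrics, edited_model, _ = editor.edit(
    prompts=['Ray Charles, the'],
    ground_truth=['piano'],
    target_new=['violin'],
    subject=['Ray Charles'],
)
```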

OK, I see, thanks!

Apologies for the interruption, but could I confirm the experimental setting for the Edit Succ metric in Section 4.2 (Main Results)? From my understanding, it involves setting edit_times=1, with your team sampling edit data multiple times from the specified sources and then averaging the results. Is that correct?

Yes, your understanding is correct. We only edit the model once and then average the results across the entire dataset.

After the code update, the results in the paper are outdated. We are currently re-checking them, but there seems to be some randomness in the results. Please follow #390 to stay updated.

I think the problem is solved, and we will update the arXiv version this week. The settings used to produce the results are the same as the hparam files in our repository, and the data we used is the KnowEdit dataset on Hugging Face.
You can check run_knowedit_llama2.py to see how we compute the results.
Here, we edit each case in the test JSON and average the per-case metrics, as sketched below.
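For readers who don't want to trace the script, the evaluation described above amounts to roughly the following sketch (the file path, hyperparameter YAML, and metric key are illustrative assumptions; run_knowedit_llama2.py is the authoritative reference):

```python
import json

from easyeditor import BaseEditor, ROMEHyperParams

hparams = ROMEHyperParams.from_hparams('./hparams/ROME/llama-7b.yaml')

with open('knowedit_test.json') as f:  # placeholder for the KnowEdit test file
    cases = json.load(f)

scores = []
for case in cases:
    # A fresh editor per case, so each edit is applied to the unedited model
    # (edit_times = 1); the real script restores weights instead of reloading.
    editor = BaseEditor.from_hparams(hparams)
    metrics, _, _ = editor.edit(
        prompts=[case['prompt']],
        target_new=[case['target_new']],
        subject=[case['subject']],
    )
    acc = metrics[0]['post']['rewrite_acc']  # key name assumed
    scores.append(acc[0] if isinstance(acc, list) else acc)

print('Edit Succ:', sum(scores) / len(scores))
```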

Leave a message if you have any further questions, and please help us close the issue if everything works out.

I will close this issue; you can reopen it if you have further questions.

zxlzr commented

Dear 5456es,

We have fixed the bug and will update the paper on arXiv tomorrow (the README has been updated). We have written a pinned issue statement explaining the cause of this issue and included an announcement in the News. We apologize for any inconvenience caused by this bug.

The statement is as follows.

Dear all:

Recently, with help from the community (special thanks to @StarLooo), we will update the KnowEdit results (Llama2-7b-chat) in Table 4 of the paper 'A Comprehensive Study of Knowledge Editing for Large Language Models'. Overall, the results have improved, primarily for the following reasons:

1. AdaLoRA Optimization: we now follow FT-M instead of FT-L. FT-M trains the same FFN layer as FT-L but uses cross-entropy loss on the target answer while masking the original text (see the masking sketch after this list). This approach not only yields better results but also highlights the optimal performance of AdaLoRA. The installed peft version can also affect performance.

2. ROME and MEMIT Updates: the results were updated after we identified missing components in our local copy of the Llama2-7b-chat files (specifically, the `legacy` entry in tokenizer_config.json). If you use the official Llama2-7b-chat model downloaded directly from HF, this issue should not affect your results. We also fixed a bug related to padding_side for these two methods, which affects performance when computing results for batched inputs (see the tokenizer sketch after this list).
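On point 1, the FT-M-style objective (cross-entropy computed on the target answer only, with the prompt masked out of the loss) can be illustrated as follows; this is a minimal sketch in standard Hugging Face style, not the exact EasyEdit implementation:

```python
import torch


def target_only_loss(model, tokenizer, prompt: str, target: str) -> torch.Tensor:
    """Cross-entropy on the target answer only: the prompt tokens are
    excluded from the loss via the standard ignore index -100."""
    prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids
    full = tokenizer(prompt + ' ' + target, return_tensors='pt')
    labels = full.input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # mask the prompt portion
    return model(input_ids=full.input_ids, labels=labels).loss
```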
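On point 2, you can check both pitfalls on your own setup along these lines (the model path and the `legacy` value are illustrative; consult the official tokenizer_config.json for the correct setting):

```python
from transformers import AutoTokenizer

# Pin the `legacy` flag explicitly rather than relying on a possibly
# incomplete local tokenizer_config.json (the value shown is illustrative).
tok = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf', legacy=False)

# For batched inference with a decoder-only model, pad on the left so that
# generation continues directly from the real tokens.
tok.padding_side = 'left'
tok.pad_token = tok.eos_token  # Llama-2 ships without a pad token
batch = tok(['short prompt', 'a somewhat longer prompt'],
            padding=True, return_tensors='pt')
```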

We deeply apologize for any inconvenience caused by this bug.

We will continue improving EasyEdit and updating this paper, and we welcome everyone to engage in discussion and share ideas.

EasyEdit Team