Unable to reproduce the scores from the paper using the scripts you provide.
kanseaveg opened this issue · 12 comments
Hello, I tried to reproduce the results of StructLM-7B using the script you provided, but the results are clearly inconsistent with the numbers in the paper.
The reproduced log and the reproduced summary are as follows.
StructLM-7B-eval.log
summary.json
Additionally, the generated prediction file is too large, so it won't be uploaded here.
I'm using the same environment as stated in the README, with an A800 GPU. Why is there such a significant discrepancy between my reproduced results and those reported in the paper?
Additionally, to help pinpoint the cause, I am providing the hash values of the weights and data that I downloaded from Hugging Face.
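For reference, here is a minimal sketch of how such hashes can be computed locally before comparing them against the hub values; the file glob below is a placeholder, not the actual checkpoint shard names:

```python
import hashlib
from pathlib import Path

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks
    so large checkpoint shards don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    # Placeholder pattern; substitute the actual downloaded weight files.
    for p in sorted(Path(".").glob("*.bin")):
        print(p.name, sha256_of(str(p)))
```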
Thanks for your comment. Let me reproduce this. I have a feeling this should be a small issue since most datasets experience no performance degradation.
In addition, when I try StructLM-13B, loading also raises an error; please take a look at that as well.
Hi, I'm able to load structLM-13B and begin the prediction process. Can you send me a log of your error?
I also just updated the repo to use a longer maximum input token length, since the one specified in the eval script was too small.
Also, I reproduced your StructLM-7B result. I'm investigating further.
I evaluated StructLM-13B, and it closely reproduces the results in the paper.
@azhx Sorry, the earlier loading failure turned out to be caused by problems in my environment; it runs now.
If I find a large discrepancy in the evaluated results, I will report back on this issue.
Thank you for your patient answer. By the way, have you found the reason for the inconsistent StructLM-7B results?
For StructLM-13B, the scores I inspected at a breakpoint all look correct, with no significant difference. What puzzles me is that the average you compute differs considerably from the average in the paper. I suspect the averaging method in the code deviates from the one used for the paper, so please check and fix it. Also, could you say again what the problem is with StructLM-7B, and whether I should send you my prediction file predictions_predict.json? Thank you for your patient answers, and I look forward to your reply.
What I am curious about is that your average calculation is very different from the average in the paper
Hi, the average in the paper is computed only over the numbers reported in the main table. As you can see here, more than one metric is reported for many datasets, and the secondary ones are less important, so we omit them from the average. The main-table averages are all calculated the same way, so they are comparable. The metrics we use in the main table are specified there.
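To illustrate that averaging scheme, here is a minimal sketch that averages only one designated "main" metric per dataset; the dataset names, metric names, and scores below are hypothetical, not the paper's actual list or numbers:

```python
def main_table_average(summary: dict, main_metrics: dict) -> float:
    """Average only the designated main metric of each dataset,
    ignoring any secondary metrics also present in the summary."""
    scores = [summary[ds][metric] for ds, metric in main_metrics.items()]
    return sum(scores) / len(scores)

# Hypothetical summary.json-style contents with extra secondary metrics.
summary = {
    "spider": {"exec_acc": 70.0, "exact_match": 60.0},
    "tabfact": {"accuracy": 80.0},
}
# Which metric counts as "main" for each dataset (illustrative choice).
main_metrics = {"spider": "exec_acc", "tabfact": "accuracy"}

print(main_table_average(summary, main_metrics))  # -> 75.0
```

The point is that including the secondary metrics (such as exact_match above) in a naive average over every reported number would yield a different value than the main-table average, which may explain a mismatch between a recomputed average and the paper's.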
what problems exist in the effect of StructLM-7B
At this point I suspect there is an issue with the checkpoint that was uploaded. Resources are tight on our cluster and I'm waiting for a node to free up soon. I hope to resolve this issue ASAP. Thanks for your continued interest; I'll update this thread as soon as I make a change.
Thank you for your patient reply; I look forward to the verification results for 7B.
Hi, the 34B results are also out, and they are fully consistent with the paper.
I suspect the gap in StructLM-7B may stem from the original LLaMA trainer used during training.
Anyway, thanks for your patience and help.
In addition, I noticed a curious phenomenon: on the Spider dataset, for example, the 13B and 34B results are surprisingly close.
Normally 34B should show stronger emergent behavior than 13B, but instruction tuning does not seem to improve with scale here.
I think this is a question worth thinking about. @azhx
Hi, thanks for indicating the issue is resolved, but I will continue to update this thread when I am able to address the 7B issue. Sorry about the delay on that and thanks for your patience.
For example, in the spider dataset, the results of 13B and 34B were surprisingly consistent.
Yes, my guess about this result is that data is more important than scaling model size for performance on this niche.
Hi @kanseaveg
We have a new 7B model trained on Mistral that performs better than the original 7B CodeLlama model and reproduces the reported results. Please see https://huggingface.co/TIGER-Lab/StructLM-7B-Mistral/.
Note that it uses a different prompting format, so you should use the following file to run the evaluation:
https://huggingface.co/datasets/TIGER-Lab/SKGInstruct/blob/main/skginstruct_test_file_mistral.json
Let me know if you have any questions.
Thank you for your contribution to the community~ @azhx
BTW, may I ask what the difference is between the instruction data used for the CodeLlama model and that used for the Mistral model?
The training data we used this time omits SlimOrca, and the model was trained with a different prompt format. You can check the model card for details.