baaivision/JudgeLM

Issue running single judgement with references


Hi guys, first of all thank you for the great paper. I am trying the single-answer scenario, i.e. where I have a question, a model-generated answer, and a reference answer. Looking at the code, I am using `gen_model_judgement_single.py`.
The first thing I did was to generate the answer dataset in the desired format:

```python
answer_entry = {
    "question_id": i,
    "question_body": question["question"],
    "decoding_method": "top_p_sampling",  # placeholder value
    "model": "alpaca-native",  # placeholder value
    "text": answer,
    "scores": {"logprobs": -7.0179795026779175},  # placeholder
}
```
I also generated the reference-answer dataset like this:
```python
combined_entry = {
    "question_id": i,
    "question_body": question["question"],
    "decoding_method": "top_p_sampling",  # placeholder value
    "model": "alpaca-native",  # placeholder value
    "reference": {
        "text": answer  # to be updated with the correct reference text
    },
    "scores": {
        "logprobs": -7.0179795026779175  # placeholder
    },
}
```
Then, as stated in the repo, I ran `judgelm_preprocess.py`, which generated a JSON file with the following format:
{"question_id": 0, "score": [{"logprobs": -7.0179795026779175}, {"logprobs": -7.0179795026779175}], "question_body": "question", "answer1_body": " generated answer, "answer2_body": "reference answer", "answer1_model_id": "alpaca-native", "answer2_model_id": "alpaca-native", "answer1_metadata": {"decoding_method": "top_p_sampling"}, "answer2_metadata": {"decoding_method": "top_p_sampling"}}
First question: is it OK for `answer2_body` to be the reference answer?

Then, having this dataset, I run:

```bash
!python ./judgelm/llm_judge/gen_model_judgement_single.py \
    --model-path "BAAI/JudgeLM-7B-v1.0" \
    --model-id 7b-full-model \
    --question-file /root/JudgeLM/judgelm/data/judgelm-val-5k-judge-samples.jsonl \
    --answer-file /root/JudgeLM/judgelm/data/JudgeLM/output \
    --num-gpus-per-model 1 \
    --num-gpus-total 1 \
    --temperature 0 \
    --reference-file /root/JudgeLM/judgelm/data/JudgeLM/combined_questions_answers_ref.jsonl \
    --if-fast-eval 1
```
The first issue I ran into was that, since I was using references, the `copy` function of the conversation template expects the number of answers, but this is the single-answer case, so I had to change this line:

```python
conv = conv_judge_single.copy() if references is None else conv_judge_single_w_reference.copy()
```

to this:

```python
conv = conv_judge_single.copy() if references is None else conv_judge_single_w_reference.copy(answer_num=answer_num_value)
```

passing 1 as `answer_num_value`. So I do not know whether this is a bug, or whether my change is OK?
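For clarity, here is the patched section as it now reads on my side; only the `.copy()` call changed, `answer_num_value` is a variable I introduced, and `conv_judge_single` / `conv_judge_single_w_reference` come from the script's existing imports:

```python
# My local patch in gen_model_judgement_single.py; answer_num_value is my own
# variable for the single-answer case, everything else is unchanged.
answer_num_value = 1
conv = (
    conv_judge_single.copy()
    if references is None
    else conv_judge_single_w_reference.copy(answer_num=answer_num_value)
)
```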
After this change I got the code to run; however, I do not see any judgment in the output. Here is a sample output:
{"question_id": 0, "score": [{"logprobs": -7.0179795026779175}, {"logprobs": -7.0179795026779175}], "question_body": "question", "answer1_body": " generated_answer", "answer2_body": reference_answer", "answer1_model_id": "alpaca-native", "answer2_model_id": "alpaca-native", "answer1_metadata": {"decoding_method": "top_p_sampling"}, "answer2_metadata": {"decoding_method": "top_p_sampling"}, "pred_id": "ie5CkG9JTxcCYmAwt3pwrj", "pred_text": "10", "pred_model_id": "7b-full-model", "tstamp": 1703790064.0357897, "reference": "reference_anwer"}
I was wondering if you could help me run this code properly and point out anything I am doing wrong.
Best
Sergio