baaivision/JudgeLM

preprocess

Closed this issue · 6 comments

Hi! I don't see the preprocessing script. I'm interested in using this as a metric to optimize my data generation by comparing against GPT-4 (single-response evaluation).

When I look in the readme, I see a reference to a judge_preprocess.py script, but I can't find where it is. How are you handling that for the different cases described in the paper? Is there a unifying function that preprocesses based on mode?

Fantastic work, by the way. Crazy this hasn't blown up yet.

Thanks for your interest! We have uploaded the preprocessing scripts. Alternatively, you can directly download our uploaded dataset collection and put the contents in /JudgeLM/judgelm/data to make it easy to use more samples.

Hey @Unrealluver, thanks for your response.

Can you please also share whether you tried PEFT approaches for fine-tuning, and if so, what the results were and how they varied with dataset size.

Also, did you test with a classification/regression head, given that you care about the scores of the two answers?


Thanks for your questions.

  • We plan to update the repo with PEFT approaches in November.
  • JudgeLM first produces scores for the answer pair through natural-language inference. Then, we derive the comparison result (Answer 1 wins / Answer 2 wins / Tie) from the score pair; a rough sketch of this mapping is shown below. Finally, we use the GPT-4-generated comparison results as ground truth to calculate the metrics reported in our paper. Furthermore, we also tried the zero-shot performance of JudgeLM on the PandaLM benchmark.
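For anyone who wants to reproduce the score-to-verdict step, here is a minimal sketch; the output format ("score1 score2" on the first line) and the tie tolerance are assumptions on my part, not the repo's exact parsing code:

```python
# Minimal sketch: turn a JudgeLM-style "score1 score2" output into a verdict.
# NOTE: the exact output format and the tie tolerance are assumptions, not the
# repo's official parsing logic.
def scores_to_verdict(judge_output: str, tie_tol: float = 0.0) -> str:
    """Parse the first line of the judge output (e.g. "8 6") into a verdict."""
    first_line = judge_output.strip().splitlines()[0]
    score1, score2 = (float(tok) for tok in first_line.split()[:2])
    if abs(score1 - score2) <= tie_tol:
        return "Tie"
    return "Answer 1 wins" if score1 > score2 else "Answer 2 wins"

# Example:
# scores_to_verdict("8 6\nAssistant 1 gave a more detailed answer...")
# -> "Answer 1 wins"
```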

Hi @Unrealluver! Thank you for publishing both! I looked through the judgelm/judgelm/data scripts and found how to do local inference. However, that assumes the whole workload arrives at once. I'd like to use this in the middle of my pipeline, evaluating model results as they come in for instant feedback.

What's the best way to do that? I'd guess something like vLLM or TGI, but I'm unsure if you've come up with anything better. Would adapting the web app help?
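For concreteness, this is roughly what I had in mind with vLLM; the model id and prompt template below are my guesses, not the official JudgeLM serving code:

```python
# Rough sketch of on-demand judging with vLLM; the model id and the prompt
# template are guesses, not the repo's official serving code.
from vllm import LLM, SamplingParams

llm = LLM(model="BAAI/JudgeLM-7B-v1.0")          # assumed HF model id
sampling = SamplingParams(temperature=0.0, max_tokens=512)

def judge_single(question: str, answer1: str, answer2: str) -> str:
    # Approximate pairwise-judge prompt; the real template lives in the repo.
    prompt = (
        f"[Question]\n{question}\n\n"
        f"[The Start of Assistant 1's Answer]\n{answer1}\n"
        f"[The End of Assistant 1's Answer]\n\n"
        f"[The Start of Assistant 2's Answer]\n{answer2}\n"
        f"[The End of Assistant 2's Answer]\n\n"
        "[System]\nPlease rate the two answers, giving the two scores on the first line.\n"
    )
    outputs = llm.generate([prompt], sampling)
    return outputs[0].outputs[0].text
```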

Hey @darinkishore, thanks for your suggestions!

We are also exploring a more convenient way to plug JudgeLM into a model's training pipeline.

I will close this issue. If there are more questions, you are welcome to raise issues :)