Tomiinek/MultiWOZ_Evaluation

How to generate the prediction file?

SkyAndCloud opened this issue · 10 comments

Hi, thanks for your great work. I wonder how to generate the prediction results file. For example, there are no related scripts in the UBAR code, so how do I generate the predicted results in the expected format?

Hello, I modified the original code to get outputs in the expected format. Unfortunately, I have already deleted it, sorry 😥, but it should not be difficult to do. For example, UBAR dumps its predictions to a file, but in a different format, so just go through the inference script, construct a dictionary in the described format, and then dump it to JSON.
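For what it's worth, here is a minimal sketch of such a dump. The field names (`response`, `state`, `active_domains`) and the lowercase dialogue-id keys follow the input format described in this repo's README; the hard-coded values are just placeholders that would in practice come from the UBAR inference loop.

```python
import json

# Toy example of the expected prediction format: one list of turns per dialogue id
# (lowercased, without the ".json" suffix); fill it in from the inference loop.
predictions = {
    "pmul4462": [
        {
            "response": "i have [value_count] trains matching your request .",  # delexicalized response
            "state": {"train": {"departure": "cambridge", "destination": "london"}},
            "active_domains": ["train"],  # optional
        },
        # ... one entry per system turn, in dialogue order
    ],
}

with open("ubar_predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```

The resulting file can then be passed to the evaluation script, something like `python evaluate.py --input ubar_predictions.json --bleu --success --richness` (double-check the exact flags against the README).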

I will try to find the files again later this week, but I really do not think I have them anymore. Sorry once again.

Got it. BTW, did you use the released UBAR checkpoint or train it yourself to generate the predictions? I ask because lots of people have failed to reproduce the UBAR results (even with the released checkpoint). reference

Hmm, I see. I used the provided checkpoint, which should give these results. But TBH I think there is something messed up in the code base (like leaking ground-truth data during inference, teacher-forcing later predictions, etc.), but I was not able to figure out what ... 🤐 It would be great if you were able to find out what is happening there.

I do not want to be rude, but maybe I should have left the model out of the evaluation altogether. I like SOLOIST much more.

I think the end-to-end results reported in the UBAR paper are incorrect because they used the golden belief state to generate system actions & responses. We should set the option use_true_bspn_for_ctr_eval=False to do the end-to-end evaluation in the right way.
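If it helps, this is roughly how that could look. It is only a sketch: the import path and the second flag are assumptions based on the DAMD-style config that UBAR builds on, so verify the names against the actual config.py.

```python
# Sketch of an end-to-end evaluation setup; names other than
# use_true_bspn_for_ctr_eval are assumed from the DAMD-style config.
from config import global_config as cfg

cfg.use_true_bspn_for_ctr_eval = False  # query the DB with the generated belief state, not the gold one
cfg.use_true_curr_bspn = False          # if present, also decode actions/responses from generated states
```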

Well, I'm trying to figure out whether there is something wrong (like teacher-forcing later predictions) in the code base. Have you already found something, or do you have any suggestions? Maybe I should turn to SOLOIST. lol

I think I set use_true_bspn_for_ctr_eval=False too, but I still did not trust the results very much 😥

Is there a way to run UBAR in an interactive mode? 🤔

I wonder whether the inputs of the old-version evaluation are the generated responses and generated belief states. Are the generated system actions and generated DB results not included?

I cannot find how SOLOIST generates the belief states and DB results.
Could you give a link?