zlin7/UQ-NLG

CoQA pre-processing

Closed this issue · 6 comments

jsbaan commented

Hey, great work and thanks for releasing the code!

For the CoQA dataset preprocessing, I noticed that you don't prepend the previous (Q: ... A: ...) pairs to the prompt. The semantic entropy paper does prepend these for all (Q, A) pairs of a given story.

Is this intentional?
Thanks!

`for question_index, question in enumerate(questions):`

zlin7 commented

Thanks for the question! I think this could be a legacy issue from when I was using smaller models for experiments: concatenating all previous questions made the input too long, so I left out this step. I'm not sure if this will change the conclusions of the paper, but I'm adding this step back for more experiments as well.

jsbaan commented

Got it, thanks. I suspect it'll affect (decrease) the accuracy at least, since some of the questions are dependent on the earlier question/answer pairs.

zlin7 commented

Ah yes, sorry for not being clear! Accuracy will surely be different, but I was referring more to which UQ metric does better, etc.

Hi, I also have a question regarding the preprocessing of the CoQA dataset. Since CoQA is a conversational QA dataset, the questions under a single story are consecutive and each may depend on the previous ones. For example:

Story: blah, blah, blah...
Q: question 1
A: answer 1

Q: question 2
A:

When answering question 2, it is possible that we must first be given question 1. For example, question 2 might be "Why did they do xxxxx?", but to figure out which people "they" refers to, we need to go back to question 1.

In your preprocessing code, although you comment that the code comes from the Semantic Uncertainty repository (https://github.com/zlin7/UQ-NLG/blob/main/dataeval/coqa.py#L13C1-L14C1), it seems some lines of code are omitted, leading to a preprocessing result like:

Story: blah, blah, blah...
Q: question 2
A:

where question 1 is dropped. The lines of code that are dropped in your version are at https://github.com/lorenzkuhn/semantic_uncertainty/blob/main/code/parse_coqa.py#L51C4-L51C4, compared to the original implementation of Semantic Uncertainty.
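For concreteness, here is a minimal sketch of the two prompt constructions, assuming the official CoQA JSON layout (each entry has a story, a list of questions with input_text fields, and matching answers); the exact field and variable names in both repositories may differ:

```python
def build_prompts(sample, prepend_history=True):
    """Build one prompt per question of a CoQA story.

    With prepend_history=True, every previously answered (Q, A) pair is
    accumulated in front of the current question (roughly what the
    Semantic Uncertainty preprocessing does); with False, each question
    only sees the story, which is the behavior shown above.
    """
    context = sample["story"]
    prompts = []
    for question_index, question in enumerate(sample["questions"]):
        if prepend_history and question_index > 0:
            previous_question = sample["questions"][question_index - 1]["input_text"]
            previous_answer = sample["answers"][question_index - 1]["input_text"]
            # Fold the previous turn into the context so later questions
            # can resolve references like "they" in "Why did they do xxxxx?".
            context += " Q: " + previous_question + " A: " + previous_answer
        prompts.append(context + " Q: " + question["input_text"] + " A:")
    return prompts
```

With prepend_history=False, the prompt for question 2 contains only the story and question 2 (matching the result above); with True, question 1 and answer 1 appear in between.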

Would this have a negative influence on the final performance of the LLMs, given that the previous questions providing the background are dropped? Thanks!

zlin7 commented

Thanks for the question. I actually just finished running the code with the conversation prepended this week (along with some other experiments), and realized that I had actually used the generations with the previous questions prepended for all my experiments in the paper. I found that I had accidentally dropped those lines when I cleaned up and uploaded the code, so if you add them back I believe you will replicate our experimental results. If you run this version (without the previous conversation), the accuracy for LLaMA should be only around 30%, as opposed to 60+%.

I will also make this change along with some other things in the next update. Really sorry for the confusion (and to anyone who used this wrong version of the code)!

zlin7 commented

Just fixed this bug in case it propagates further!