rejection rate of ChatGPT
valdesguefa opened this issue · 8 comments
We use two methods for evaluation: exact match and ChatGPT evaluation. When evaluating rejections, since LLMs sometimes do not completely follow our required rejection text, we need to use ChatGPT to determine whether the model has rejected or not.
(1) All of the ChatGPT evaluation in this paper uses the gpt-3.5-turbo API.
(2) Rej is measured by exact match. If the span "insufficient information" is contained in the generation, the generation is regarded as a rejection.
(3) Rej* is measured by ChatGPT. Although the instruction asks LLMs to generate "I can not answer the question because of the insufficient information in documents." when the document does not contain the answer, LLMs cannot always follow the instruction and may generate unexpected rejection sentences such as "The document does not provide information about xxx." In this case, we use ChatGPT to determine whether the generation can be regarded as a rejection (see the sketch after this list).
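For readers who want something concrete, here is a minimal sketch of the two checks. The prompt wording and helper names are hypothetical, not the exact ones used in the paper; the API call assumes the current `openai` Python SDK (v1 client interface) with the gpt-3.5-turbo model mentioned above.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REJECT_SPAN = "insufficient information"


def is_reject_exact(generation: str) -> bool:
    """Rej: exact match -- count the generation as a rejection if it
    contains the span 'insufficient information'."""
    return REJECT_SPAN in generation.lower()


def is_reject_chatgpt(question: str, generation: str) -> bool:
    """Rej*: ask gpt-3.5-turbo whether the generation is a rejection,
    which also catches unexpected phrasings such as
    'The document does not provide information about xxx.'
    (Hypothetical prompt, not the paper's exact wording.)"""
    prompt = (
        "Decide whether the following answer refuses to answer the question "
        "because the given documents lack the needed information. "
        "Reply with only 'yes' or 'no'.\n"
        f"Question: {question}\nAnswer: {generation}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```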
In this case, if a model has Rej = 35% and Rej* = 45%, can we say that its rejection rate is 35% + 45% = 80%?
No. Rej and Rej* are both rejection rates; they are just obtained in different ways, exact match and ChatGPT respectively. The 45% already includes the generations caught by exact match, so the two numbers are never added: in this example the rejection rate is 35% under exact match and 45% under ChatGPT evaluation.
If the generation contains "insufficient information", ChatGPT will also consider the generation a rejection. Isn't there a risk that the rejections counted in Rej are also counted in Rej*?
Yes, so Rej* is higher than Rej. Rej may miss some rejection generations, so we use ChatGPT to obtain a more precise rejection rate, i.e. Rej*, although human evaluation would be more reliable still.
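As a rough usage sketch (hypothetical helper names, reusing the functions from the sketch above), Rej* is computed over the same generations as Rej and includes every exact-match rejection, so it is an upper bound rather than a disjoint count:

```python
def rejection_rates(samples):
    """Compute (Rej, Rej*) over a list of (question, generation) pairs."""
    rej = sum(is_reject_exact(g) for _, g in samples) / len(samples)
    # Every exact-match rejection is also counted for Rej*, so Rej* >= Rej;
    # the two rates are never summed.
    rej_star = sum(
        is_reject_exact(g) or is_reject_chatgpt(q, g) for q, g in samples
    ) / len(samples)
    return rej, rej_star
```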