chen700564/RGB

Rejection rate of ChatGPT

valdesguefa opened this issue · 8 comments

In the article it says that gpt-3.5-turbo is used to measure the rejection rate. What explains this difference in results for ChatGPT, given that it is used as a reference?

We use two methods for evaluation: exact match and ChatGPT evaluation. When evaluating rejections, since LLMs sometimes do not completely follow our required rejection text, we need to use ChatGPT to determine whether the model has rejected or not.

(1) All mentions of ChatGPT in this paper refer to the gpt-3.5-turbo API.
(2) Rej is measured by exact match: if the span "insufficient information" is contained in the generation, the generation is regarded as a rejection (see the sketch below).
(3) Rej* is measured by ChatGPT. Although the instruction asks LLMs to generate 'I can not answer the question because of the insufficient information in documents.' when the documents do not contain the answer, LLMs do not always follow the instruction and may generate unexpected rejection sentences such as 'The document does not provide information about xxx.' In this case, we use ChatGPT to determine whether the generation can be regarded as a rejection.
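
Here is a minimal Python sketch of how the two checks could look. The helper names, the judge prompt, and the use of the official openai package are assumptions for illustration, not the exact code or prompt used in the paper.

```python
from openai import OpenAI  # assumes the official openai Python package (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Span used for the exact-match check (Rej).
REJECT_SPAN = "insufficient information"

def is_rejection_exact(generation: str) -> bool:
    """Rej: count the generation as a rejection if it contains the span
    'insufficient information'."""
    return REJECT_SPAN in generation.lower()

def is_rejection_chatgpt(question: str, generation: str) -> bool:
    """Rej*: ask gpt-3.5-turbo whether the generation should be read as a
    rejection, even when it is phrased differently (e.g. 'The document does
    not provide information about xxx.'). The prompt wording is illustrative."""
    judge_prompt = (
        "Does the following answer refuse to answer the question because the "
        "provided documents lack the necessary information? Reply Yes or No.\n"
        f"Question: {question}\nAnswer: {generation}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```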

In this case, if a model has Rej = 35% and Rej* = 45%, can we say that its rejection rate is 35 + 45 = 80%?

No. Both Rej and Rej* are rejection rates, but they are obtained in different ways: exact match and ChatGPT evaluation, respectively. They are computed over the same set of cases, so they are two measurements of the same behaviour rather than quantities to be added.
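
As a rough illustration of why the two numbers are not additive, both rates are computed over the same set of negative cases (reusing the hypothetical helpers from the sketch above; the record fields are assumptions):

```python
def rejection_rates(records: list[dict]) -> tuple[float, float]:
    """Compute Rej (exact match) and Rej* (ChatGPT judgement) over the same
    negative cases. Rej* normally covers everything Rej catches, so the two
    are alternative measurements, not parts of a sum."""
    n = len(records)
    rej = sum(is_rejection_exact(r["generation"]) for r in records) / n
    rej_star = sum(
        is_rejection_chatgpt(r["question"], r["generation"]) for r in records
    ) / n
    return rej, rej_star
```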

If the generation contains "insufficient information", ChatGPT will consider the generation a rejection. Isn't there a risk that ChatGPT will also count the Rej cases in Rej*?

Yes, so Rej* is higher than Rej. Rej may miss some rejection generations. We use ChatGPT to obtain a more precise rejection rate, i.e. Rej*, although human evaluation would be better still.