chen700564/RGB

Rejection rate of ChatGPT

valdesguefa opened this issue · 8 comments

In the article it says that gpt-3.5-turbo is used to measure the rejection rate. What explains this difference in results for ChatGPT, given that it is used as a reference?

We use two methods for evaluation: exact match and ChatGPT evaluation. When evaluating rejections, since LLMs sometimes do not completely follow our required rejection text, we need to use ChatGPT to determine whether the model has rejected or not.

(1) All mentions of ChatGPT in this paper refer to the gpt-3.5-turbo API.
(2) Rej is measured by exact match: if the span "insufficient information" is contained in the generation, the generation is regarded as a rejection (see the sketch below).
(3) Rej* is measured by ChatGPT. Although the instruction asks LLMs to generate 'I can not answer the question because of the insufficient information in documents.' when the documents do not contain the answer, LLMs do not always follow the instruction and may generate unexpected rejection sentences such as 'The document does not provide information about xxx.' In this case, we use ChatGPT to determine whether the generation can be regarded as a rejection.
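
Here is a minimal Python sketch of how the two checks could look. The helper names, the judge prompt, and the use of the official openai package are assumptions for illustration, not the exact code or prompt used in the paper.

```python
from openai import OpenAI  # assumes the official openai Python package (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Span used for the exact-match check (Rej).
REJECT_SPAN = "insufficient information"

def is_rejection_exact(generation: str) -> bool:
    """Rej: count the generation as a rejection if it contains the span
    'insufficient information'."""
    return REJECT_SPAN in generation.lower()

def is_rejection_chatgpt(question: str, generation: str) -> bool:
    """Rej*: ask gpt-3.5-turbo whether the generation should be read as a
    rejection, even when it is phrased differently (e.g. 'The document does
    not provide information about xxx.'). The prompt wording is illustrative."""
    judge_prompt = (
        "Does the following answer refuse to answer the question because the "
        "provided documents lack the necessary information? Reply Yes or No.\n"
        f"Question: {question}\nAnswer: {generation}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```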

In this case, if a model has Rej = 35% and Rej* = 45%, can we say that its rejection rate is 35 + 45 = 80%?

No. Both Rej and Rej* are rejection rates, but they are obtained in different ways: exact match and ChatGPT evaluation, respectively. They are computed over the same set of cases, so they are two measurements of the same behaviour rather than quantities to be added.
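
As a rough illustration of why the two numbers are not additive, both rates are computed over the same set of negative cases (reusing the hypothetical helpers from the sketch above; the record fields are assumptions):

```python
def rejection_rates(records: list[dict]) -> tuple[float, float]:
    """Compute Rej (exact match) and Rej* (ChatGPT judgement) over the same
    negative cases. Rej* normally covers everything Rej catches, so the two
    are alternative measurements, not parts of a sum."""
    n = len(records)
    rej = sum(is_rejection_exact(r["generation"]) for r in records) / n
    rej_star = sum(
        is_rejection_chatgpt(r["question"], r["generation"]) for r in records
    ) / n
    return rej, rej_star
```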

If the generation contains "insufficient information", ChatGPT will consider the generation a rejection. Isn't there a risk that ChatGPT will also count the Rej cases in Rej*?

Yes, so Rej* is higher than Rej. Rej may miss some rejection generations. We use ChatGPT to obtain a more precise rejection rate, i.e. Rej*, although human evaluation would be better still.