defog-ai/sql-eval

Possible Error Leading to Inaccurate Evaluation Outcomes


Hey there, I'm no expert, so maybe I'm mistaken (so please verify twice :P), but I think the way you evaluate OpenAI is not the way you intended. I tried to replicate the evaluation and came across this issue.
This is the prompt:
https://github.com/defog-ai/sql-eval/blob/main/prompts/prompt_openai.md

But in the code:
https://github.com/defog-ai/sql-eval/blob/024ff013d02d5fac248fb56b99279cdb16d70aa0/query_generators/openai.py#L123C4-L123C4

When changing the prompt for each question, you call `.format` only on the text of the user prompt, leaving the assistant prompt untouched. You then call the generate function, but your `assistant_prompt` is still:

````
Given the database schema, here is the SQL query that answers `{user_question}`:
```sql
````

The `user_question` parameter is never filled in there. That means the literal placeholder above is exactly what the model gets as the assistant message, which changes the input, and a changed input can change the results, the outcomes, and the insights drawn from them.
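
To make this concrete, here is a minimal sketch of what I mean (the user prompt text and the example question below are made up by me; only the assistant prompt text and the `user_question` name come from your repo):

```python
# Illustrative only: the actual prompt text lives in prompts/prompt_openai.md
# and the message assembly lives in query_generators/openai.py.
user_prompt = "Generate a SQL query that answers `{user_question}`."
assistant_prompt = (
    "Given the database schema, here is the SQL query that answers "
    "`{user_question}`:\n```sql\n"
)

question = "How many users signed up last month?"

messages = [
    # Only the user message is .format()-ed before the API call ...
    {"role": "user", "content": user_prompt.format(user_question=question)},
    # ... so the assistant message still carries the literal placeholder.
    {"role": "assistant", "content": assistant_prompt},
]

print(messages[1]["content"])
# Given the database schema, here is the SQL query that answers `{user_question}`:
# ```sql
```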

@rishsriv @wongjingping

The question is included in the user prompt. Please let us know if you’re able to get better results with other prompts.

I'm aware; I just thought your intention was to include part of the assistant's response with the question repeated, if that is possible, and have the LLM continue its prediction from there. At least, that is how I understood the code. Is my understanding incorrect?

That's a fair hypothesis :) In practice, we found that the assistant prompt matters very little and is generally ignored by OpenAI's ChatCompletion endpoint. We included it to maintain comparability with the other prompts.

If we add the line `assistant_prompt = assistant_prompt.format(user_question=question)`, the results are the same: gpt-4-turbo accuracy is 83% with our current code, and remains 83% with that line added in.

Having said that, we should just fix it in the code anyway. We've opened a PR here, and we'll merge it on Monday. Thanks for reporting!
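
Roughly, against the illustrative snippet from the comment above (not the exact diff in the PR), the change looks like this:

```python
# The one-line fix: fill in the placeholder in the assistant prompt too, so the
# assistant message repeats the actual question before the model continues.
assistant_prompt = assistant_prompt.format(user_question=question)

messages = [
    {"role": "user", "content": user_prompt.format(user_question=question)},
    {"role": "assistant", "content": assistant_prompt},
]
```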

Thank you for taking the time to explain!