My evaluation accuracy is around 50%. Where did I go wrong?

Question

My evaluation accuracy is around 50%. Where did I go wrong?

Closed this issue 4 months ago · 3 comments

I can't use GPT-4, so I changed the code to use Tongyi Qianwen (Qwen) instead and ran generate_question.py with these arguments:

--split=test
--prompt_repr=SQL
--k_shot=9
--example_type=QA
--selector_type=EUCDISQUESTIONMASK

The result file is here
questions.json

Then I ran ask_llm.py. Because Tongyi Qianwen only has a chat completion API and no completion API, I used the ask_chat() function in chatgpt.py. Perhaps for this reason, the output contains other response text in addition to the predicted SQL. These are the arguments I used:

--openai_api_key=sk-***********
--base_url=https://dashscope.aliyuncs.com/compatible-mode/v1
--model=qwen-plus
--question=./out_put
--start_index=0
--end_index=10000

The result file is here
RESULTS_MODEL-qwen-plus.txt

So I wrote a function to extract the SQL from the output file.
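A minimal sketch of what such an extraction function might look like. It is not the function actually used here: the name extract_sql and the heuristics are assumptions for illustration. It prefers a fenced code block, then falls back to the first SELECT in the text.

```python
import re

# Hypothetical helper: the name and heuristics are illustrative assumptions.
FENCE = "`" * 3  # Markdown code fence, built this way to avoid a literal fence here
_FENCED = re.compile(FENCE + r"(?:sql)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)
_SELECT = re.compile(r"\bSELECT\b.*", re.DOTALL | re.IGNORECASE)


def extract_sql(response: str) -> str:
    """Return the first SQL statement found in a chat response, on one line.

    Returns an empty string when no SELECT is found, so the caller can
    decide how to handle that example.
    """
    fenced = _FENCED.search(response)
    text = fenced.group(1) if fenced else response
    match = _SELECT.search(text)
    if match is None:
        return ""
    sql = match.group(0)
    # Keep only the first statement. Limitation: a semicolon inside a
    # string literal would cut the query short.
    sql = sql.split(";", 1)[0]
    # Drop any explanation that follows after a blank line.
    sql = re.split(r"\n\s*\n", sql, maxsplit=1)[0]
    # Collapse whitespace so that each prediction occupies one line.
    return " ".join(sql.split())
```

One known weakness: if the response has no fenced block and the prose uses the word "select" before the query, the fallback starts at the wrong place. Such cases need a manual check.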
After that, I ran evaluation.py from the Spider project, and the accuracy was around 50%.
The direct cause seems to be that many predicted SQL statements raise an exception when get_sql() in process_sql.py is invoked.
Here is the console output of the Spider evaluation, which I copied to a file.
evalution_result.txt

So where did I go wrong?

Answer 1 · 2024-11-25T02:54:01.000Z

Hi. We ran into the same issue with an earlier version of Qwen. The output included both SQL and analysis, which led to parsing errors. This may be because much of Qwen's SFT data gives an analysis before the answer. Perhaps you could try the latest version, Qwen2.5-coder.

Answer 2 · 2024-11-27T03:35:45.000Z

Thank you for your reply.
I found that in Spider's evaluation.py, an exception thrown by get_sql() in process_sql.py reduces the execution accuracy. Some column aliases in the SQL also cause get_sql() to throw an exception, even though the SQL is actually correct.

I switched the evaluation to https://github.com/taoyds/test-suite-sql-eval . There, an exception thrown by get_sql() does not reduce the execution accuracy, which I think is reasonable.

The final execution accuracy under test-suite-sql-eval is around 80%, with either qwen2.5-coder-32b or Qwen-plus.

Answer 3 · 2024-11-27T03:42:45.000Z

By the way, the exact-match accuracy is always around 50%. I think the reason is that the exact-match code in the Spider and test-suite-sql-eval projects is imperfect: it cannot reliably tell whether two SQL statements are equivalent.
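To illustrate the gap between matching on form and matching on results, here is a small sketch using Python's sqlite3 module. It is not the code of either evaluation project. The table, the two queries, and the helper same_execution_result are made up for the example, and plain string comparison stands in for the strictest kind of form-based match.

```python
import sqlite3
from collections import Counter


def same_execution_result(conn, predicted_sql, gold_sql):
    """Compare two queries by their result rows, ignoring row order."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that does not execute counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    # Counter keeps duplicate rows, so this is a multiset comparison.
    return Counter(predicted_rows) == Counter(gold_rows)


# Toy database, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)",
    [("Ann", 25), ("Bob", 41), ("Cy", 33)],
)

gold = "SELECT name FROM singer WHERE age > 30"
pred = "SELECT s.name AS singer_name FROM singer AS s WHERE s.age > 30"

print(pred == gold)                             # False
print(same_execution_result(conn, pred, gold))  # True
```

The two queries differ only in aliases, so they fail a comparison on form but return the same rows. Two caveats apply to this simplification: ignoring row order is wrong for queries with ORDER BY, and a single database can make two non-equivalent queries look equal by coincidence.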