defog-ai/sql-eval

Incorrect / Inconsistent Test Cases

Closed this issue · 7 comments

I want to thank you for for open sourcing this! This is a great framework and I hope to have some contributions soon!

I wanted to open this to track a few issues I came across in the eval / test cases. I will try to do a more comprehensive check this week and have a PR fixing a few things, but wanted to have this open to track until then.

I have found a couple test cases that have incorrect expected sql and some that can have right answers that are currently being marked wrong:

Incorrect Expected SQL

Example:

What are the top 3 titles of the publications that have the highest number of references cited, ordered by the number of references cited in descending order?

Current Expected we need to find actual reference count and not order by reference number

SELECT publication.title FROM publication ORDER BY publication.reference_num DESC LIMIT 3;

Suggested Expected:

SELECT p.title, COUNT(c.cited) AS reference_count
FROM publication p
JOIN cite c ON p.pid = c.cited
GROUP BY p.title
ORDER BY reference_count DESC
LIMIT 3;

Please provide a list of business names in New York and their average ratings ordered by the highest average rating first.

Current_Expected _we need to do the average based on reviews

SELECT business.name, business.rating FROM business WHERE business.city ILIKE '%New York%' ORDER BY business.rating DESC NULLS LAST;

Suggested Expected

SELECT b.name, AVG(r.rating) AS average_rating
FROM business b
JOIN review r ON b.business_id = r.business_id
WHERE b.city ILIKE '%New York%' AND b.state ILIKE '%NY%'
GROUP BY b.name
ORDER BY average_rating DESC;

Date Validation Inconcistency

Example

How long has it been since the last admitted student?
Current Expected

SELECT CURRENT_DATE - max(admit_term) AS duration_since_last_admitted_student FROM student;

This assumes the user is wanting to know how many days, which is 1842, however, there are other correct answers we don't check for sure as:

SELECT AGE(CURRENT_DATE, MAX(admit_term)) AS duration_since_last_admitted
FROM student;

which is still correct, however, the output is more human readable: 5 years 16 days.

I think a couple options could be either:

  1. Update prompt to specify number of days since last admitted student
  2. Add ability to check for date equivalence

Ordering

We can have the same results, but a different order.

Example

Which states border the state where lake ontario is?

Current Expected

SELECT border_info.border FROM border_info JOIN lake ON border_info.state_name = lake.state_name WHERE lake.lake_name ilike '%Ontario%' ORDER BY border_info.border NULLS LAST;

Also Valid

SELECT b.border
FROM lake l
JOIN border_info b ON l.state_name = b.state_name
WHERE l.lake_name ILIKE '%ontario%'
AND b.state_name = l.state_name;

Thanks very much for this, Caleb! Much appreciated, and these make a ton of sense. We'll go through these and make fixes in the next few days. Also paging @wendy-aw!

Thanks very much for this, Caleb! Much appreciated, and these make a ton of sense. We'll go through these and make fixes in the next few days. Also paging @wendy-aw!

Thank you! I will try and help do a deeper analysis tomorrow. Cheers!

Hi Caleb! Thank you so much for bringing these to our attention and providing valuable feedback! These detailed notes are most helpful and we’ve tried to address them below.

What are the top 3 titles of the publications that have the highest number of references cited, ordered by the number of references cited in descending order?

The descriptions of every column can be found in the respective json files in the defog-data repo. For ease of discussion, we’ve extracted them here.

# cite table
cited: "ID of the publication being cited"
citing: "ID of the publication that is citing another publication"
# publication table
reference_num: "The number of references cited by the publication"
citation_num: "The number of citations received by the publication"

In other words, reference_num = count(c.citing) while citation_num = count(c.cited). So the both the queries below should give the same results. (We keep just the first since it produces the most minimal dataframe without the count column.)

SELECT publication.title 
FROM publication 
ORDER BY publication.reference_num DESC LIMIT 3;

SELECT p.title, COUNT(c.citing) AS reference_count
FROM publication p
JOIN cite c ON p.pid = c.citing
GROUP BY p.title
ORDER BY reference_count DESC LIMIT 3;

That said, thanks to you raising this issue, we found a discrepancy in data values where the reference_num col wasn’t matching the citing col. We’ve since edited it here. So you can update your local database with the new data and both the above queries should give the same result.

How long has it been since the last admitted student?

This is a very good point! We’ve further clarified the question to specify the duration in days: How long has it been in days since the last admitted student?

Which states border the state where lake ontario is?

It is true that the order does not matter in this case. We’ve thus removed the ORDER BY clause in the query.

Just FYI, we do perform normalization of the resulting dataframes in cases like these where the ordering doesn’t matter. However, your comment led us to find that our regex pattern broke down in this particular question because the word “border” contains the word “order”. This has since been fixed in the latest PR.

Many thanks again, Caleb for raising these issues and helping us to make improvements! Do definitely let us know if you spot anything else or have any other feedback.

@wendy-aw Thanks for the quick turnaround. I just found these when looking for answers that were marked wrong, but actually right, but a different format.

Makes sense! Thanks! :)

Does the SQL need to be updated for Please provide a list of business names in New York and their average ratings ordered by the highest average rating first. to include calculating the average?

Apologies I missed out on that one. I realized that there was a redundant aggregate column business.rating that didn't quite match review.rating. Have thus removed business.rating from our data here and also revised all queries that had the column to include calculating the average with AVG(review.rating) here.

Thank you so much again :) !

Closing ✅