Data set mismatch: empty queries results and reproducibility issues

Question

Data set mismatch: empty queries results and reproducibility issues

Bouncner opened this issue 3 years ago · 7 comments

We execute the join order benchmark as one of our "default" benchmarks in Hyrise. We recently found that several queries yield empty results and wondered if this might be a problem with Hyrise, the data set, or if it is part of the JOB by intention.

One example is query 32a, which selects k.keyword ='10,000-mile-club'. For our data set (we generated the data using the "frozen data set"), there is no such keyword but rather 10000-mile-club (see also here). Interestingly, this keyword with the comma exists in the CWI data set.

As a query's performance can thus differ vastly between the two data sets due to the empty results, we would consider it a bug in either the CWI data set or the query's SQL. Should this be fixed by adjusting the dataset generation from frozen data or by adjusting the query's SQL by removing the , from 10,000...?

For other queries, we found that inner joins following several filters lead to empty result sets (e.g., query 5a). We are again curious if this is done by intention? As far as we can tell, this is the case for both data sets (CWI and frozen).

Pinging @Bensk1 here as well as he did the data generation test.

Answer 1 · 2021-07-01T16:38:06.000Z

@Bouncner - Happy to incorporate any changes to make this better/easier.
Saw this tweet - https://twitter.com/hyrise_db/status/1410207307024187405?s=20

Answer 2 · 2023-03-14T07:44:12.000Z

Hey, I also met this problem in Postgres!

Firstly, thanks for your great work!

I run JOB in PostgreSQL 15 and found queries 2c.sql, 5a.sql, 5b.sql, 10b.sql and 32a.sql will yield empty results. Such queries will be processed quickly even if they have complex join relations, making them useless to catch the relation between the runtime and the query's complexity.

I am also curious about if this is done by intention. If not, will you try to replace such queries with empty results?

Answer 3 · 2023-03-14T09:22:17.000Z

@HennyNile : are you using the static data set that is listed in the initial paper?

Also, take a look at this pull request. We contacted Viktor Leis some time ago and he confirmed that there are empty results and they are on purpose. But I think 2c should yield a non-empty result.

Answer 4 · 2023-03-14T12:20:50.000Z

@Bouncner For convenience, I just used a dumped dataset in Harvard website which even has fewer rows than the initial dataset you mentioned. Maybe I should change to use the initial dataset.

There is no problem As the CSV files of the paper data set do not follow RFC 4180 (i.e., using "" to escape an " in a string, which Hyrise assumes to be the case for CSV files) in Postgres.
For example, in your pr, you mentioned

Exemplary lines before adaption:

movie_info.csv: 127472,2130098,13,"FACT: Dunn uploads a file from an Apple Powerbook in \"C:\\\", which would be appropriate for a DOS/Windows system.",
person_info.csv: 2010004,1163574,25,"CD: \"All-Time Hits, Vol. 2\"\\",

Reformatted:

movie_info.csv: 127472,2130098,13,"FACT: Dunn uploads a file from an Apple Powerbook in ""C:\"", which would be appropriate for a DOS/Windows system.",
person_info.csv: 2010004,1163574,25,"CD: ""All-Time Hits, Vol. 2""\",

In PG, it is

table movie_info: 127472, 2130098, 13, FACT: Dunn uploads a file from an Apple Powerbook in "C:\", which would be appropriate for a DOS/Windows system.
table person_info: 2010004, 1163574, 25, CD: "All-Time Hits, Vol. 2"\

So, I think the empty results in Postgres are not caused by this problem.

For 2c, there are only 2 filter predicates and I found both predicates only fetch one row. This may lead to an empty result.

There are 5 empty results in my experiments. In my opinion, the empty results of my experiments are intentional or caused by the wrong dumped dataset. When I test the initial dataset, I will share the result with you.

Answer 5 · 2023-03-15T13:20:39.000Z

Viktor Leis wrote to us "If I remember correctly, there are around 10 queries with empty results sets.".

He strongly suggested to use only the data set used in the paper. At least for us, that removed a number of empty results.

Answer 6 · 2023-03-17T05:29:41.000Z

I see. I will use the initial dataset used in the paper in the later work. Many thanks!

Answer 7 · 2023-03-21T15:46:07.000Z

@Bouncner - Happy to incorporate any changes to make this better/easier.
Saw this tweet - https://twitter.com/hyrise_db/status/1410207307024187405?s=20

Sorry, that took a while. :(