Write query performance tests
jamesaoverton opened this issue · 12 comments
The primary goal of the test data set in #16 is to test VALVE load performance. I'd also like to use it to test query performance. These tests will be similar to what we're doing in sprocket, and can be written in Python. I'm mostly interested in JSON output. Given a table T, all these tests should account for rows in both T and T_conflict. I want to measure how long it takes to do the following (see the sketch after this list):
- count all rows
- count all rows matching a WHERE LIKE condition on a column that is not indexed
- return 100 rows as JSON with simple values:
{"row_number": 99900, "col1": "val1", ...}
- return 100 rows as JSON with "cell" objects:
{"row_number": 99900, "col1": {"value": "val1", "messages": []}, ...}
- return the first 100 rows that have errors as JSON with cells
- export the message table for T
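For concreteness, here is a minimal sketch of what such timing tests could look like, using Python's sqlite3 and json modules. The table name table2, column bar, offset, and database path are placeholders, and the cell-object and message-table queries are omitted because they depend on VALVE's metadata layout; the actual test script may be structured quite differently.

import json
import sqlite3
import time

def timed(conn, sql, params=()):
    """Run a query, returning (elapsed seconds, fetched rows)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return time.perf_counter() - start, rows

conn = sqlite3.connect("build/valve_perf.db")  # placeholder database path
conn.row_factory = sqlite3.Row

results = {}
for table in ("table2", "table2_conflict"):
    # 1. count all rows
    elapsed, _ = timed(conn, f"SELECT COUNT(*) FROM `{table}`")
    results[f"count_all_rows {table}"] = elapsed
    # 2. count rows matching a LIKE condition on an unindexed column
    elapsed, _ = timed(
        conn, f"SELECT COUNT(*) FROM `{table}` WHERE bar LIKE ?", ("a%",))
    results[f"count_like {table}"] = elapsed
    # 3. time fetching 100 rows, then serialize them as JSON with simple values
    elapsed, rows = timed(conn, f"SELECT * FROM `{table}` LIMIT 100 OFFSET 900")
    json.dumps([dict(row) for row in rows])
    results[f"rows_as_json {table}"] = elapsed

print(json.dumps(results, indent=2))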
Once this is in place, I have some questions:
- For 3 and 4 (the two 100-row JSON queries), does an offset make any difference?
- Is it faster to query T_view once, or T and T_conflict separately?
Checked with @jamesaoverton to clarify Question 1, and the idea here is to check whether querying from the beginning vs. the end of the table makes a difference (e.g. select ... limit 100 offset 10000).
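To make the two questions concrete, a sketch along these lines could compare the timings directly. The database path is a placeholder, and T_view is assumed here to combine T with T_conflict; the actual view definition in VALVE may differ.

import sqlite3
import time

def timed(conn, sql, params=()):
    """Run a query and return the elapsed time in seconds."""
    start = time.perf_counter()
    conn.execute(sql, params).fetchall()
    return time.perf_counter() - start

conn = sqlite3.connect("valve.db")  # placeholder database path
total = conn.execute("SELECT COUNT(*) FROM T").fetchone()[0]

# Question 1: does the offset make a difference (beginning vs. end of the table)?
for offset in (0, max(total - 100, 0)):
    elapsed = timed(conn, "SELECT * FROM T LIMIT 100 OFFSET ?", (offset,))
    print(f"offset {offset}: {elapsed:.4f}s")

# Question 2: one query against the view, or two against the underlying tables?
view = timed(conn, "SELECT * FROM T_view LIMIT 100")
separate = (timed(conn, "SELECT * FROM T LIMIT 100")
            + timed(conn, "SELECT * FROM T_conflict LIMIT 100"))
print(f"view: {view:.4f}s, T + T_conflict: {separate:.4f}s")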
I wrote a test script, https://github.com/ontodev/valve.rs/blob/perf-tests/test/perf_test_data/query_tests/query_tests.py, which has the following usage:
usage: query_tests.py [-h] [--vacuum] runs table column like limit offset db

positional arguments:
  runs        The number of times to repeat each test
  table       The name of a table to run the tests on
  column      The name of the column to run the tests on
  like        The initial substring to match the column value against
  limit       Maximum number of rows to fetch in LIMIT queries
  offset      Offset to use for LIMIT queries
  db          The database to query against

optional arguments:
  -h, --help  show this help message and exit
  --vacuum    Clear cache after every query
I then ran it using the following command line arguments on a database table, table2, containing 10,000 rows:
test/perf_test_data/query_tests/query_tests.py --vacuum 100 table2 bar a 100 900 build/valve_perf.db
I then beautified the resulting JSON using DuckDuckGo's JSON beautifier tool: https://duckduckgo.com/?t=ffab&q=json+format&ia=answer
I am attaching the result.
output_pretty.txt
I haven't tested the performance of using the export script to export the message table yet.
This is great!
My feeling is that something like 10 runs is enough, but 10,000 rows is not enough. I have tables with several million rows that I want to work with quickly. The tests should run over 100,000 or maybe a million rows. I would like to see a summary without all the hundreds of result rows.
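A summary-only report along those lines might look like the following sketch, which repeats each query and keeps only the timing statistics. The run_test helper, the table and column names, and the output shape are hypothetical, not the actual query_tests.py implementation.

import json
import sqlite3
import statistics
import time

def run_test(conn, sql, runs=10):
    """Repeat a query `runs` times and return summary timing statistics (seconds)."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # fetch, but do not report, the rows
        timings.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "min": min(timings),
        "max": max(timings),
        "mean": statistics.mean(timings),
    }

conn = sqlite3.connect("build/valve_perf.db")  # placeholder database path
summary = {
    "count_all_rows": run_test(conn, "SELECT COUNT(*) FROM table2"),
    "count_like_unindexed": run_test(
        conn, "SELECT COUNT(*) FROM table2 WHERE bar LIKE 'a%'"),
}
print(json.dumps(summary, indent=2))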
For beautifying JSON I like jq.
Unfortunately the data generation script is not the most efficient. I'll kick off generation of 100,000 rows and see how long it takes. 1,000,000 rows will not be possible with the current script, but it might be possible to optimize it.
(to clarify, I don't mean https://github.com/ontodev/valve.rs/blob/perf-tests/test/perf_test_data/query_tests/query_tests.py, I mean https://github.com/ontodev/valve.rs/blob/perf-tests/test/generate_random_test_data.py)
(BTW, thanks for suggesting jq. Although I used DuckDuckGo above, I normally like to use Python's json module for this, in the following way: COMMAND | python3 -m json.tool. The output is usually pretty nice.)
Anyway, to help with the performance issue, I've made a small modification to the data generation script. It still isn't the most efficient, but the change is enough to let us generate 6 tables of 100,000 rows each in a not-completely-unreasonable amount of time, given that we only have to do it once: ~1hr.
I also changed the query_tests.py script so that the actual rows are not included in the results, as @jamesaoverton requested.
Unfortunately the load/validation via VALVE is taking a loooong time to run. It seems to me that the main culprit is the foreign key check which needs to be done while loading some of the tables. The tables without foreign keys load pretty quickly even at 100,000 rows. The tables with foreign keys take forever. One table took almost three hours. There are about three tables like that one! The load has been running since this morning and it is now 3:30PM Munich time.
Anyway I am just going to wait for it to finish and then I will make sure not to delete that database. After it is done I will then run:
test/perf_test_data/query_tests/query_tests.py --vacuum 10 table2 bar a 100 99900 build/valve_perf.db | python3 -m json.tool
and post the results.
I'm thinking of these query performance checks as distinct from the validation tests in #24. To test validation and loading, 10,000 rows across several tables is good. To test queries (this issue) we only need one table, but it has to be bigger. That one table could be from #24 but it doesn't have to be. It doesn't even have to be loaded using VALVE.
Agreed that these tests are separate and do not need to be linked. The problem is just that I don't have a table that big to test on and don't have a way to generate one other than to use valve. I didn't realise it would take quite this long to load, but now that it's started I'd rather just let it run to completion. It's just running in the background now (I'm not actively monitoring it, just occasionally). Hopefully it will be only another few hours. Otherwise I'll look into another way to do it tomorrow.
Your last comment made me realise, though, that I could modify table.tsv and column.tsv to eliminate all but the one table I want to load for the query test, so no foreign key validation should need to be performed. It might still take a little while to load that one table, but hopefully not anywhere near the time it was taking to load all six tables.
The only potential downside is that without the other tables there will be fewer errors, which could make the "first 100 rows that have errors as JSON with cells" test less meaningful. Anyway we'll see.
Ok, I used a table without a lot of foreign, tree, or under keys defined on it, and since there weren't enough datatype errors in the table I just manually added about 120 or so. Attached is a similar JSON to what I uploaded yesterday, but this table has 100,000 rows in it.
output_pretty.txt
I was wondering why so few errors were being generated by the data generation script and realised that there is a bug in it that makes it almost impossible for an error to be generated unless it is a structural error. I have now fixed that bug and am going to run the query tests again once I have loaded up a table's worth of data. That should give us a more representative sample. Note that I am generating data with 5% errors in it.
Ok, here is some better data. The table, table1, has 100,000 rows. 2,476 of these rows have metadata. In addition there are 2,541 rows in table1_conflict.
Note that specifying an offset of N does not mean (as I originally thought) querying the table beginning from row N. Rather, it means running the query over the whole table and then skipping the first N rows of the result set; for example, LIMIT 100 OFFSET 2000 returns rows 2001 through 2100 of the results. With that in mind, I specified an offset of 2000 by running the script with the following command line options:
test/perf_test_data/query_tests/query_tests.py --vacuum 10 table1 base a 100 2000 build/valve_query.db | python3 -m json.tool | tee output_pretty.txt
Here are the results:
output_pretty.txt
After getting some feedback from @jamesaoverton about precisely what the contents of these tests should be, I produced the following Excel spreadsheet with the appropriate statistics. Note that some of the tests included here are related to #27.
Schema comparison.xlsx