Constant pages_per_task parameter causes long query times on certain tables
Closed this issue · 1 comment
What happens?
The Postgres scanner generally works really well, but certain tables show long query times, taking about an hour to load in full, while other tables of similar size load much faster. We investigated to find the cause of the slow ingestion from Postgres and found that it was a parameter controlling the number of pages fetched per parallel task during data download. After adjusting 'pages_per_task' from its default of 1000 to 1,000,000, the average load time dropped from roughly an hour to 467.30 seconds. That's why I believe 'pages_per_task' should be exposed as a config parameter, and ideally an optional one. I've made a PR with a solution where the parameter is required, since I don't know whether the extension must stay on C++14, and I don't see how to make it optional without extra modules or bumping to C++17.
To Reproduce
To reproduce, we need a table that contains a very large number of rows and columns.
OS:
any
PostgreSQL Version:
any
DuckDB Version:
any
DuckDB Client:
any
Full Name:
Damian Petta
Affiliation:
.
Have you tried this on the latest master branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree
Here is the PR I mentioned above: #99