duckdb/duckdb-postgres

const pages_per_task parameter cause long query times on certain tables

Closed this issue · 1 comments

What happens?

The Postgres scanner works really well, but certain tables, in particular, demonstrate long query times, taking about an hour to load the full table. Some tables of similar size, on the other hand, work much faster. Therefore, we started investigating to find the cause of the slow ingestion from Postgres. It turned out that this was caused by a parameter that controls the number of rows per page returned in parallel during data downloading. After adjusting the 'pages_per_task' parameter from the default value of 1000 to 1,000,000, the average time reduced to 467.30 seconds from the initial one hour. That's why I believe that 'pages_per_task' should be exposed as a config parameter I think it would the better if the parameter will be optional. I've made a PR with solution where parameter is required (since I don't know if actual c++14 must stay or not and I don't have any idea how to manage this without extra modules or bumping the C++ version to C++17)

To Reproduce

To reproduce we need a table that contain very large amount of rows and columns.

OS:

any

PostgreSQL Version:

any

DuckDB Version:

any

DuckDB Client:

any

Full Name:

Damian Petta

Affiliation:

.

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

Here is a PR that I previously mentioned -> #99