BigQuery foreign data wrapper does not get all rows
reisepass opened this issue · 8 comments
Bug report
It appears that the BigQuery foreign data wrapper does not bring all rows over to Postgres in some situations.
Describe the bug
As suggested in the blog post, I am using the FDW for a very simple ETL sync job: just pulling any new data from BigQuery into Postgres. But it appears that only an arbitrary cut-off number of rows (35000) is being considered as input.
To Reproduce
```sql
create foreign table bq_entries (
id INTEGER
,project_shortname text
,user_shortname text
,description text
,user_id INTEGER
,firstname text
,lastname text
,email text
,duration INTEGER
,day DATE
)
server my_bigquery_server_lab
options (
table 'entries_join', -- Note that this is a view inside bigquery
location 'europe-west6',
startup_cost '42'
);
select count(*) from local_entries; -- 149932, the count before the new insert. This data was loaded initially via CSV; now I am setting up a recurring job to pull new data from BigQuery.

insert into local_entries
select b.* from bq_entries b
where b.id not in ( select id from local_entries );

select count(*) from local_entries; -- 170612, the resulting merged count is smaller than expected (20680 rows were added by the insert; expected 87729 to be added)
select count(*) from bq_entries; -- 35000, the wrong count. On BigQuery there are 258341 rows. It is also strange that this is such a round number; 35000 looks like some kind of data import limit setting.
select count(distinct id) from bq_entries; -- 258341, the correct count. <-- This is most interesting: a different type of aggregation gets the correct count. It is of course also very strange that count(distinct) would see more rows than a plain count(*).
```
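One thing that might help narrow this down (a speculative diagnostic on my side, not something from the Wrappers docs): comparing the plans for the two counts, since the differing results suggest the two aggregates are executed against the remote source differently.

```sql
-- Speculative diagnostic: compare how the planner handles the two aggregates.
-- If the FDW exposes the remote query or row estimates in the plan output,
-- the two forms may show different remote SQL or fetch sizes.
explain (verbose, analyze) select count(*) from bq_entries;
explain (verbose, analyze) select count(distinct id) from bq_entries;
```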
The problem can be simplified. Even pulling the data straight across only brings over 35000 rows; see below.
```sql
create table tmp2_bg_entries as select * from bq_entries;
select count(*) from tmp2_bg_entries; -- 35000
```
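As a further probe (just an assumption on my part that the cut-off could be some fetch/limit setting, not confirmed behaviour), it may be worth checking whether an explicit LIMIT above the cut-off changes how many rows come back:

```sql
-- Hypothetical probe: does asking for more rows explicitly move the 35000 cut-off?
select count(*) from (select * from bq_entries limit 100000) t;
```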
Idk, is it a coincidence that this R BigQuery package also has a default limit of 35000? https://cran.r-project.org/web/packages/bigrquery/bigrquery.pdf
OMG, so strange. If I don't pull data from a view but pull it directly from a table, the limit of rows transferred is even lower.
```sql
select count(*) from bq_engries_ch; -- 17500, incorrect
select count(distinct id) from bq_engries_ch; -- 296002, correct
create table tmp3_bg_entries_ch as select * from bq_engries_ch;
select count(*) from tmp3_bg_entries_ch; -- 19444 <- a really strange number of rows being transferred
```
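To see which slice of the table actually arrives (another speculative check from my side; `id` is just my own column here, not anything prescribed by the wrapper), comparing the id range of the copied table against BigQuery might show whether the missing rows are cut off at one end or scattered:

```sql
-- Hypothetical check: does the transferred slice cover the full id range?
select min(id), max(id), count(*) from tmp3_bg_entries_ch;
-- Run the equivalent query in the BigQuery console and compare the ranges.
```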
@reisepass thanks for reporting this issue. Are you using Wrappers on Supabase or building directly from source? The latest code should fix this issue, but it hasn't been deployed on Supabase yet.
Thanks for the quick reply. I am using it inside Supabase. I suppose I have no option to run `cargo pgx run` inside Supabase, so I'll have to wait then.
Supabase has upgraded to Wrappers v0.1.9, which should fix this issue. Would you be able to upgrade your project and try again?
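For anyone else landing here, a minimal sketch of how the installed extension version can be checked and updated from SQL (assuming the extension is named `wrappers` and the new version has already been rolled out to the project; the exact upgrade path on Supabase may instead go through the dashboard):

```sql
-- Check the installed and available versions of the wrappers extension.
select name, default_version, installed_version
from pg_available_extensions
where name = 'wrappers';

-- Update to the latest available version.
alter extension wrappers update;
```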
This issue has been fixed in version v0.10.0.
@burmecia would you mind pointing to the PR or commit that fixed this issue? It would be great to learn what the specific cause was.
Sorry to bother you.