tylertreat/BigQuery-Python

Using insertId and still get lots of duplicated entries for data upload

ckreutz opened this issue · 3 comments

Hello!

Great tool! Everything works like a charm except the data upload, where I get lots of duplicated rows:

inserted = client.push_rows('dataset', 'table', rows, 'id')

mine is:

inserted = client.push_rows('twitter', 'tweets', rows, 'id')

My rows are a list of dictionaries, where one key is called id and holds a unique number (integer). Still, some rows show up as many as ten times in the test table.
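
For reference, my data is structured roughly like this (field names other than id are placeholders, not my real schema):

```python
# Roughly the shape of my data: each dict carries a unique integer 'id'.
# Field names besides 'id' are placeholders.
rows = [
    {'id': 1001, 'text': 'first tweet', 'created_at': '2015-06-01 12:00:00'},
    {'id': 1002, 'text': 'second tweet', 'created_at': '2015-06-01 12:00:05'},
]

# Passing 'id' as the last argument tells push_rows to use that field's
# value as the insertId for each streamed row.
inserted = client.push_rows('twitter', 'tweets', rows, 'id')
```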

Any hint? Thanks in advance!

I did some further tests and still get duplicates. Can anyone confirm the issue?

It's not clear to me why this would be happening since you're specifying an insert id. Have you tried uploading rows using the BigQuery API directly or using another client?
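
If it helps, here is a rough sketch of calling the streaming API directly with google-api-python-client, which is what this library wraps (project, dataset, and table ids are placeholders based on your example; untested):

```python
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

# Build a raw BigQuery v2 service client using application default credentials.
credentials = GoogleCredentials.get_application_default()
service = discovery.build('bigquery', 'v2', credentials=credentials)

# tabledata().insertAll() is the streaming endpoint that push_rows wraps;
# each row carries its own insertId for best-effort deduplication.
# `rows` is your list of dicts from above.
body = {
    'rows': [
        {'insertId': str(row['id']), 'json': row}
        for row in rows
    ]
}
response = service.tabledata().insertAll(
    projectId='my-project',  # placeholder -- your GCP project id
    datasetId='twitter',
    tableId='tweets',
    body=body,
).execute()

# Any per-row failures come back under 'insertErrors'.
print(response.get('insertErrors'))
```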

I think I found the explanation in the Google BigQuery docs:

> To help ensure data consistency, you can supply insertId for each inserted row. BigQuery remembers this ID for at least one minute.

I had assumed the ID was cross-checked against the whole table, not just remembered for one minute. So the duplicates are no surprise. Thanks for the reply anyway!
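
For anyone landing here later: since insertId deduplication is only best-effort over a short window, one workaround is to dedupe at query time instead. A minimal sketch, keeping one row per id with ROW_NUMBER() (the table path is a placeholder; untested):

```python
# Query-time dedup: keep a single row for each id value.
# `my-project.twitter.tweets` is a placeholder table path.
DEDUP_QUERY = """
SELECT * EXCEPT(row_num)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id) AS row_num
  FROM `my-project.twitter.tweets`
)
WHERE row_num = 1
"""
```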