bulk_insert with uuid as pk: the returned rows don't match?
jiamo opened this issue · 2 comments
With code like this:
insert_rows = Story.objects.on_conflict(['url'], ConflictAction.NOTHING).bulk_insert(news_list)
actions = [
    {
        "_op_type": "create",
        "_id": row["uid"],
        "_index": "news_story",
        "_source": row
    } for row in insert_rows
]
bulk(get_connection(), actions)
I think the result should be the same in the db and in Elasticsearch.
But I got a very strange problem:
The db picture shows that the two records are in the same transaction.
It seems the returned uid doesn't match the uid inserted in the db.
I think there is a bug in bulk_insert.
I then added a second check:
insert_rows = Story.objects.on_conflict(['url'], ConflictAction.NOTHING).bulk_insert(news_list)
return_row_dict = {row["url"]: row["uid"] for row in insert_rows}
q = Story.objects.values(
    'uid', 'url'
).filter(url__in=return_row_dict.keys())
query_again = {row["url"]: row["uid"] for row in q}
if return_row_dict != query_again:
    for url, uid in return_row_dict.items():
        if str(query_again[url]) != str(uid):
            print("return url {} return uid {} query again uid {}".format(url, uid, query_again[url]))
            sys.exit()
This printed:
return url https://www.engadget.com/xbox-game-pass-pc-rebrand-035131086.html return uid dd2446bb-acf1-44cd-86bf-5e2dcea6a01a query again uid 5d6cef64-62ef-4d79-b77d-449ca72277c5
So the value returned by the insert doesn't match the value actually stored in the db.
The returned uid dd2446bb-acf1-44cd-86bf-5e2dcea6a01a is a new uid, as if for a new url.
The queried uid 5d6cef64-62ef-4d79-b77d-449ca72277c5 is the old uid already in the db for the same url.
It seems bulk_insert correctly skips inserting the conflicting row (I have "story_url_unique" UNIQUE CONSTRAINT, btree (url)).
But it returns the wrong uid, as if for a new url.
Put simply:
ON CONFLICT DO NOTHING should not return rows that were not inserted. But right now such a row comes back with a wrong pk, as if it had been inserted.
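To make the expected behaviour concrete, here is a toy in-memory model of "insert, skip on conflict, return only what was inserted". It is not psqlextra or PostgreSQL code; the function name `insert_do_nothing` and the dict-based `table` are hypothetical and only illustrate the semantics described above.

```python
import uuid


def insert_do_nothing(table, rows):
    """Toy model of ON CONFLICT (url) DO NOTHING with a returning clause.

    `table` maps url -> stored row (hypothetical stand-in for the db table).
    Rows whose url already exists are skipped and NOT returned.
    """
    inserted = []
    for row in rows:
        if row["url"] in table:
            # Conflict on url: leave the stored row (and its uid) untouched.
            continue
        stored = {**row, "uid": uuid.uuid4()}
        table[row["url"]] = stored
        inserted.append(stored)
    return inserted
```

With one existing url and one new url, only the new row should come back, and the existing row keeps its old uid.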
So if I want to use the returned data, I must query again for the pk, like so:
actions = []
for row in insert_rows:
    row["uid"] = query_again[row["url"]]
And query_again has to query all of news_list, so I lose the benefit of insert_rows (which should return only the inserted values).
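A variation on this workaround, sketched under assumptions: instead of overwriting the uid, drop any returned row whose uid does not match what the db holds, so only rows that were really inserted go on to Elasticsearch. The helper name `keep_actually_inserted` is hypothetical; `uid_by_url` is meant to be the `query_again` mapping built above.

```python
def keep_actually_inserted(returned_rows, uid_by_url):
    """Keep only returned rows whose uid matches the uid stored in the db.

    `uid_by_url` maps url -> uid as re-queried from the db (e.g. query_again).
    Comparison is done on str() so UUID objects and strings compare equal.
    """
    return [
        row for row in returned_rows
        if row["url"] in uid_by_url
        and str(uid_by_url[row["url"]]) == str(row["uid"])
    ]
```

It could be applied as `insert_rows = keep_actually_inserted(insert_rows, query_again)` before building `actions`. It still needs the extra query, so it does not remove the cost described above.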
I tried my best to reproduce this with the limited sample code you provided, but I can't. Since this has been open for a while, I am going to close this.
If you can still reproduce this, please provide a reproducible sample and I can take another look.