SectorLabs/django-postgres-extra

bulk_insert with UUID as pk: the returned values don't match?

jiamo opened this issue · 2 comments

jiamo commented

With a method like this:

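        # Insert the stories, skipping rows whose url already exists.
        insert_rows = Story.objects.on_conflict(['url'], ConflictAction.NOTHING).bulk_insert(news_list)
        # Build one Elasticsearch "create" action per returned row, keyed by its uid.
        actions = [
            {
                "_op_type": "create",
                "_id": row["uid"],
                "_index": "news_story",
                "_source": row
            } for row in insert_rows
        ]
        # Index the returned rows into Elasticsearch.
        bulk(get_connection(), actions)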

I think the result should be the same in the db and in Elasticsearch.
But I ran into a very strange problem:

[Screenshot: Screen Shot 2021-12-10 at 6 53 22 PM]

And the db here:

[Screenshot: Screen Shot 2021-12-10 at 6 53 16 PM]

The db screenshot shows that the two records were written in the same transaction.
It seems that the uid returned by bulk_insert doesn't match the uid actually inserted in the db.

jiamo commented

I think there is a bug in bulk_insert.

After adding a second check:

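        insert_rows = Story.objects.on_conflict(['url'], ConflictAction.NOTHING).bulk_insert(news_list)

        # Map url -> uid as returned by bulk_insert.
        return_row_dict = {row["url"]: row["uid"] for row in insert_rows}
        # Re-read the same urls from the database and build the same mapping.
        q = Story.objects.values(
            'uid', 'url'
        ).filter(url__in=return_row_dict.keys())
        query_again = {row["url"]: row["uid"] for row in q}
        if return_row_dict != query_again:
            for url, uid in return_row_dict.items():
                # Compare as strings so UUID objects and str values compare equal.
                if str(query_again[url]) != str(uid):
                    print("return url {} return uid {} query again uid {}".format(url, uid, query_again[url]))

            # Stop as soon as a batch contains a mismatch (requires `import sys`).
            sys.exit()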

I got:

        return url https://www.engadget.com/xbox-game-pass-pc-rebrand-035131086.html return uid dd2446bb-acf1-44cd-86bf-5e2dcea6a01a query again uid 5d6cef64-62ef-4d79-b77d-449ca72277c5
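For anyone who wants to run the same comparison without the Django models, here is a minimal sketch of the check as a pure function. The name `find_uid_mismatches` and its arguments are hypothetical and not part of django-postgres-extra. It assumes the returned rows are dicts with `url` and `uid` keys, and that the url -> uid mapping was read back from the database separately.

```python
from typing import Any, Iterable, List, Mapping, Tuple


def find_uid_mismatches(
    returned_rows: Iterable[Mapping[str, Any]],
    db_uid_by_url: Mapping[str, Any],
) -> List[Tuple[str, str, str]]:
    """Return (url, returned_uid, db_uid) for every row whose uid differs from the db."""
    mismatches = []
    for row in returned_rows:
        url = row["url"]
        # .get() yields None when the url is missing from the db mapping.
        db_uid = db_uid_by_url.get(url)
        # Compare as strings so uuid.UUID objects and plain strings compare equal.
        if str(db_uid) != str(row["uid"]):
            mismatches.append((url, str(row["uid"]), str(db_uid)))
    return mismatches


# Values taken from the output reported above.
url = "https://www.engadget.com/xbox-game-pass-pc-rebrand-035131086.html"
returned = [{"url": url, "uid": "dd2446bb-acf1-44cd-86bf-5e2dcea6a01a"}]
in_db = {url: "5d6cef64-62ef-4d79-b77d-449ca72277c5"}
mismatches = find_uid_mismatches(returned, in_db)
```

Unlike the snippet above, this version reports a url that is missing from the db instead of raising `KeyError`.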

So the values returned by the insert don't match the values that were actually stored in the db.

The returned uid dd2446bb-acf1-44cd-86bf-5e2dcea6a01a is a new uid, generated for the new row with that url.
The re-queried uid 5d6cef64-62ef-4d79-b77d-449ca72277c5 is the old uid that was already in the db for the same url.

It seems bulk_insert correctly does not insert the conflicting data (I have ["story_url_unique" UNIQUE CONSTRAINT, btree (url) ]),
but it returns the wrong uid for the new url.

Put simply:
ON CONFLICT DO NOTHING should not return rows that were not inserted. But right now bulk_insert returns them, with a pk that does not match the row in the db.
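To make the expected contract concrete, here is a toy in-memory model in plain Python. It is not PostgreSQL and not django-postgres-extra code; the function name and the dict-based "table" are made up for illustration. It mirrors what the PostgreSQL documentation says about `INSERT ... ON CONFLICT DO NOTHING ... RETURNING`: only rows that were actually inserted are returned.

```python
from typing import Any, Dict, List


def insert_on_conflict_do_nothing(
    table: Dict[str, Dict[str, Any]],
    rows: List[Dict[str, Any]],
    conflict_key: str = "url",
) -> List[Dict[str, Any]]:
    """Insert rows into `table` (keyed by `conflict_key`), skipping conflicts.

    Returns only the rows that were actually inserted.
    """
    inserted = []
    for row in rows:
        key = row[conflict_key]
        if key in table:
            # Conflict: keep the existing row and return nothing for it.
            continue
        table[key] = dict(row)
        inserted.append(table[key])
    return inserted


# "a" already exists with an old uid; "b" is new.
table = {"a": {"url": "a", "uid": "old-uid"}}
result = insert_on_conflict_do_nothing(
    table,
    [{"url": "a", "uid": "new-uid"}, {"url": "b", "uid": "uid-b"}],
)
```

Under this contract `result` holds only the row for "b", and the stored uid for "a" is unchanged. The behavior reported in this issue is that the conflicting row comes back as well, carrying its new uid.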
So if I want to use the data returned by the insert, I must query again for the pk, such as:

        actions = []
        for row in insert_rows:
            # Overwrite the returned uid with the uid actually stored in the db.
            row["uid"] = query_again[row["url"]]

And query_again must query the whole news_list, so I lose the benefit of insert_rows (which should return only the inserted values).
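A minimal sketch of that workaround as two pure helpers. Both names are hypothetical. The second helper rests on an assumption that is not confirmed anywhere in this thread: that each incoming row gets a freshly generated uid, so a returned uid equal to the db uid means the row was really inserted.

```python
from typing import Any, Dict, Iterable, List, Mapping


def patch_uids_from_db(
    returned_rows: Iterable[Mapping[str, Any]],
    db_uid_by_url: Mapping[str, Any],
) -> List[Dict[str, Any]]:
    """Return copies of the rows with `uid` replaced by the db's uid for the same url."""
    patched_rows = []
    for row in returned_rows:
        patched = dict(row)  # copy, so the caller's rows are left untouched
        # Fall back to the returned uid if the url is missing from the mapping.
        patched["uid"] = db_uid_by_url.get(row["url"], row["uid"])
        patched_rows.append(patched)
    return patched_rows


def only_newly_inserted(
    returned_rows: Iterable[Mapping[str, Any]],
    db_uid_by_url: Mapping[str, Any],
) -> List[Dict[str, Any]]:
    """Keep only rows whose returned uid equals the uid stored in the db.

    Assumption: uids are generated fresh per incoming row, so a match
    means this row is the one that was actually written.
    """
    return [
        dict(row)
        for row in returned_rows
        if str(db_uid_by_url.get(row["url"])) == str(row["uid"])
    ]


rows = [{"url": "a", "uid": "new-a"}, {"url": "b", "uid": "uid-b"}]
db_uid_by_url = {"a": "old-a", "b": "uid-b"}
patched = patch_uids_from_db(rows, db_uid_by_url)
fresh = only_newly_inserted(rows, db_uid_by_url)
```

`patched` could feed the Elasticsearch actions when every row should be indexed under its real pk, while `fresh` approximates "only the inserted values". Either way the extra query is still needed.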

I tried my best to reproduce this with the limited sample code you provided, but I can't. Since this has been open for a while, I am going to close this.

If you can still reproduce this, please provide a reproducible sample and I can take another look.