Text Search returns documents in different order when called multiple times

Question

gowthamvbhat opened this issue a year ago · 6 comments

I have a user collection with a text index of the format,

{
    v: 1,
    key: { _fts: 'text', _ftsx: 1 },
    name: 'address.addressLine2_text_address.addressLine1_text_profile.email_text',
    ns: 'db.user',
    weights: {
      'profile.email': 1,
      'address.addressLine1': 1,
      'address.addressLine2': 1
    },
    default_language: 'english',
    language_override: 'language',
    textIndexVersion: 3
}

Calling search_text() on this index returns the correct set of documents, but their order changes every time the method is called.
Ex: User.objects.search_text("foo")

Running a raw text search query with the same parameters returns the results in the same order every time.
Ex: User.objects(__raw__={'$text': {'$search': "foo"}})

The internal query run by both of these methods seems to be the same.

Why is the behaviour inconsistent in the case of search_text()?

This is particularly surprising when the number of returned documents is limited, because the two queries can then return completely different sets of documents.

Answer 1 · 2023-12-15T11:41:05.000Z

I tried to reproduce it quickly but couldn't. Could you provide a minimal reproducible snippet? At first glance, the one difference I can see between the two queries is that one of them uses SON.

Answer 2 · 2024-01-08T19:53:24.000Z

What should be the structure of the reproducible snippet?

I went through the Git history and found this commit fixed a order in command which explicitly adds the SON. Maybe this could help?

P.S. I am using mongoengine 0.24.2, pymongo 3.12.1, and MongoDB Atlas version 4.4.26.

Answer 3 · 2024-01-08T20:09:16.000Z

A reproducible snippet is a minimal code snippet that demonstrates the issue (minimal model definition, no Flask, etc.), more or less like a unit test. I couldn't reproduce your issue so far, so it's not clear whether the cast to SON is the cause.

A reproducible snippet looks something like this:

from mongoengine import BooleanField, Document, StringField, connect

connect()    # assuming you have a running local mongo on default port (without authentication)

class News(Document):
    title = StringField()
    content = StringField()
    is_active = BooleanField(default=True)

    meta = {
        "indexes": [
            {
                "fields": ["$title", "$content"],
                "default_language": "portuguese",
                "weights": {"title": 10, "content": 2},
            }
        ]
    }

result1 = list(News.objects.search_text('brasil'))
result2 = list(News.objects(__raw__={'$text': {'$search': "brasil"}}))

assert result1 == result2    # whatever you want to prove and is reproducible

If the issue is not always reproducible, wrap it in a for loop with enough iterations to make it fail.
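A rough sketch of such a loop, written as a small helper so it can wrap any query. The helper name first_order_mismatch and its parameters are made up for illustration; they are not part of MongoEngine or PyMongo.

```python
from typing import Callable, Hashable, List, Optional


def first_order_mismatch(
    run_query: Callable[[], List[Hashable]],
    iterations: int = 50,
) -> Optional[int]:
    """Run the query repeatedly and compare each result to the first run.

    Returns the index of the first run whose ordering differs from the
    baseline, or None if every run returned the same ordering.
    """
    baseline = list(run_query())
    for i in range(1, iterations):
        if list(run_query()) != baseline:
            return i
    return None


# Hypothetical usage against the News model above (needs a running MongoDB):
# mismatch = first_order_mismatch(
#     lambda: [doc.id for doc in News.objects.search_text("brasil")]
# )
# assert mismatch is None, f"ordering changed on run {mismatch}"
```

Comparing lists of ids rather than whole documents keeps the check focused on ordering only.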

Answer 4 · 2024-01-09T19:09:34.000Z

Thanks for the detailed explanation. I have created a snippet along with a sample json dataset.

Please run populate_initial_data() once, to load the sample data into your db.

Demo Files - issue_2759_demo.zip

Answer 5 · 2024-01-15T22:36:36.000Z

Alright, it turned out to be more complicated than expected to get to the actual root cause, but this is due to the last if in the following part of MongoEngine:

    @property
    def _cursor_args(self):
        fields_name = "projection"
        # snapshot is not handled at all by PyMongo 3+
        # TODO: evaluate similar possibilities using modifiers
        if self._snapshot:
            msg = "The snapshot option is not anymore available with PyMongo 3+"
            warnings.warn(msg, DeprecationWarning)

        cursor_args = {}
        if not self._timeout:
            cursor_args["no_cursor_timeout"] = True

        if self._allow_disk_use:
            cursor_args["allow_disk_use"] = True

        if self._loaded_fields:
            cursor_args[fields_name] = self._loaded_fields.as_dict()

        if self._search_text:   # <-- here
            if fields_name not in cursor_args:
                cursor_args[fields_name] = {}

            cursor_args[fields_name]["_text_score"] = {"$meta": "textScore"}

        return cursor_args

It makes MongoEngine add the following projection to the pymongo command:
('projection', {'_text_score': {'$meta': 'textScore'}})

So it ends up running something like this:

user_col = User._get_collection()
list(user_col.find({'$text': SON([('$search', '\"102\"')])}, {"score": {"$meta": "textScore"}, "_id": 0}))

and that pymongo query doesn't return results in a reproducible order.
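For anyone who needs a stable order regardless of the projection: MongoDB does not guarantee any result order unless the query has a sort, and $text does not sort by score on its own. Below is a minimal sketch of a query spec with an explicit sort on the text score plus _id as a tie-breaker. The helper stable_text_query is hypothetical, and I have not verified it against the dataset in this issue.

```python
from typing import Any, Dict, List, Tuple


def stable_text_query(term: str) -> Tuple[Dict[str, Any], Dict[str, Any], List[Tuple[str, Any]]]:
    """Build (filter, projection, sort) for a text search with a total order.

    Sorting on the text score alone can still leave ties, so _id is added
    as a second sort key to make the ordering unambiguous.
    """
    flt = {"$text": {"$search": term}}
    projection = {"score": {"$meta": "textScore"}}
    sort = [("score", {"$meta": "textScore"}), ("_id", 1)]
    return flt, projection, sort


# Hypothetical usage with the pymongo collection from above:
# flt, projection, sort = stable_text_query("foo")
# docs = list(user_col.find(flt, projection).sort(sort).limit(10))
```

With a total order in place, a limit should cut the result at the same point on every call.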

This whole text_score thing seems to be connected with this method on Document:

    def get_text_score(self):
        if "_text_score" not in self._data:
            raise InvalidDocumentError(
                "This document is not originally built from a text query"
            )
        return self._data["_text_score"]

So, long story short, using search_text makes MongoEngine add this annoying projection to the query out of the box...

I'm opening a PR that adds a text_score argument to .search_text() so you can turn that behavior off. text_score will remain True by default to avoid introducing a breaking change.
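To illustrate the idea only (this is not the code from the PR), here is a standalone sketch of how a text_score flag could gate the projection. The function build_cursor_args and its parameters are hypothetical stand-ins for the relevant part of _cursor_args.

```python
from typing import Any, Dict, Optional


def build_cursor_args(
    loaded_fields: Optional[Dict[str, Any]] = None,
    search_text: Optional[str] = None,
    text_score: bool = True,
) -> Dict[str, Any]:
    """Simplified stand-in for the projection part of _cursor_args.

    The textScore projection is added only when a text search is active
    AND text_score is left enabled (the default, so existing behaviour
    would be unchanged).
    """
    cursor_args: Dict[str, Any] = {}
    if loaded_fields:
        cursor_args["projection"] = dict(loaded_fields)
    if search_text and text_score:
        cursor_args.setdefault("projection", {})
        cursor_args["projection"]["_text_score"] = {"$meta": "textScore"}
    return cursor_args


# Default: projection is added, so get_text_score() keeps working.
with_score = build_cursor_args(search_text="foo")
# Opt-out: no projection is added; get_text_score() would then raise.
without_score = build_cursor_args(search_text="foo", text_score=False)
```

The trade-off of opting out is that documents no longer carry _text_score, so get_text_score() would raise InvalidDocumentError for them.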

Answer 6 · 2024-01-20T16:47:32.000Z

Thanks for taking the time to debug and fix this issue. The PR looks good!