MongoEngine/mongoengine

Text Search returns documents in different order when called multiple times

gowthamvbhat opened this issue · 6 comments

I have a user collection with a text index of the format,

{
    v: 1,
    key: { _fts: 'text', _ftsx: 1 },
    name: 'address.addressLine2_text_address.addressLine1_text_profile.email_text',
    ns: 'db.user',
    weights: {
      'profile.email': 1,
      'address.addressLine1': 1,
      'address.addressLine2': 1
    },
    default_language: 'english',
    language_override: 'language',
    textIndexVersion: 3
}

Using search_text() on this index returns the correct set of docs, but the ordering changes every time the method is called.
Ex: User.objects.search_text("foo")

Running a raw text search query with the same parameters returns the result in the same order every time.
Ex: User.objects(__raw__={'$text': {'$search': "foo"}})

The internal query run by both of these methods seems to be the same.

Why is the behaviour inconsistent in case of search_text()?

This is particularly surprising when the number of returned docs is limited, because the two queries can then return completely different sets of docs.

I tried to quickly reproduce it but couldn't. Could you provide a minimal reproducible snippet? At first glance I can see a difference between the 2 queries, in the sense that one is using SON.

What should be the structure of the reproducible snippet?

I went through the Git history and found this commit fixed a order in command which explicitly adds the SON. Maybe this could help?

P.S. I am using mongoengine 0.24.2, pymongo 3.12.1 and MongoDB Atlas version 4.4.26.

A reproducible snippet is a minimal code snippet that demonstrates the issue (minimal model definition, no Flask, etc.), more or less like a unit test. I couldn't reproduce your issue so far, so it's not clear whether the casting to SON is the issue.

A reproducible snippet is something like

from mongoengine import *

connect()    # assuming you have a running local mongo on default port (without authentication)

class News(Document):
    title = StringField()
    content = StringField()
    is_active = BooleanField(default=True)

    meta = {
        "indexes": [
            {
                "fields": ["$title", "$content"],
                "default_language": "portuguese",
                "weights": {"title": 10, "content": 2},
            }
        ]
    }

result1 = list(News.objects.search_text('brasil'))
result2 = list(News.objects(__raw__={'$text': {'$search': "brasil"}}))

assert result1 == result2    # whatever you want to prove and is reproducible

If the issue is not always reproducible, wrap it in a for loop with enough iterations to make it fail.
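As a rough sketch of such a loop: the helper below is made up for illustration (first_order_mismatch is not part of MongoEngine). It takes any zero-argument callable that runs the query and returns its results, and reports the first iteration on which the order differs from the first run.

```python
from typing import Any, Callable, Optional, Sequence


def first_order_mismatch(
    run_query: Callable[[], Sequence[Any]], iterations: int = 100
) -> Optional[int]:
    """Return the first iteration whose results differ from the baseline.

    The baseline is the result of the very first call. Returns None if
    every one of the `iterations` repeated calls matched it exactly
    (same items, same order).
    """
    baseline = list(run_query())
    for i in range(1, iterations + 1):
        if list(run_query()) != baseline:
            return i
    return None


# Usage against the News model above (needs a running MongoDB):
# mismatch = first_order_mismatch(
#     lambda: [n.id for n in News.objects.search_text("brasil")]
# )
# assert mismatch is None, f"order changed on iteration {mismatch}"
```

Comparing lists of ids rather than whole documents keeps the failure message short and makes it obvious whether only the order changed.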

Thanks for the detailed explanation. I have created a snippet along with a sample json dataset.

Please run populate_initial_data() once, to load the sample data into your db.

Demo Files - issue_2759_demo.zip

Alright, it turned out to be more complicated than expected to get to the actual root cause, but this is due to the last if statement in the following part of MongoEngine:

    @property
    def _cursor_args(self):
        fields_name = "projection"
        # snapshot is not handled at all by PyMongo 3+
        # TODO: evaluate similar possibilities using modifiers
        if self._snapshot:
            msg = "The snapshot option is not anymore available with PyMongo 3+"
            warnings.warn(msg, DeprecationWarning)

        cursor_args = {}
        if not self._timeout:
            cursor_args["no_cursor_timeout"] = True

        if self._allow_disk_use:
            cursor_args["allow_disk_use"] = True

        if self._loaded_fields:
            cursor_args[fields_name] = self._loaded_fields.as_dict()

        if self._search_text:   # <-- here
            if fields_name not in cursor_args:
                cursor_args[fields_name] = {}

            cursor_args[fields_name]["_text_score"] = {"$meta": "textScore"}

        return cursor_args

That makes MongoEngine add the following projection to the pymongo command:
('projection', {'_text_score': {'$meta': 'textScore'}})

Thus it's running something along these lines:

user_col = User._get_collection()
list(user_col.find({'$text': SON([('$search', '\"102\"')])}, {"score": {"$meta": "textScore"}, "_id": 0}))

and that pymongo query doesn't provide reproducible results.

This whole text_score thing seems to be connected with this method on Document:

    def get_text_score(self):
        if "_text_score" not in self._data:
            raise InvalidDocumentError(
                "This document is not originally built from a text query"
            )
        return self._data["_text_score"]

So long story short, using search_text makes MongoEngine add this annoying projection to the query out of the box...

I'm opening a PR and will add a text_score argument to .search_text() so you can turn that behavior off. text_score will remain True by default to avoid introducing a breaking change.
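To illustrate the idea, here is a standalone sketch of how such a flag could gate the projection. This is not the code from the PR: build_cursor_args and its parameters are hypothetical stand-ins for the relevant part of _cursor_args shown above, written as a plain function so it runs without a database.

```python
from typing import Any, Dict, Optional


def build_cursor_args(
    loaded_fields: Optional[Dict[str, Any]] = None,
    search_text: Optional[str] = None,
    text_score: bool = True,
) -> Dict[str, Any]:
    """Hypothetical, simplified version of the projection logic.

    The textScore projection is only added when a text search is active
    AND text_score is left enabled (the default, to stay backward
    compatible).
    """
    cursor_args: Dict[str, Any] = {}

    if loaded_fields:
        # Copy so the caller's dict is not mutated below.
        cursor_args["projection"] = dict(loaded_fields)

    if search_text and text_score:
        projection = cursor_args.setdefault("projection", {})
        projection["_text_score"] = {"$meta": "textScore"}

    return cursor_args


print(build_cursor_args(search_text="foo"))
print(build_cursor_args(search_text="foo", text_score=False))
```

With the flag off, no projection is sent at all, so the query should look like the raw $text query from the original report. The trade-off, assuming the flag works this way, is that get_text_score() would raise InvalidDocumentError on documents loaded like that, since _text_score is never populated.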

Thanks for spending your time debugging and fixing this issue. The PR looks good!