Text Search returns documents in different order when called multiple times
gowthamvbhat opened this issue · 6 comments
I have a user collection with a text index of the format,
{
    v: 1,
    key: { _fts: 'text', _ftsx: 1 },
    name: 'address.addressLine2_text_address.addressLine1_text_profile.email_text',
    ns: 'db.user',
    weights: {
        'profile.email': 1,
        'address.addressLine1': 1,
        'address.addressLine2': 1
    },
    default_language: 'english',
    language_override: 'language',
    textIndexVersion: 3
}
Calling search_text() on this index returns the correct set of docs, but the ordering changes every time the method is called.
Ex: User.objects.search_text("foo")
Running a raw text search query with the same parameters returns the results in the same order every time.
Ex: User.objects(__raw__={'$text': {'$search': "foo"}})
The internal query run by both of these methods seems to be the same.
Why is the behaviour inconsistent in the case of search_text()?
This is particularly surprising when the number of returned docs is limited, because the two queries can then return completely different sets of docs.
What should be the structure of the reproducible snippet?
I went through the Git history and found this commit, which fixed an ordering problem in a command by explicitly adding the SON. Maybe this could help?
P.S. I am using mongoengine 0.24.2, pymongo 3.12.1 and MongoDB Atlas version 4.4.26.
A reproducible snippet is a minimal code snippet that demonstrates the issue (minimal model definition, no Flask, etc.), more or less like a unit test. I couldn't reproduce your issue so far, so it's not clear whether the casting to SON is the issue.
Something like:
from mongoengine import *

connect()  # assuming you have a running local mongo on default port (without authentication)

class News(Document):
    title = StringField()
    content = StringField()
    is_active = BooleanField(default=True)
    meta = {
        "indexes": [
            {
                "fields": ["$title", "$content"],
                "default_language": "portuguese",
                "weights": {"title": 10, "content": 2},
            }
        ]
    }

result1 = list(News.objects.search_text('brasil'))
result2 = list(News.objects(__raw__={'$text': {'$search': "brasil"}}))
assert result1 == result2  # whatever you want to prove and is reproducible
If the issue is not always reproducible, wrap it in a for loop with enough iterations to make it fail.
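As a rough sketch of such a loop: the helper below (first_order_mismatch is a made-up name, not part of MongoEngine) takes any callable that runs the query and reports the first iteration whose result order differs from the first run. The default of 50 iterations is an arbitrary assumption.

```python
def first_order_mismatch(run_query, iterations=50):
    """Return the index of the first run whose results differ from run 0.

    `run_query` is any zero-argument callable returning an iterable of
    comparable items (for example document ids). Returns None when every
    run produced the same ordered results.
    """
    baseline = list(run_query())
    for i in range(1, iterations):
        if list(run_query()) != baseline:
            return i
    return None


# Hypothetical usage with the News model above (needs a running MongoDB):
# mismatch = first_order_mismatch(
#     lambda: [doc.id for doc in News.objects.search_text("brasil")]
# )
# assert mismatch is None, f"ordering changed on iteration {mismatch}"
```

Comparing ids rather than whole documents keeps the check cheap and makes a failure easier to read.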
Thanks for the detailed explanation. I have created a snippet along with a sample json dataset.
Please run populate_initial_data() once to load the sample data into your db.
Demo Files - issue_2759_demo.zip
Alright, it turned out to be more complicated than expected to get to the actual root cause, but this is due to the last if in the following part of MongoEngine:
@property
def _cursor_args(self):
    fields_name = "projection"
    # snapshot is not handled at all by PyMongo 3+
    # TODO: evaluate similar possibilities using modifiers
    if self._snapshot:
        msg = "The snapshot option is not anymore available with PyMongo 3+"
        warnings.warn(msg, DeprecationWarning)

    cursor_args = {}
    if not self._timeout:
        cursor_args["no_cursor_timeout"] = True

    if self._allow_disk_use:
        cursor_args["allow_disk_use"] = True

    if self._loaded_fields:
        cursor_args[fields_name] = self._loaded_fields.as_dict()

    if self._search_text:  # <-- here
        if fields_name not in cursor_args:
            cursor_args[fields_name] = {}

        cursor_args[fields_name]["_text_score"] = {"$meta": "textScore"}

    return cursor_args
This adds the following projection to the pymongo command:
('projection', {'_text_score': {'$meta': 'textScore'}})
So it ends up running something along these lines:
user_col = User._get_collection()
list(user_col.find({'$text': SON([('$search', '\"102\"')])}, {"score": {"$meta": "textScore"}, "_id": 0}))
and that pymongo query doesn't return results in a reproducible order.
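If a stable order is needed regardless of the projection, one option is to sort explicitly instead of relying on the order the server happens to return. The sketch below is only an illustration: find_text_sorted is a made-up helper, and it assumes that sorting by text score with _id as a tie-breaker is an acceptable order for your use case. It uses only the find, sort and limit calls of a pymongo collection and cursor.

```python
def find_text_sorted(collection, search, limit=0):
    """Run a $text query with an explicit, fully determined sort order.

    `collection` is expected to behave like a pymongo Collection.
    """
    cursor = collection.find(
        {"$text": {"$search": search}},
        {"score": {"$meta": "textScore"}},
    )
    # Sort by relevance first, then by _id so that documents with equal
    # scores always come back in the same relative order.
    cursor = cursor.sort([("score", {"$meta": "textScore"}), ("_id", 1)])
    if limit:
        cursor = cursor.limit(limit)
    return list(cursor)


# Hypothetical usage (needs a running MongoDB and the text index above):
# docs = find_text_sorted(User._get_collection(), "foo", limit=10)
```

With a limit applied, the explicit sort also means that repeated calls return the same subset of docs, not just the same order.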
This whole text_score thing seems to be connected with this method on Document
def get_text_score(self):
    if "_text_score" not in self._data:
        raise InvalidDocumentError(
            "This document is not originally built from a text query"
        )

    return self._data["_text_score"]
So long story short, using search_text() makes MongoEngine add this annoying projection to the query out of the box...
I'm opening a PR that adds a text_score argument to .search_text() so you can turn that behavior off. text_score will remain True by default to avoid introducing a breaking change.
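To illustrate the idea (this is a simplified standalone sketch, not the code from the PR), here is how a text_score flag could gate the projection. build_cursor_args is a hypothetical function that mirrors only the projection part of _cursor_args shown earlier.

```python
def build_cursor_args(loaded_fields=None, search_text=None, text_score=True):
    """Simplified stand-in for the projection logic of _cursor_args.

    The textScore projection is only added when a text search is active
    AND text_score is True, so passing text_score=False leaves it out.
    """
    cursor_args = {}
    if loaded_fields:
        cursor_args["projection"] = dict(loaded_fields)

    if search_text and text_score:
        projection = cursor_args.setdefault("projection", {})
        projection["_text_score"] = {"$meta": "textScore"}

    return cursor_args


# Default behaviour is unchanged: the score projection is still added.
print(build_cursor_args(search_text="foo"))
# Opting out drops the projection entirely.
print(build_cursor_args(search_text="foo", text_score=False))
```

Note that with the projection turned off, get_text_score() on the returned documents would raise, since _text_score is never loaded.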
Thanks for spending your time debugging and fixing this issue. The PR looks good!