`startswith` does not honour the `FULL_TEXT` behaviour

Question


slorg1 opened this issue 8 years ago · 9 comments

Hi,

I am sorry to bug you again so soon ;-)

I am testing a column that has FULL_TEXT set as its keygen.

The symptom I get is that if I search for something like "a", it finds the word "apple" nicely.
However, if I search for "a a" or "an a", it does not find "an apple".

From what I can tell, the issue is that the FULL_TEXT function which produces a list of "tokens" is not used by Query._check (L66). The .lower() emulates / duplicates some of that.

I do not know how hard it would be, but it seems to me that Query._check should run the rom.util.* functions (FULL_TEXT, SIMPLE_CI, CASE_INSENSITIVE or IDENTITY_CI). That would centralise the code (no duplication) and make it easier to add a new keygen. Query._check would then either return the list of tokens or yield them one at a time. Finally, startswith, endswith, etc. could append one pattern per token generated by Query._check instead of one pattern per input value.

This is only a suggestion to fix the issue that I found. I am not sure that it is the only way or even the right way. You would know that better than I would.

Thank you in advance.

Answer 1 · 2016-12-06T06:31:38.000Z

The only reason .lower() is performed is because I believe that if you're using a case-insensitive keygen, you're probably going to be wanting case-insensitive search, because providing a cased argument to a case-insensitive field is otherwise nonsensical. That's the only strange case here.

With respect to (...keygen=FULL_TEXT, prefix=True) + .startswith(col='an a'), that's definitely not going to work the way you want, for the reasons you state (words are tokenized as part of the indexing process, but not as part of querying).
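
A minimal sketch of that mismatch, in plain Python with no Redis or rom involved. The `tokenize` helper is a hypothetical stand-in that only approximates what a full-text keygen does (lowercase, split on whitespace, strip punctuation); it is an assumption for illustration, not rom's implementation.

```python
import string

def tokenize(text):
    """Hypothetical stand-in for a full-text keygen (assumption, not rom's code)."""
    words = (w.lower().strip(string.punctuation) for w in text.split())
    return sorted({w for w in words if w})

def any_token_startswith(text, prefix):
    """True if any indexed token of `text` starts with the raw, untokenized prefix."""
    return any(tok.startswith(prefix.lower()) for tok in tokenize(text))

# What would end up in the index for this value.
indexed = tokenize("an apple")

# A single-word prefix can match an individual token.
hit_single = any_token_startswith("an apple", "a")

# A prefix containing a space cannot match, because no token contains a space.
hit_phrase = any_token_startswith("an apple", "an a")
```

The indexed side only ever holds single words, so an untokenized query such as "an a" has nothing to match against.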

But no, I won't be adding auto-tokenization to Query._check, for three reasons. It would almost immediately invalidate countless lines of existing code that don't expect tokenization. It wouldn't be forward compatible with new-style keygens (which receive the full entity data before generating a key for the index). And using the keygen there creates a really odd case where .filter(col=['foo', 'bar']) != .filter(col='foo bar') (the short version is that col=[...] is an OR/UNION query, but using the keygen as part of the automatic processing of the passed-in string data is an AND/INTERSECTION query).
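
The OR-versus-AND inconsistency can be sketched with plain Python sets. The token-to-id index below is made-up illustration data, and both helpers are hypothetical; they only mimic the semantics described above and are not rom's query engine.

```python
# Hypothetical inverted index: token -> ids of entities containing that token.
index = {
    "foo": {1, 2},
    "bar": {2, 3},
}

def filter_list(tokens):
    """Mimics col=[...]: an OR/UNION over the listed values."""
    result = set()
    for tok in tokens:
        result |= index.get(tok, set())
    return result

def filter_tokenized(text):
    """Mimics auto-tokenizing a string: an AND/INTERSECTION over its tokens."""
    sets = [index.get(tok, set()) for tok in text.lower().split()]
    return set.intersection(*sets) if sets else set()

union_ids = filter_list(["foo", "bar"])      # ids containing either token
intersect_ids = filter_tokenized("foo bar")  # ids containing both tokens
```

With this data the list form returns ids 1, 2 and 3, while the tokenized string returns only id 2, so the two spellings of "the same" query would disagree.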

Answer 2 · 2016-12-06T14:52:23.000Z

@josiahcarlson Thank you very much for the feedback.

Could you please suggest a way to make the intersection in Redis using your query language? Or is it not supported at all?

Thank you in advance.

Answer 3 · 2016-12-06T15:59:12.000Z

query.filter().filter().startswith().startswith() is a series of intersections of filters and startswith operators. But it is not clear to me what you have and what you want to search for, given what you have provided above. I don't know if you are looking for phrase searching, phrase + prefix searching on words, or just prefix searching + intersection. If you could describe the data you have and the types of queries you would like to perform over it, I can answer how you can do that with rom.

Answer 4 · 2016-12-06T21:14:15.000Z

Hi,

Sure, to take the "apple" example this is what happens to me.

Dataset:
"an apple"
"a banana"
"a kiwi"

If my user searches for "a k", he/she will find nothing.

The real dataset is first and last names, but the situation is exactly the same.
As soon as one space is inserted: the search fails.

Thank you.

Answer 5 · 2016-12-07T06:29:28.000Z

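import rom

# MyModel is assumed to be a rom.Model whose `col` column is declared with
# keygen=rom.FULL_TEXT and prefix=True.
unparsed = 'a k'
q = MyModel.query
# Tokenize the raw input with the same keygen used for indexing, then chain
# one startswith() per token; chained calls are intersected (AND).
for pfix in rom.FULL_TEXT(unparsed):
    q = q.startswith(col=pfix)
count = q.count()
results = q.all()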
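
To see what this loop does on the example dataset, here is a pure-Python simulation of the same idea: one prefix match per query token, with the matches intersected. Nothing here talks to Redis; `tokenize` and `search` are hypothetical helpers that approximate the behaviour and are not rom's code.

```python
import string

DATASET = ["an apple", "a banana", "a kiwi"]

def tokenize(text):
    """Rough approximation of a full-text keygen (assumption, not rom's code)."""
    words = (w.lower().strip(string.punctuation) for w in text.split())
    return {w for w in words if w}

def search(query, rows=DATASET):
    """Keep rows where every query token is a prefix of at least one row token."""
    prefixes = tokenize(query)
    matches = []
    for row in rows:
        row_tokens = tokenize(row)
        if all(any(t.startswith(p) for t in row_tokens) for p in prefixes):
            matches.append(row)
    return matches

results_a_k = search("a k")    # every token of "a k" must prefix some word
results_an_a = search("an a")
results_k_a = search("k a")    # word order is ignored by this approach
```

Note that this is prefix matching plus intersection, not phrase search: "k a" matches the same row as "a k", and an empty query matches every row.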

Answer 6 · 2016-12-09T18:59:14.000Z

Hi @josiahcarlson .

Thank you for getting back to me and sorry for the delay in my response.

OK, so that is what I thought startswith would do on my behalf. I may not have expressed it clearly, but that is what I meant to suggest with this ticket.

To confirm: you do not think that these methods should do that automatically. Am I getting this correctly?

Thank you.

Answer 7 · 2016-12-12T06:20:36.000Z

I am choosing to not make these methods do as you request, no, for the reasons I've already described above in the paragraph that begins with "But no".

Answer 8 · 2016-12-12T12:37:02.000Z

Hi @josiahcarlson ,

Right, I had understood your reasoning with the paragraph that begins with "But no". It seemed like your answer was bound to Query._check. Following up on your suggestion for the startwith method. I was wondering if I had misunderstood and using/implementing your suggestion is startwith (and not Query._check) should do that.

That would address your issue and (I think) make startwith work more as expected.

Thank you.

Answer 9 · 2016-12-13T06:20:12.000Z

I don't understand what you want or what you are trying to say. Your paragraph above is completely incomprehensible.

I'm not changing the library for the reasons I have already provided (inconsistent behavior between arguments/types/methods). Further, not a single database or library I have ever used and respected parses queries as you describe. Not to say that such databases/libraries don't exist, I'm just saying that I don't respect them. And if I don't respect them, I sure as hell am not going to emulate them. I will not be responding to this thread again.