jurismarches/luqum

Allowing double quotes

stevesmit opened this issue · 13 comments

Thank you for the wonderful library.

I have queries that have some field expressions with double quotes. That is, something like the following:

field_name:""expression text""

When these are parsed, they confuse the parser: it treats the opening pair of double quotes as an empty Phrase, and the rest of the expression ends up inside an UnknownOperation.

Here is a sample query and the parsing operation in python:

from luqum.parser import parser
query = 'field_name:""Field Text"" OR field_name:text AND field_name:"more text"'
parser.parse(query)

Here is the current output:

UnknownOperation(SearchField('field_name', Phrase('""')), Word('Field'), OrOperation(Word('Text""'), AndOperation(SearchField('field_name', Word('text')), SearchField('field_name', Phrase('"more text"')))))

This is what the expected output of the parsing operation would look like:

OrOperation(SearchField('field_name', Phrase('""Field Text""')), AndOperation(SearchField('field_name', Word('text')), SearchField('field_name', Phrase('"more text"'))))

Any thoughts on this? Doubled quotes ("") have a meaning distinct from single quotes (") in this case, which is why I am asking.

Hi @stevesmit, where does this doubled double quote come from?
Is it normally supported by Lucene? As far as I know, it is not. See https://lucene.apache.org/core/2_9_4/queryparsersyntax.html

From what I understand, the parser is right here; it's the expression you should fix.

It comes from a company's proprietary query syntax which is very Lucene-like, I suppose. Fair enough - any thoughts on editing the grammar specification to allow parsing such expressions? I don't mind (and would have to!) editing a bit of source on my side.

You'll have to make your own parser.py file (PLY is not very flexible on that point, see issue 49).

You can change PHRASE_RE, but I imagine you have to verify that there are as many " at the beginning as at the end. So you may be better off duplicating most of it! Adding a DPHRASE_RE that copies PHRASE_RE but with doubled quotes, and a t_DPHRASE modeled on t_PHRASE, is probably the best way to go.

Alright, I tried that, but unfortunately it still doesn't parse (same output as before). I added the following two pieces of code to parser.py, as you suggested:

DPHRASE_RE = r'''
(?P<phrase>  # phrase
  ""         # opening double quotes
  (?:        # repeating
    [^\\"]   # - a char which is not escape or end of phrase
    |        # OR
    \\.      # - an escaped char
  )*
  ""         # closing double quotes
)'''

@lex.TOKEN(DPHRASE_RE)
def t_DPHRASE(t):
    m = re.match(DPHRASE_RE, t.value, re.VERBOSE)
    value = m.group("phrase")
    t.value = Phrase(value)
    return t

Is there any other code that needs to be edited to take account of this change?

Maybe add DPHRASE to precedence, before PHRASE?

Also add DPHRASE to tokens.

Alright I did that, and I got the following notice when loading the library:

WARNING: Token 'DPHRASE' defined, but not used
WARNING: Token 'SEPARATOR' defined, but not used
WARNING: There are 2 unused tokens
Generating LALR tables
WARNING: 11 shift/reduce conflicts

Output is still the same as before :/

Yes sorry, you have to write a rule:

def p_double_quoting(p):
    'unary_expression : DPHRASE'
    p[0] = p[1]

Alright did that, now got this warning when importing the library:

WARNING: Token 'SEPARATOR' defined, but not used
WARNING: There are 1 unused tokens
Generating LALR tables
WARNING: 11 shift/reduce conflicts

And still get the same output when parsing. Is there anywhere I need to point to this rule that I've added?

Yes, sorry. You should also try things yourself ;-) Just mimic what's done for PHRASE and report back here when you're done!

So yes you have to add it there:

def p_phrase_or_term(p):
    '''phrase_or_term : TERM
                      | PHRASE
                      | DPHRASE'''
    p[0] = p[1]

Also you may want to add it to p_proximity:

def p_proximity(p):
    '''unary_expression : PHRASE APPROX
                        | DPHRASE APPROX'''
    p[0] = Proximity(p[1], p[2])

Unfortunately still getting the exact same output as before after trying that.
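One possible explanation, sketched with plain `re` rather than PLY: PLY tries function-defined token rules in the order they appear in the file, so if t_DPHRASE sits below t_PHRASE, the stock phrase pattern gets first shot at the input and happily matches the leading "" as an empty phrase. The PHRASE_RE below is an assumption (the DPHRASE_RE shape with single quotes), and `first_token` is a hypothetical helper that only imitates first-match-wins ordering; it is not part of luqum or PLY.

```python
import re

# Assumed shape of the stock PHRASE_RE (not copied from luqum's source).
PHRASE_RE = r'''
"            # opening quote
(?:          # repeating
  [^\\"]     # - a char which is not escape or end of phrase
  |          # OR
  \\.        # - an escaped char
)*
"            # closing quote
'''

# Same pattern with doubled delimiters, as in the DPHRASE attempt above.
DPHRASE_RE = r'''
""           # opening double quotes
(?:          # repeating
  [^\\"]     # - a char which is not escape or end of phrase
  |          # OR
  \\.        # - an escaped char
)*
""           # closing double quotes
'''


def first_token(rules, text):
    """Hypothetical helper: try rules in order, return (name, lexeme) of the first hit."""
    master = re.compile(
        "|".join("(?P<%s>%s)" % (name, pattern) for name, pattern in rules),
        re.VERBOSE,
    )
    m = master.match(text)
    return (m.lastgroup, m.group()) if m else None


text = '""Field Text""'
phrase_first = first_token([("PHRASE", PHRASE_RE), ("DPHRASE", DPHRASE_RE)], text)
dphrase_first = first_token([("DPHRASE", DPHRASE_RE), ("PHRASE", PHRASE_RE)], text)
print(phrase_first)
print(dphrase_first)
```

With PHRASE tried first, the lexeme is just the two opening quotes, which would be consistent with the Phrase('""') seen in the output; with DPHRASE tried first, the whole doubled-quote expression is consumed.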

Maybe you have to change PHRASE_RE so that it does not match "" alone? Or at least so that it does not match "" when it is followed by some char.

So maybe

PHRASE_RE = r'''
(?P<phrase>  # phrase
  "          # opening quote
  (?:        # repeating
    [^\\"]   # - a char which is not escape or end of phrase
    |        # OR
    \\.      # - an escaped char
  )+
  "          # closing quote
  |          # or
  ""(?!\w)   # empty quote but no char after
)'''
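A quick way to check what this pattern accepts, using only the standard `re` module outside of PLY. This is a minimal sketch: `lexeme` is a hypothetical helper, and the three inputs are picked just to cover an ordinary phrase, an empty phrase, and the doubled-quote case.

```python
import re

PHRASE_RE = r'''
(?P<phrase>  # phrase
  "          # opening quote
  (?:        # repeating
    [^\\"]   # - a char which is not escape or end of phrase
    |        # OR
    \\.      # - an escaped char
  )+
  "          # closing quote
  |          # or
  ""(?!\w)   # empty quote but no char after
)'''

phrase = re.compile(PHRASE_RE, re.VERBOSE)


def lexeme(text):
    """Hypothetical helper: return what PHRASE_RE matches at the start of text, or None."""
    m = phrase.match(text)
    return m.group("phrase") if m else None


ordinary = lexeme('"more text"')    # a regular phrase still matches in full
empty = lexeme('"" AND foo')        # an empty phrase followed by a space still matches
doubled = lexeme('""Field Text""')  # rejected, so another rule can claim it
print(ordinary, empty, doubled)
```

Since the pattern no longer matches at the start of ""Field Text"", the lexer is free to fall through to the DPHRASE rule for that input.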

@alexgarel That seems to have done the trick! Thanks very much.