Allowing double quotes

Question

stevesmit opened this issue 4 years ago · 13 comments

Thank you for the wonderful library.

I have queries that have some field expressions with double quotes. That is, something like the following:

field_name:""expression text""

When these are parsed, the parser gets confused: it treats the opening pair of double quotes as an empty Phrase and wraps the rest of the expression in an UnknownOperation.

Here is a sample query and the parsing operation in python:

from luqum.parser import parser

# The first field value is wrapped in doubled double quotes.
query = 'field_name:""Field Text"" OR field_name:text AND field_name:"more text"'
print(repr(parser.parse(query)))

Here is the current output:

UnknownOperation(SearchField('field_name', Phrase('""')), Word('Field'), OrOperation(Word('Text""'), AndOperation(SearchField('field_name', Word('text')), SearchField('field_name', Phrase('"more text"')))))

This is what the expected output of the parsing operation would look like:

OrOperation(SearchField('field_name', Phrase('""Field Text""')), AndOperation(SearchField('field_name', Word('text')), SearchField('field_name', Phrase('"more text"'))))

Any thoughts on this? The doubled quotes have a distinct meaning from the single ones in this case, which is why I am asking.

Answer 1 · 2020-06-23T16:30:54.000Z

Hi @stevesmit, where do these doubled double quotes come from?
Are they normally supported by Lucene? As far as I know, they are not. See https://lucene.apache.org/core/2_9_4/queryparsersyntax.html

As far as I understand, the parser is right here; it's the expression you should fix.

Answer 2 · 2020-06-23T16:33:02.000Z

It comes from a company's proprietary query syntax, which is very Lucene-like. Fair enough. Any thoughts on editing the grammar specification to allow parsing such expressions? I don't mind editing a bit of source on my side (and would have to!).

Answer 3 · 2020-06-23T16:44:56.000Z

You'll have to make your own parser.py file (PLY is not very flexible on that point, see issue 49).

You can change PHRASE_RE, but I imagine you would have to verify that there are as many " at the beginning as at the end, so you are probably better off duplicating most of it. Adding a DPHRASE_RE that copies PHRASE_RE but with doubled quotes, plus a t_DPHRASE modeled on t_PHRASE, is probably the best way to go.

Answer 4 · 2020-06-23T17:03:34.000Z

Alright, I tried that, but unfortunately it still doesn't parse (same output as before). I added the following two pieces of code to parser.py, as you suggested:

DPHRASE_RE = r'''
(?P<phrase>  # phrase
  ""         # opening double quotes
  (?:        # repeating
    [^\\"]   # - a char which is not escape or end of phrase
    |        # OR
    \\.      # - an escaped char
  )*
  ""         # closing double quotes
)'''

@lex.TOKEN(DPHRASE_RE)
def t_DPHRASE(t):
    m = re.match(DPHRASE_RE, t.value, re.VERBOSE)
    value = m.group("phrase")
    t.value = Phrase(value)
    return t
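
As a quick sanity check, the pattern can be tried on its own, outside PLY, with only the standard `re` module. This is a minimal sketch: `match_at_start` is a hypothetical helper, not part of luqum. It only shows that the regex by itself matches the whole doubled-quote phrase, which suggests the problem lies in how the lexer picks between token rules rather than in the pattern.

```python
import re

DPHRASE_RE = r'''
(?P<phrase>  # phrase
  ""         # opening double quotes
  (?:        # repeating
    [^\\"]   # - a char which is not escape or end of phrase
    |        # OR
    \\.      # - an escaped char
  )*
  ""         # closing double quotes
)'''


def match_at_start(pattern, text):
    """Return what `pattern` matches at the start of `text`, or None.

    Hypothetical helper for illustration only; not part of luqum or PLY.
    """
    m = re.match(pattern, text, re.VERBOSE)
    return m.group("phrase") if m else None


query = 'field_name:""Field Text"" OR field_name:text'
start = query.index('"')  # position of the first quote
dphrase = match_at_start(DPHRASE_RE, query[start:])
print(dphrase)  # ""Field Text""
```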

Is there any other code that needs to be edited to take account of this change?

Answer 5 · 2020-06-23T17:31:09.000Z

Maybe add DPHRASE to precedence, before PHRASE?

Answer 6 · 2020-06-23T17:31:33.000Z

Also add DPHRASE in tokens.

Answer 7 · 2020-06-23T17:52:31.000Z

Alright I did that, and I got the following notice when loading the library:

WARNING: Token 'DPHRASE' defined, but not used
WARNING: Token 'SEPARATOR' defined, but not used
WARNING: There are 2 unused tokens
Generating LALR tables
WARNING: 11 shift/reduce conflicts

Output is still the same as before :/

Answer 8 · 2020-06-23T20:50:14.000Z

Yes sorry, you have to write a rule:

def p_double_quoting(p):
    'unary_expression : DPHRASE'
    p[0] = p[1]

Answer 9 · 2020-06-24T13:56:38.000Z

Alright, I did that, and now I get these warnings when importing the library:

WARNING: Token 'SEPARATOR' defined, but not used
WARNING: There are 1 unused tokens
Generating LALR tables
WARNING: 11 shift/reduce conflicts

And still get the same output when parsing. Is there anywhere I need to point to this rule that I've added?

Answer 10 · 2020-06-24T14:24:10.000Z

Yes, sorry. You should also try things yourself ;-) Just mimic what's done for PHRASE and report back here when you're done!

So yes, you have to add it here:

def p_phrase_or_term(p):
    '''phrase_or_term : TERM
                      | PHRASE
                      | DPHRASE'''
    p[0] = p[1]

You may also want to add it to p_proximity:

def p_proximity(p):
    '''unary_expression : PHRASE APPROX
                        | DPHRASE APPROX'''
    p[0] = Proximity(p[1], p[2])

Answer 11 · 2020-06-24T14:45:30.000Z

Unfortunately still getting the exact same output as before after trying that.

Answer 12 · 2020-06-24T15:55:23.000Z

Maybe you have to change PHRASE_RE so that it does not match "" alone? Or at least not "" followed by some char.

So maybe:

PHRASE_RE = r'''
(?P<phrase>  # phrase
  "          # opening quote
  (?:        # repeating
    [^\\"]   # - a char which is not escape or end of phrase
    |        # OR
    \\.      # - an escaped char
  )+
  "          # closing quote
  |          # OR
  ""(?!\w)   # - an empty phrase not followed by a word char
)'''
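
To see why this change matters, the two phrase patterns can be compared with plain `re`, outside PLY. This is a sketch under assumptions: `OLD_PHRASE_RE` is inferred from this thread (DPHRASE_RE was described as a copy of PHRASE_RE with doubled quotes), not copied from the luqum source, and `phrase_at_start` is a hypothetical helper. With the assumed old pattern, the empty phrase "" matches at the start of ""Field Text"", which is consistent with the Phrase('""') in the reported output; with the revised pattern, nothing matches there, so a DPHRASE rule gets a chance.

```python
import re

# Assumed original pattern (inferred from the thread, not copied from luqum).
OLD_PHRASE_RE = r'''
(?P<phrase>  # phrase
  "          # opening quote
  (?:        # repeating
    [^\\"]   # - a char which is not escape or end of phrase
    |        # OR
    \\.      # - an escaped char
  )*
  "          # closing quote
)'''

# Revised pattern from the answer above.
NEW_PHRASE_RE = r'''
(?P<phrase>  # phrase
  "          # opening quote
  (?:        # repeating
    [^\\"]   # - a char which is not escape or end of phrase
    |        # OR
    \\.      # - an escaped char
  )+
  "          # closing quote
  |          # OR
  ""(?!\w)   # - an empty phrase not followed by a word char
)'''


def phrase_at_start(pattern, text):
    """Return what `pattern` matches at the start of `text`, or None.

    Hypothetical helper for illustration only; not part of luqum or PLY.
    """
    m = re.match(pattern, text, re.VERBOSE)
    return m.group("phrase") if m else None


sample = '""Field Text""'
old_hit = phrase_at_start(OLD_PHRASE_RE, sample)  # the empty phrase ""
new_hit = phrase_at_start(NEW_PHRASE_RE, sample)  # None: no PHRASE here

print(repr(old_hit), repr(new_hit))
print(phrase_at_start(NEW_PHRASE_RE, '"more text"'))  # ordinary phrases still match
```

Ordinary phrases and a bare "" that is not followed by a word character are still matched by the revised pattern.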

Answer 13 · 2020-06-24T16:11:47.000Z

@alexgarel That seems to have done the trick! Thanks very much.