Allowing double quotes
stevesmit opened this issue · 13 comments
Thank you for the wonderful library.
I have queries that have some field expressions with double quotes. That is, something like the following:
field_name:""expression text""
When these are parsed, they confuse the parser: the opening pair of double quotes is read as an empty Phrase, and the rest of the expression ends up in an UnknownOperation.
Here is a sample query and the parsing operation in python:
from luqum.parser import parser
query = 'field_name:""Field Text"" OR field_name:text AND field_name:"more text"'
parser.parse(query)
Here is the current output:
UnknownOperation(SearchField('field_name', Phrase('""')), Word('Field'), OrOperation(Word('Text""'), AndOperation(SearchField('field_name', Word('text')), SearchField('field_name', Phrase('"more text"')))))
This is what the expected output of the parsing operation would look like:
OrOperation(SearchField('field_name', Phrase('""Field Text""')), AndOperation(SearchField('field_name', Word('text')), SearchField('field_name', Phrase('"more text"'))))
Any thoughts on this? The doubled quotes have a distinct meaning from single quotes in this case, which is why I am asking.
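The mis-tokenization described above can be reproduced with plain re, outside luqum. This is a minimal sketch, assuming the phrase pattern allows zero or more characters between a single pair of quotes; ASSUMED_PHRASE_RE and first_phrase are illustrative names, not luqum's source.

```python
import re

# Assumed shape of a phrase pattern: one quote, any body, one quote.
# Illustration only; this is not copied from luqum.
ASSUMED_PHRASE_RE = r'''
"              # opening quote
(?:            # repeating
    [^\\"]     # - a char which is not escape or end of phrase
  |            # OR
    \\.        # - an escaped char
)*
"              # closing quote
'''


def first_phrase(text):
    """Return the phrase matched at the start of text, or None."""
    m = re.match(ASSUMED_PHRASE_RE, text, re.VERBOSE)
    return m.group(0) if m else None


# The first two quotes already form a complete (empty) phrase,
# so the match stops before "Field".
print(first_phrase('""Field Text""'))
print(first_phrase('"more text"'))
```

Under that assumption, the leading "" is consumed as a token of its own, which is consistent with the Phrase('""') in the output above.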
Hi @stevesmit, where does this doubled double quote come from?
Is it normally supported by Lucene? As far as I know, it is not. See https://lucene.apache.org/core/2_9_4/queryparsersyntax.html
From what I understand, the parser is right here; it's the expression you should fix.
It comes from a company's proprietary query syntax which is very Lucene-like, I suppose. Fair enough - any thoughts on editing the grammar specification to allow parsing such expressions? I don't mind (and would have to!) editing a bit of source on my side.
You'll have to make your own parser.py file (PLY is not very flexible on that point, see issue 49).
You can change PHRASE_RE, but I imagine you would have to verify that there are as many " at the beginning as at the end. So you may be better off duplicating most of it! Adding a DPHRASE_RE that copies PHRASE_RE but with doubled quotes, and a t_DPHRASE modeled on t_PHRASE, is probably the best way to go.
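For reference, the "as many quotes at the beginning as at the end" check mentioned above can be expressed in a single pattern with a named backreference. This is only a sketch of that idea using the standard re module; BALANCED_PHRASE_RE and match_balanced are hypothetical names, and this is not the approach the thread goes on to use.

```python
import re

# Hypothetical pattern: the closing run of quotes must be identical
# to the opening run, enforced with a named backreference.
BALANCED_PHRASE_RE = re.compile(r'''
(?P<quotes>"+)     # one or more opening quotes
(?:                # repeating
    [^\\"]         # - a char which is not escape or quote
  |                # OR
    \\.            # - an escaped char
)*
(?P=quotes)        # the same run of quotes again
''', re.VERBOSE)


def match_balanced(text):
    """Return the balanced phrase at the start of text, or None."""
    m = BALANCED_PHRASE_RE.match(text)
    return m.group(0) if m else None


print(match_balanced('""Field Text""'))  # whole doubled-quote phrase
print(match_balanced('"more text"'))     # ordinary phrase
print(match_balanced('""Field Text"'))   # unbalanced: only "" matches
```

Note that an unbalanced input still falls back to the empty phrase "", because the regex engine backtracks to a single opening quote.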
Alright, I tried that, but unfortunately it doesn't parse it (still giving the same output as before). I added the following two pieces of code to parser.py as you mentioned:
DPHRASE_RE = r'''
(?P<phrase>      # phrase
  ""             # opening double quotes
  (?:            # repeating
    [^\\"]       # - a char which is not escape or end of phrase
    |            # OR
    \\.          # - an escaped char
  )*
  ""             # closing double quotes
)'''
@lex.TOKEN(DPHRASE_RE)
def t_DPHRASE(t):
    m = re.match(DPHRASE_RE, t.value, re.VERBOSE)
    value = m.group("phrase")
    t.value = Phrase(value)
    return t
Is there any other code that needs to be edited to take account of this change?
Maybe add DPHRASE in precedence, before PHRASE?
Also add DPHRASE in tokens.
Alright I did that, and I got the following notice when loading the library:
WARNING: Token 'DPHRASE' defined, but not used
WARNING: Token 'SEPARATOR' defined, but not used
WARNING: There are 2 unused tokens
Generating LALR tables
WARNING: 11 shift/reduce conflicts
Output is still the same as before :/
Yes sorry, you have to write a rule:
def p_double_quoting(p):
    'unary_expression : DPHRASE'
    p[0] = p[1]
Alright did that, now got this warning when importing the library:
WARNING: Token 'SEPARATOR' defined, but not used
WARNING: There are 1 unused tokens
Generating LALR tables
WARNING: 11 shift/reduce conflicts
And still get the same output when parsing. Is there anywhere I need to point to this rule that I've added?
Yes sorry, you should also try yourself ;-) just mimic what's done for PHRASE and report here when you're done !
So yes you have to add it there:
def p_phrase_or_term(p):
    '''phrase_or_term : TERM
                      | PHRASE
                      | DPHRASE'''
    p[0] = p[1]
Also you may want to add it to p_proximity:
def p_proximity(p):
    '''unary_expression : PHRASE APPROX
                        | DPHRASE APPROX'''
    p[0] = Proximity(p[1], p[2])
Unfortunately still getting the exact same output as before after trying that.
Maybe you have to change PHRASE_RE so that it does not match "" alone? Or at least not "" followed by some char.
So maybe:
PHRASE_RE = r'''
(?P<phrase> # phrase
" # opening quote
(?: # repeating
[^\\"] # - a char which is not escape or end of phrase
| # OR
\\. # - an escaped char
)+
" # closing quote
| # or
""(?!\w) # empty quote but no char after
)'''
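The effect of the revised PHRASE_RE can be checked with plain re. This is a minimal sketch, assuming the lexer tries the PHRASE rule before the DPHRASE rule and takes the first one that matches; RULES and first_token are illustrative helpers, not part of luqum or PLY, and the patterns are restated without the named group.

```python
import re

# Revised phrase pattern from the suggestion above: a non-empty phrase,
# or an empty "" that is not directly followed by a word character.
PHRASE = re.compile(r'''
"(?:[^\\"]|\\.)+"     # non-empty phrase
| ""(?!\w)            # empty phrase, no word char after
''', re.VERBOSE)

# Doubled-quote phrase, as in the DPHRASE_RE posted earlier.
DPHRASE = re.compile(r'''
""(?:[^\\"]|\\.)*""
''', re.VERBOSE)

# Assumption: rules are tried in this order and the first match wins.
RULES = [("PHRASE", PHRASE), ("DPHRASE", DPHRASE)]


def first_token(text):
    """Return (rule name, matched text) for the first rule that matches."""
    for name, pattern in RULES:
        m = pattern.match(text)
        if m:
            return name, m.group(0)
    return None


print(first_token('""Field Text""'))  # PHRASE rejects it, DPHRASE takes it
print(first_token('"more text"'))     # ordinary phrase is unchanged
print(first_token('"" OR x'))         # a true empty phrase still matches
```

With the original pattern, PHRASE would have claimed the leading "" before DPHRASE was ever tried; requiring at least one character, or an empty pair with no word character after it, leaves that input to DPHRASE.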
@alexgarel That seems to have done the trick! Thanks very much.