coady/lupyne

Combining Querys with BooleanQuerys

Closed this issue · 7 comments

Hi @coady, thanks for all your hard work on lupyne, it's been super helpful for me! I used your Dockerfile as a basis for compiling JCC & PyLucene to wheel files in my own non-Docker environment, and I've now been able to run some of the examples, set up my own 14 GB corpus, index it to a directory, and do some basic searches based on the examples you provided in the docs.

Right now I'm trying to write a slightly more complex query, but was having some trouble and hoping you might be able to point me in the right direction.

I have a fairly simple index with 4 stored fields: a text field containing the article text, a text field containing the name of the company (the list of company names is finite and each document is associated with exactly one company), a datetime field containing the date the article was published, and an article id.

I'm trying to write a query that does the following: find all documents that contain the phrase "lupyne is great", fall within some arbitrary date range, and have a company_name field value of 'company a', 'company b', or 'company c'.

I've tried the following:

import lucene
from lupyne import engine
from datetime import date

assert lucene.getVMEnv() or lucene.initVM()

index_path: str = r'myindexdir'

query_str: str = 'lupyne is great'
start_date: date = date(year=2020, month=2, day=14)
companies: list[str] = ['company a', 'company b', 'company c']

indexer = engine.Indexer(index_path, mode='r', nrt=True)

indexer.set('article_id', stored=True)
indexer.set('company_name', stored=True)
indexer.set('date', engine.DateTimeField, stored=True)
indexer.set('text', engine.Field.Text, stored=True)

query_engine = engine.Query

# The following works with the query string 'lupyne'
query_str: str = 'lupyne'
query = indexer.fields['date'].range(start_date, None) & query_engine.term('text', query_str)

# This does not work with the query string 'lupyne is great':
query_str: str = 'lupyne is great'
query = indexer.fields['date'].range(start_date, None) & query_engine.phrase('text', query_str)
# TypeError: unsupported operand type(s) for &: 'Query' and 'MultiPhraseQuery'

# This also does not work
range_query = query_engine.range('date', date_field.timestamp(start_date), None)
# java.lang.IncompatibleClassChangeError
#        at org.apache.lucene.util.BytesRef.<init>(BytesRef.java:84)

# This will also break
range_query = query_engine.range('date', start_date, None)
# lucene.InvalidArgsError: (<class 'org.apache.lucene.util.BytesRef'>, '__init__', (datetime.date(2021, 2, 2),))

Any suggestions on how I might go about this? Thanks again for all the hard work!

EDIT: So, it looks like this might be because Query.ranges() doesn't return a lupyne Query object as seen here, but instead directly returns a pylucene query object. Any good way to get around this?

coady commented

> EDIT: So, it looks like this might be because Query.ranges() doesn't return a lupyne Query object as seen here, but instead directly returns a pylucene query object. Any good way to get around this?

That's right. Unfortunately some of the query types have static constructors, so there's no way to subclass them. But there is a workaround: classmethods Query.any and Query.all work with any query type.

Another observation: phrase queries don't parse their input automatically, so each term has to be passed as a separate argument. It should be Query.phrase('text', 'lupyne', 'is', 'great').

coady commented

Better documented in 164f99c.

Thank you, this is super helpful and makes a lot more sense! In case anyone stumbles upon this and has a similar question, here is an example of how you would apply all(), assuming the index I described above.

query_str: str = 'lupyne is great'
query = query_engine.all(indexer.fields['date'].range(start_date, None), query_engine.phrase('text', *query_str.split(' ')))

I've got one last question if you don't mind. The other field I'm trying to filter on is 'company_name', which honestly I thought would be the simplest, but is actually giving me some trouble.

During indexing, I've tried it a couple of ways: leaving it as the default field type, and explicitly setting the field type to engine.Field.Text. Either way, I'm unable to do a prefix search as you demonstrate in the examples here, unless I set it as a NestedField and use a separator that doesn't occur in any of the company names. That's fine, but it seems a bit awkward and makes me think I'm just doing something wrong. Also, the searches I construct as if the field were a regular text field don't seem to work: Term, Terms, and Phrase all return nothing when the field is set as the default or engine.Field.Text. Is there something I'm missing? Here are some examples I've tried that don't work, and the one that does seem to.

# We know there exists a result for this query, but none are returned
query = query_engine.all(indexer.fields['date'].range(start_date, None), query_engine.prefix('company_name', 'company a'))

query = query_engine.all(indexer.fields['date'].range(start_date, None), query_engine.phrase('company_name', *'company a'.split(' ')))

# If I create the field as follows things seem to work as expected.
indexer.set('company_name', engine.NestedField, sep='#', stored=True)

Thanks again for your time!

coady commented

Lucene doesn't really have a default field type (i.e. its default does nothing). A Text field means it will be tokenized and indexed as words, whereas String will index the whole string as is.

It's hard to say from this example, but I don't think you want NestedField. That's for hierarchies.

If it's a String field (untokenized), the query prefix('company_name', 'company a') would work, because the whole value "company a", space included, is indexed as a single term.

If it's a Text field (tokenized), the query phrase('company_name', 'company', 'a') would work.
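To make the Text versus String distinction concrete, here is a toy model in plain Python. It is only a sketch and does not use Lucene at all: the helper names (index_terms, prefix_matches, phrase_matches) are made up for illustration, and lowercasing plus whitespace splitting is an assumed stand-in for whatever analyzer the real index uses.

```python
def index_terms(value: str, tokenized: bool) -> list[str]:
    """Toy model of the terms an index would hold for one field value.

    tokenized=True imitates a Text field (assumed here: lowercase, split on whitespace);
    tokenized=False imitates a String field, which keeps the value as a single term.
    """
    return value.lower().split() if tokenized else [value]


def prefix_matches(terms: list[str], prefix: str) -> bool:
    """A prefix query matches when any single indexed term starts with the prefix."""
    return any(term.startswith(prefix) for term in terms)


def phrase_matches(terms: list[str], *words: str) -> bool:
    """A phrase query matches when the words appear as consecutive terms."""
    n = len(words)
    return any(tuple(terms[i:i + n]) == words for i in range(len(terms) - n + 1))


text_terms = index_terms('company a', tokenized=True)     # ['company', 'a']
string_terms = index_terms('company a', tokenized=False)  # ['company a']

print(prefix_matches(text_terms, 'company a'))     # False: no single term contains the space
print(prefix_matches(string_terms, 'company a'))   # True: the whole value is one term
print(phrase_matches(text_terms, 'company', 'a'))  # True: two consecutive terms
```

In this model the prefix 'company a' can never match a tokenized field, since no individual term contains a space, which is consistent with the empty results for the prefix query above.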

> It's hard to say from this example, but I don't think you want NestedField. That's for hierarchies.

Yes, that's exactly what I thought. But setting the field as Text didn't seem to work, even when I attempted to search it just as I would my primary document text field, so I was wondering if I was doing something wrong. I didn't realize String was even an option though, so I think that's probably the right solution for me. Even if it's not, I've got it all working now with the NestedField workaround.
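For anyone following along, here is a rough sketch of how the full query from the original question might be assembled from the pieces discussed in this thread. It has not been run against a real index. It assumes company_name was indexed as an untokenized String field (for example indexer.set('company_name', engine.Field.String, stored=True)), that the Lucene VM is already initialized, and that the phrase words match the terms the analyzer produced for the text field. The helper names phrase_terms and build_query are made up for this example.

```python
from datetime import date


def phrase_terms(phrase: str) -> list[str]:
    """Split a phrase into the separate terms that Query.phrase expects."""
    return phrase.split()


def build_query(indexer, start_date: date, phrase: str, companies: list[str]):
    """Combine the date range, phrase, and company filters into one query.

    Assumption: company_name is an untokenized String field, so each
    company name is stored as a single exact term.
    """
    # Imported here so the helper above works without a PyLucene install.
    from lupyne.engine import Query

    date_query = indexer.fields['date'].range(start_date, None)
    phrase_query = Query.phrase('text', *phrase_terms(phrase))
    # any() builds optional clauses: a document may match any one of the companies.
    company_query = Query.any(*(Query.term('company_name', name) for name in companies))
    # all() builds required clauses and accepts plain PyLucene query objects too.
    return Query.all(date_query, phrase_query, company_query)
```

With the setup from the first post, this would be called as build_query(indexer, start_date, 'lupyne is great', companies), and the result used wherever the earlier query objects were.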

Thanks so much for all your help and hard work! Lupyne was invaluable to me in figuring out how to get everything set up, even compilation of pylucene and JCC.

Also, for what it's worth, I was able to make JCC generate a semi-portable wheel file for PyLucene using the somewhat new, but undocumented --wheel parameter during compilation. If it's helpful, I'm happy to submit a PR with that change to the docker image repo :)

ljak commented

Hi @ZeroCool2u

I'm currently playing with PyLucene and Lupyne, and I'm very interested in your method for creating a wheel. Btw, I think adoption of Lupyne would be much greater if PyLucene were easier to install as a dependency. I will be glad to review your PR if submitted :) Thanks!

And thanks to @coady for the hard work behind Lupyne!

@ljak I've been busy, so I haven't gotten around to doing this yet, but I am happy to.

We'll have to be very careful to clarify that the portability is more limited than the wheel file name implies: it still requires a working OpenJDK 8 installation (I haven't tried with JDK 11) set up similarly to the one that existed at compilation time.

However, it seems that as long as the JDK and toolchain match up, the wheel file can still be used and you can dodge dealing with compilation.