qdrant/qdrant

filter group on with_lookup points

Opened this issue · 2 comments

Is your feature request related to a problem? Please describe.

Currently there is no way (that I know of) to filter a group query by the payload of the points linked through with_lookup.

Describe the solution you'd like

Add a separate filter that operates on the with_lookup points.
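For concreteness, one shape this could take (the collection and field names below are made up, and `lookup_filter` is the hypothetical new parameter, not an existing Qdrant option) is an extra filter next to `with_lookup` in the groups request, e.g. against `POST /collections/snippets/points/search/groups`:

```json
{
  "vector": [0.2, 0.1, 0.9, 0.7],
  "group_by": "document_id",
  "limit": 10,
  "group_size": 3,
  "with_lookup": {
    "collection": "documents",
    "with_payload": true
  },
  "lookup_filter": {
    "must": [
      { "key": "language", "match": { "value": "en" } }
    ]
  }
}
```

The intent would be that `lookup_filter` is evaluated against the payload of the looked-up `documents` points, so only groups whose linked document matches the condition are returned.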

Describe alternatives you've considered

My current workaround is to duplicate the relevant fields on the main points, which somewhat defeats the purpose of using the with_lookup feature.

Additional context

Happy to provide more info / code examples if needed.

Hey @JosuaKrause, this is a reasonable request, but it is very hard to implement in a distributed setup. You would basically need to do a distributed join, which has proven to have very poor performance. It is unlikely that we will implement exactly this option any time soon.

That is sad to hear. Besides the workaround I mentioned above (i.e., duplicating payloads), I experimented with two additional workaround strategies:

a) Retrieve groups as-is and filter the joined payload client-side afterwards. If the filtered result is shorter than the required length, repeat the groups query with a larger limit, until you either have enough results or have exhausted all points.

b) Perform the filter on the linked points first and retrieve the keys/ids used for joining, then add a filter to the group query that matches only those keys/ids.

Both approaches work okay (i.e., some queries take multiple minutes, but duplicating payload data is avoided). If the number of matches is low, a) performs badly; if the number of matches is high, b) performs badly (in both worst cases all points have to be scanned). For simple filters I can roughly estimate how many points will match, so I use that estimate to decide which strategy to apply.
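The two strategies can be sketched with a toy in-memory stand-in for the groups query (no Qdrant client involved; the collection shapes, the `lang` payload field, and the fake similarity ranking are all invented for illustration):

```python
# Toy, in-memory stand-in for the two workarounds; no Qdrant involved.
from collections import OrderedDict

# "documents" collection: id -> payload (the with_lookup side).
documents = {i: {"lang": "en" if i % 3 == 0 else "de"} for i in range(30)}

# "snippets" collection: (snippet_id, document_id, score); the score
# fakes a vector-similarity ranking.
snippets = [(s, s % 30, 1.0 - s / 1000.0) for s in range(300)]

def group_query(limit, group_size=2, allowed_docs=None):
    """Stand-in for a groups query: top `limit` groups keyed by
    document_id, in snippet-rank order, optionally restricted to a set
    of document ids (this restriction plays the role of b's id filter)."""
    groups = OrderedDict()
    for sid, doc_id, _ in sorted(snippets, key=lambda t: -t[2]):
        if allowed_docs is not None and doc_id not in allowed_docs:
            continue
        if doc_id not in groups:
            if len(groups) == limit:
                continue  # enough groups already; only fill existing ones
            groups[doc_id] = []
        if len(groups[doc_id]) < group_size:
            groups[doc_id].append(sid)
    return groups

def strategy_a(want, payload_pred):
    """Over-fetch groups, join + filter client-side, double limit on miss."""
    limit = want
    while True:
        groups = group_query(limit)
        kept = OrderedDict(
            (d, g) for d, g in groups.items() if payload_pred(documents[d])
        )
        if len(kept) >= want or limit >= len(documents):
            return OrderedDict(list(kept.items())[:want])
        limit *= 2  # doubling keeps total work linear, not quadratic

def strategy_b(want, payload_pred):
    """Pre-filter the lookup collection, then restrict the group query."""
    matching = {d for d, p in documents.items() if payload_pred(p)}
    return group_query(want, allowed_docs=matching)

english = lambda payload: payload["lang"] == "en"
assert list(strategy_a(5, english)) == [0, 3, 6, 9, 12]
assert strategy_a(5, english) == strategy_b(5, english)
```

Both strategies return the same groups here; the difference is purely in how much work each does, which is what makes the match-count estimate useful for picking between them.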

Even though those strategies are not optimal, their speed would improve significantly if they were implemented inside Qdrant:

a) could continue the query where it left off, without recomputing previous results for each subsequent request (right now I double the limit each time, which gives an O(2n), i.e. linear, total runtime instead of the quadratic runtime of increasing the limit linearly).

b) could collect the keys/ids internally and use them immediately, instead of sending all of them to the client and having the client send them all back. In a distributed setting, each node could compute and use its own keys/ids without sharing them with other nodes, provided the corresponding points of the two collections are sharded the same way.
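The cost argument behind the doubling trick in a) is easy to check with a quick simulation, under the (assumed) model that every retry re-reads `limit` points from scratch until the limit reaches the total point count n:

```python
def total_scanned(n, next_limit):
    """Total points scanned when each retry re-reads `limit` points
    from scratch, until the limit reaches n (the total point count)."""
    scanned, limit = 0, 1
    while True:
        scanned += min(limit, n)
        if limit >= n:
            return scanned
        limit = next_limit(limit)

n = 1 << 16  # 65536 points

# Doubling the limit: 1 + 2 + 4 + ... + n = 2n - 1 points scanned.
doubling = total_scanned(n, lambda limit: limit * 2)
assert doubling == 2 * n - 1

# Growing the limit by one each retry: 1 + 2 + ... + n = n(n+1)/2.
linear = total_scanned(n, lambda limit: limit + 1)
assert linear == n * (n + 1) // 2
```

So even client-side, doubling bounds the wasted rescanning at roughly one extra pass over the data, whereas a linearly growing limit degenerates to quadratic work; a server-side cursor-style continuation would remove even that factor of two.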

This could be opt-in, with huge warnings about the performance implications.

Some info about my db (not too big, but it hopefully illustrates why duplicating payloads is not ideal): document collection ~9,000 points (no vectors, but payloads); snippet collection ~210,000 points (no payloads, but vectors).