finwo/lucene-filter

Thoughts on luce-filter


Inspired by this, I've created luce-filter. One key feature that I really wanted, and one that I wasn't sure would even be in the scope of this library due to its API, was to give the user more control by providing them with a customizer callback:

filter(data, 'size:>1k', (row, { term, ...rest }) => {
  /* do something custom */
  term = term.replace('k', '000');
  /* then revert to default callback */
  return filter.default(row, { term, ...rest });
})

I'd be really keen to get your thoughts on my implementation.

In particular, I'm curious about how you're handling the operators: AND/OR/NOT.

I'm struggling with implementing these myself and I'm having a hard time wrapping my head around your implementation.

Like, this is your AND operator:

let rl = l(data) || 0,
  rr = r(data) || 0,
  rla = Math.abs(rl),
  rra = Math.abs(rr);
if (rla > rra) return rr;
if (rla < rra) return rl;
return Math.min(rl, rr);

VS. mine:

/* Process "left" side of the tree */
let leftData = filter(data, ast.left, opts.filter);
if (ast.operator === 'AND') {
  /* In "AND" case, use the result of the "left" side as an input data for the "right" side */
  data = leftData;
  return filter(data, ast.right, opts.filter);
}

My approach is obviously extremely simple and straightforward. Probably too simple, which is why I'm running into issues when handling more complex queries, even just x AND y AND NOT z. And actually I'm not entirely sure even the simpler ones are evaluated 100% correctly, since I haven't tested it that much.
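One boolean alternative to chaining result sets is to compile the query into a single per-row predicate, which makes NOT a plain negation. The following is only a minimal sketch under assumptions: the AST shape (leaves as `{ field, term }`, branches as `{ operator, left, right }`, and a unary `{ operator: 'NOT', operand }` node), the names `compile` and `filterRows`, and the strict-equality leaf match are all hypothetical and not taken from either library.

```javascript
// Hypothetical AST shape (an assumption, not either library's real format):
//   leaf:   { field, term }
//   branch: { operator: 'AND' | 'OR', left, right }
//   unary:  { operator: 'NOT', operand }
function compile(ast) {
  switch (ast.operator) {
    case 'AND': {
      const l = compile(ast.left);
      const r = compile(ast.right);
      return row => l(row) && r(row);
    }
    case 'OR': {
      const l = compile(ast.left);
      const r = compile(ast.right);
      return row => l(row) || r(row);
    }
    case 'NOT': {
      const p = compile(ast.operand);
      return row => !p(row);
    }
    default:
      // Leaf: naive string equality on one field (illustrative only).
      return row => String(row[ast.field]) === String(ast.term);
  }
}

// Apply the compiled predicate once per row.
function filterRows(data, ast) {
  return data.filter(compile(ast));
}

// x:1 AND y:1 AND NOT z:1
const query = {
  operator: 'AND',
  left: {
    operator: 'AND',
    left: { field: 'x', term: '1' },
    right: { field: 'y', term: '1' },
  },
  right: { operator: 'NOT', operand: { field: 'z', term: '1' } },
};

const rows = [
  { x: 1, y: 1, z: 0 },
  { x: 1, y: 1, z: 1 },
  { x: 0, y: 1, z: 0 },
];

const matched = filterRows(rows, query);
```

Because every operator returns a function of a single row, nesting depth does not matter, and no intermediate result set has to be threaded from the left side to the right side.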

I'd be really interested in learning your approach. I can't seem to wrap my head around it. Like, why are you using numeric logic (0, Math.min) to begin with..? Shouldn't filtering simply use boolean logic? I.e. it should return just true/false.

If this builds from some mathematical construct I'm unaware of I'd be really interested in learning.

finwo commented

Global thoughts on your library:

You have a fair point, I haven't written this library in a very readable way. I should probably give it a rewrite to allow others (and myself) to actually read the code instead of just wondering what I wrote.

The reason I'm using math instead of boolean logic is that other Lucene implementations (like the one used in Solr) have the ability to give a score to an object (how well the object matches the query). The idea was to eventually implement a better scoring ability in the library once I figured out how those work in other libraries.

Filtering (in my library), should become as simple as giving a threshold of how well a document/object should match the query.
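As a rough illustration of threshold-based filtering, here is a minimal sketch. It is not the library's API: `filterByScore` and `termScore` are hypothetical names, the `text` field is made up, and the sketch assumes a higher score means a better match.

```javascript
// Hypothetical toy scorer: counts how many of the given terms occur in
// row.text. Assumption: a higher score means a better match.
function termScore(terms) {
  return row => terms.filter(t => row.text.includes(t)).length;
}

// Keep only the rows whose score reaches the threshold.
function filterByScore(data, score, threshold) {
  return data.filter(row => score(row) >= threshold);
}

const docs = [
  { text: 'red apple pie' },
  { text: 'green apple' },
  { text: 'blue sky' },
];

const score = termScore(['red', 'apple']);
const strict = filterByScore(docs, score, 2); // both terms must match
const loose = filterByScore(docs, score, 1);  // one matching term is enough
```

With the threshold at 2 only the first document is kept; lowering it to 1 also lets the partially matching second document through.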

For use as an abstract query engine, your solution using boolean logic actually makes more sense. The "loose match" feature of using a score gives related results instead of exact results if used properly with a low threshold, like search engines usually do.

TL;DR:

I like your approach, as it's better suited than my library as a query engine.

finwo commented

About the operator:

My implementation first applies the l and r filters to the given data. Each of these may itself be a composite, combining multiple filters using operators to produce a single, more complex filter.

Once a score for the left and right sections has been generated (rl and rr), the absolute value of each (value < 0 ? -value : value) is computed (rla and rra respectively).

The AND operation itself chooses the score which is closest to zero. This way, only a combination where both are far away from zero results in an output far away from zero.
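To make that concrete, here is the snippet quoted earlier in this issue wrapped into a small combinator. The `and(l, r)` wrapper and the constant scorers are assumptions added for illustration; only the body comes from the quoted code.

```javascript
// Combine two scoring filters into one. The body mirrors the snippet
// quoted in this issue; the and(l, r) wrapper itself is an assumption.
function and(l, r) {
  return data => {
    const rl = l(data) || 0; // a missing score counts as 0
    const rr = r(data) || 0;
    const rla = Math.abs(rl);
    const rra = Math.abs(rr);
    if (rla > rra) return rr; // right score is closer to zero
    if (rla < rra) return rl; // left score is closer to zero
    return Math.min(rl, rr);  // equal magnitude: prefer the negative one
  };
}

// Constant scorers, purely for demonstration.
const constant = value => () => value;

const bothStrong = and(constant(3), constant(5))({});  // 3
const oneZero = and(constant(0), constant(5))({});     // 0
const mixedSign = and(constant(2), constant(-2))({});  // -2
const nested = and(and(constant(4), constant(6)), constant(1))({}); // 1
```

A zero on either side pulls the combined score to zero, which is the numeric counterpart of `false && anything`.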

For the life of me, at this point I can't remember why negative numbers were a thing, but that might've had something to do with the AND NOT and OR NOT operators.