INL/BlackLab

Sensitive comparator treats different strings as identical

Closed this issue · 2 comments

In our forward index Terms classes, we assume that in the sensitive sort order, each and every term will get a unique sort index (i.e. no two terms are considered equal by the sensitive comparator).

But this is not the case, because collators and Unicode are complicated. For example, a weird Unicode whitespace character might not be recognized as whitespace by Java's regex engine, but be considered equal to a space by a sensitive collator. This leads to all sorts of problems (terms.indexOf(term) sometimes yielding -1 even though the term is in the index, among other issues).
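To make the mismatch concrete, here is a minimal, hypothetical illustration (not necessarily the exact character that triggered the bug), using plain java.text.Collator rather than whatever collator BlackLab actually configures: with canonical decomposition enabled, a precomposed character and its decomposed form compare as equal even though the strings are different.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorEqualityDemo {
    public static void main(String[] args) {
        // Precomposed "é" (U+00E9) vs. "e" + combining acute (U+0301):
        // different Java strings, but canonically equivalent in Unicode.
        String composed = "caf\u00E9";
        String decomposed = "cafe\u0301";

        Collator sensitive = Collator.getInstance(Locale.ROOT);
        sensitive.setStrength(Collator.TERTIARY);
        sensitive.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

        System.out.println(composed.equals(decomposed));             // false
        System.out.println(sensitive.compare(composed, decomposed)); // 0 -> "equal" to the collator
        // If the Terms index assumes compare() == 0 implies identical strings,
        // lookups like terms.indexOf(term) can come up empty or collide.
    }
}
```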

Another way to state this bug is that our normalization/desensitization operation during indexing doesn't go far enough. If it did, we would only be left with terms that are considered unique by the sensitive comparator.

Possible solutions:

1: Better normalization/desensitization while indexing

Make sure that, while indexing, any variation that the sensitive collator would consider equal is filtered out (using Unicode normalization, character replacement, etc.).

This could be tricky, because we would effectively be re-implementing the collator logic in our desensitization code, and it would be difficult to verify we cover every possibility. Also, switching to a different collator might cause new issues.
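A rough sketch of what option 1 could look like (the method name and the character list are illustrative, not BlackLab's actual code); the hard part is guaranteeing the replacement set really covers everything the collator folds together:

```java
import java.text.Normalizer;

public final class DesensitizeSketch {
    // Illustrative only: a handful of Unicode space variants mapped to a plain space.
    private static final String SPACE_VARIANTS = "\u00A0\u2000\u2001\u2002\u2003\u2009\u202F\u3000";

    public static String normalizeForIndex(String term) {
        // Canonical composition so composed/decomposed forms end up identical in the index.
        String normalized = Normalizer.normalize(term, Normalizer.Form.NFC);
        StringBuilder sb = new StringBuilder(normalized.length());
        for (int i = 0; i < normalized.length(); i++) {
            char c = normalized.charAt(i);
            sb.append(SPACE_VARIANTS.indexOf(c) >= 0 ? ' ' : c);
        }
        return sb.toString();
    }
}
```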

2: Make the sensitive collator stricter

Right now the sensitive collator is set to TERTIARY strength, but it could be set to IDENTICAL.

This is easy to do and would work, especially now that we do Unicode normalization while indexing. However, it would mean that certain strings that look the same (the ones causing issues now, with e.g. one of Unicode's weird whitespace characters) would be treated as distinct, which is likely not what users expect.
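For reference, the change option 2 describes is essentially a one-line strength setting (sketched here with java.text.Collator; an ICU collator would be configured analogously). At IDENTICAL strength the collator adds a final code-point-level tiebreaker, so compare() should only return 0 for strings that are identical after decomposition:

```java
import java.text.Collator;
import java.util.Locale;

public class StricterSensitiveCollator {
    public static Collator create() {
        Collator coll = Collator.getInstance(Locale.ROOT);
        coll.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        // IDENTICAL: on top of primary/secondary/tertiary differences, remaining
        // code point differences also count, so visually identical variants
        // (e.g. odd whitespace characters) become distinct terms.
        coll.setStrength(Collator.IDENTICAL);
        return coll;
    }
}
```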

3: Treat sensitive comparison the same as insensitive

We may have to accept that, when the sensitive collator says that two terms are equal, it does NOT follow that their strings are identical.

This would mean changing our forward index Terms class so that both insensitive and sensitive terms are sorted in groups of term strings considered equal by that collator. The sensitive ordering would often have groups of size 1, but not always. This might be the most robust real-world solution, one that works with every imaginable collator. It might slightly slow down certain operations, although likely not by much.
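A rough sketch of the grouping idea behind option 3 (hypothetical names, not the actual Terms implementation): sort the terms with the collator, then give every run of collator-equal terms the same sort position, for the sensitive and insensitive collator alike.

```java
import java.text.Collator;
import java.util.Arrays;

public class GroupedSortPositions {
    /**
     * Assigns each term a sort position such that terms the given collator
     * considers equal share the same position (group id).
     * The returned array is indexed by the term's original index.
     */
    public static int[] sortPositions(String[] terms, Collator collator) {
        Integer[] order = new Integer[terms.length];
        for (int i = 0; i < terms.length; i++)
            order[i] = i;
        // Order term indexes by the collator's ordering.
        Arrays.sort(order, (a, b) -> collator.compare(terms[a], terms[b]));

        int[] sortPosition = new int[terms.length];
        int group = -1;
        for (int rank = 0; rank < order.length; rank++) {
            // Start a new group whenever this term differs (per the collator) from the previous one.
            if (rank == 0 || collator.compare(terms[order[rank - 1]], terms[order[rank]]) != 0)
                group++;
            sortPosition[order[rank]] = group;
        }
        return sortPosition;
    }
}
```

With a sensitive collator most groups would have size 1, as noted above.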

It seems options 1 and 2 have clear downsides; option 1 especially seems like something that will never be truly watertight.
So I'd say implement option 3. I believe the performance downside is pretty negligible; we haven't run into any issues so far with the insensitive terms index code.

Agreed!

I did realize the issue is even more complicated, as we index sensitively and insensitively in Lucene as well, so there can still be slight differences in finding hits and sorting/grouping. But those should be less problematic, and we can address them as needed.