helix-editor/nucleo

Higher score for shorter matches?

jessegrosjean opened this issue · 9 comments

First, thanks for posting this project!

I know you are trying to match fzf, but I'm finding some of the fzf scoring hard to make sense of. For example, consider these two cases:

  1. Moby Dick
  2. Though I cannot tell why it was exactly that those stage managers, the Fates, put me down for this shabby part of a whaling voyage

If I search for "md" the second example scores the highest matching "me down". This is also fzf behavior, but it doesn't seem right to me. Would it make sense to incorporate the percent of matching indexes into the score calculation somehow?

So the reason that "me down" scores higher than "Moby Dick" is that it has a smaller gap between the matched characters (m and d) and otherwise has an identical score (both matches start on word boundaries). Nucleo (and fzf) are substring match tools, so they don't care how much text there is before or after the match.
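
For reference, a minimal sketch reproducing the comparison with the nucleo-matcher crate (type names are assumed from the current published API and may differ slightly between versions):

use nucleo_matcher::{Config, Matcher, Utf32Str};

fn score(matcher: &mut Matcher, haystack: &str, needle: &str) -> Option<u16> {
    let (mut hbuf, mut nbuf) = (Vec::new(), Vec::new());
    matcher.fuzzy_match(Utf32Str::new(haystack, &mut hbuf), Utf32Str::new(needle, &mut nbuf))
}

fn main() {
    let mut matcher = Matcher::new(Config::DEFAULT);
    // Both "md" matches start on a word boundary; the "me down" haystack should
    // win because the gap between the matched 'm' and 'd' is smaller, and the
    // unmatched text around the match is not penalized at all.
    let moby = score(&mut matcher, "Moby Dick", "md");
    let whaling = score(&mut matcher, "...the Fates, put me down for this shabby part of a whaling voyage", "md");
    println!("Moby Dick: {moby:?}, whaling voyage: {whaling:?}");
}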

It would be odd if you typed foobar and foo/test/bar matched higher than a/really/long/prefix/foobar.

In general, it's not a good idea to treat fuzzy matches as acronym matches (that's not what fzf or nucleo were designed for). If you don't care about the distance between the words, you should type m d instead (which is not quite the same thing, since each word is treated as its own match, at least with the high-level API/helix). That will give both texts the same score and therefore rank "Moby Dick" first, since shorter matches are preferred in case of ties.
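
As a rough illustration using the higher-level pattern API (assuming the current nucleo-matcher pattern module; the exact Pattern::parse signature may differ between versions):

use nucleo_matcher::pattern::{CaseMatching, Normalization, Pattern};
use nucleo_matcher::{Config, Matcher, Utf32Str};

fn main() {
    let mut matcher = Matcher::new(Config::DEFAULT);
    // "m d" is parsed as two words; each word is matched on its own, so the
    // distance between the two matches no longer matters. Both haystacks should
    // get the same score here; the higher-level nucleo matcher then breaks the
    // tie by preferring the shorter haystack.
    let pattern = Pattern::parse("m d", CaseMatching::Ignore, Normalization::Smart);
    let mut buf = Vec::new();
    for text in ["Moby Dick", "...put me down for this shabby part of a whaling voyage"] {
        let score = pattern.score(Utf32Str::new(text, &mut buf), &mut matcher);
        println!("{score:?}: {text}");
    }
}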

I don't think I would want to change the behavior of the matching algorithm here. I spent a lot of time looking into various matching algorithms and found that I agree with fzf's approach here.

Nucleo's implementation is so fast I plan to use it, but I think I will need to tweak the scoring somehow. Everywhere else that I try the "md" search, "Moby Dick" scores higher:

  • Sublime
  • Visual Studio
  • Xcode

Of the algorithms that you looked at, can you recommend a second-best that would perform more like the listed tools?

At the moment I'm post-processing the scores by adding (pattern.len / candidate.len) * 32. That helps, and works without modifying nucleo, but it doesn't seem like a very sophisticated approach.
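
Concretely, that post-processing looks roughly like this (the length-ratio weight of 32 is picked by eye, not derived from nucleo's constants):

// Rough sketch of the post-processing above: nucleo's integer score plus a
// small bonus that grows as the pattern covers more of the candidate.
fn boosted_score(score: u16, pattern_len: usize, candidate_len: usize) -> f32 {
    score as f32 + (pattern_len as f32 / candidate_len as f32) * 32.0
}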

I am not intimately familiar with all of these implementations (well, Xcode is closed source AFAIK, so I never looked at that). I think the main difference is likely that these tools penalize the unmatched text before the match start and after the match end, while nucleo and fzf don't. Like I said, nucleo (and fzf) are substring match tools, so that's an intentional design choice.

Your current post-processing is likely not a good idea and will break nucleo's scoring (for example by double-penalizing gaps within the pattern). The numbers are very finely tuned.

The best way to implement this would be to calculate that penalty: let penalty = match_start + (candidate.len() - match_end); (note that you should use the length of the Utf32Str here and absolutely not the normal Rust string length). You likely want to give that a fairly small weight by scaling the original score: let adjusted_score = (ADJUST * score) as i32 - penalty as i32;. How large you want to make ADJUST is subjective. Gaps inside the pattern have a penalty of 1 (although the first gap char has a penalty of 3). You likely want to penalize the prefix/postfix gap much less, so I think ADJUST should be at least 10, potentially even higher.

The problem with this approach is that nucleo does not compute the match start/end by default. You can use the indices method instead, but that adds overhead. The end is trivially known, but the start can only be computed once the entire match has been found, so that is not easy. It could be done slightly more efficiently than it currently is (by only saving the last index instead of all of them, but not much more efficiently). Since I am not really interested in supporting that kind of scoring in nucleo, you would probably need to go with the indices approach for now.
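
Putting the two suggestions together, a sketch of the indices-based approach might look like this (fuzzy_indices and Utf32Str are assumed from the current matcher API; ADJUST = 10 is just the starting point suggested above):

use nucleo_matcher::{Config, Matcher, Utf32Str};

const ADJUST: i32 = 10;

fn adjusted_score(matcher: &mut Matcher, haystack: Utf32Str, needle: Utf32Str) -> Option<i32> {
    let mut indices = Vec::new();
    let score = matcher.fuzzy_indices(haystack, needle, &mut indices)?;
    // Indices are character positions into the Utf32Str, not byte offsets.
    let match_start = *indices.first()?;
    let match_end = *indices.last()? + 1;
    // Penalize the unmatched prefix and suffix, scaled down relative to the
    // in-pattern gap penalties by multiplying the original score by ADJUST.
    let penalty = match_start + (haystack.len() as u32 - match_end);
    Some(ADJUST * score as i32 - penalty as i32)
}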

I saw in another comment that you are planning to polish up the nucleo matcher into a more final state.

I don't know if this fits with your vision, but I've been getting better results, for my purposes, by passing the current haystack index into the bonus_for function, like this:

impl MatcherConfig {
    #[inline]
    pub(crate) fn bonus_for(&self, prev_class: CharClass, class: CharClass, haystack_index: usize) -> u16 {
        self.bonus_for_class(prev_class, class) + self.bonus_for_haystack_index(haystack_index)
    }

    #[inline]
    fn bonus_for_haystack_index(&self, haystack_index: usize) -> u16 {
        if haystack_index == 0 {
            BONUS_BOUNDARY / 2
        } else {
            0
        }
    }

    #[inline]
    fn bonus_for_class(&self, prev_class: CharClass, class: CharClass) -> u16 {
   ...

With that change in place I no longer have to do any score-boost adjustment for shorter matches. I expect this behavior would be off by default, but it would be nice if it could be a config option. It would also be interesting if the various scoring constants were part of, and configurable through, MatcherConfig, for easy experimentation if nothing else.

In the end these changes weren't too hard, so if you decide not to include them it's not that big a deal; I can just maintain my own changes. Thanks again.

Thinking more... I'll be quiet soon.

Maybe the most configurable, and smallest, change to the matcher would be to give MatcherConfig an optional bonus_for closure. A closure would allow this crate to replace many custom fuzzy match schemes while providing much better performance.

I guess passing in (prev_class, class, haystack_index, haystack_string) would give pretty much unlimited flexibility.
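
Purely as an illustration of the shape being proposed (this is not an API nucleo exposes, and CharClass here is a stand-in, since the real type is private):

// Hypothetical sketch only: a user-supplied bonus closure on the config.
#[derive(Clone, Copy)]
pub enum CharClass { Whitespace, Delimiter, Lower, Upper, Number, NonWord }

pub type BonusFn = Box<dyn Fn(CharClass, CharClass, usize, &[char]) -> u16 + Send + Sync>;

pub struct CustomMatcherConfig {
    // ...existing MatcherConfig fields...
    pub bonus_for: Option<BonusFn>,
}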

I am not a fan of that at all. Firstly, I don't want to complicate the API, and I also don't want to give an API stability guarantee to the char classes. Those are implementation details which I don't plan to expose. The heuristic you showed also doesn't work well at all IMO. Skim actually does this by accident (they verbatim copied the comments explaining the motivation from fzf, but fzf/nucleo actually multiply the bonus given to the first pattern char instead of the first haystack char...). This leads to skim preferring some weird matches. For example, matching satus (forgetting the t) against src/status.rs matches s__/__atus.rs while nucleo and fzf match ___/s_atus.rs. This is particularly strange because it does not occur with a prefix: /src/status.rs behaves the same in all matchers. Single char bonuses like this don't really work well IMO (especially because there is already a bonus for a match at the start of the needle: the same bonus a word boundary receives).

I looked a bit at other matchers and found that matchers whose goal is to match the full word (instead of a substring) usually add a gap penalty for the starting gap. This could be done in the penalty function you showed, but I prefer to just build this in as an option (and somewhere else in the code, since that particular function is pretty hot). I think this could be a reasonable addition for downstream users that use nucleo more as an autocompletion tool instead of as a full fuzzy matcher.

Thanks for all your attention on this.

A bit more on my motivation. I want to use nucleo in my outliner app (like Workflowy, etc.) and use it to find and navigate to specific rows in the outline. Compared to file system paths, outline text has quite a bit more size variation. Some rows are short topic rows while other rows may contain long paragraphs of text. The general issue that I'm having is that the long paragraphs of text dominate the match score results. This is particularly problematic since short topic rows are the ones that I want to navigate to most often.

And for the short topic rows I really would just like to type acronyms to navigate to them. For example, right now I have a topic named "Fuzzy Search". To find/navigate to it I would like to just type "fs", but that's always matching random other notes. All my various hacks are trying to get around that problem.

I think adding a gap penalty as you suggest would help, but maybe still not solve the problem. You would have fewer "false" matches for short acronyms, but you would still match those acronyms often in the longer text.

With all that in mind, what I've been trying today (and what works best so far) is to just do a special-case acronym matching phase, compare that to normal fuzzy matching, and return the best result:

  • if needle[0] == haystack[0]
    • copy the first 4 acronym characters (letters after a space, etc.) from the haystack into a new string
    • use normal nucleo fuzzy scoring against that haystack acronym string
    • also use normal nucleo fuzzy scoring against the full haystack string
    • return the score/indexes that score highest

This gives priority to acronyms in the cases where I really want them to match, without unbalancing the normal nucleo scoring. It seems to be working great and I think it's probably better than trying to tilt the nucleo scoring to handle both cases.
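
A rough Rust sketch of that two-phase idea (the word splitting and the first-4 cut-off here are simplifications; the actual Swift code may differ):

use nucleo_matcher::{Config, Matcher, Utf32Str};

// Take the first letter of up to the first four words as the "acronym".
fn acronym(haystack: &str) -> String {
    haystack
        .split_whitespace()
        .filter_map(|word| word.chars().next())
        .take(4)
        .collect()
}

fn score(matcher: &mut Matcher, haystack: &str, needle: &str) -> Option<u16> {
    let (mut hbuf, mut nbuf) = (Vec::new(), Vec::new());
    matcher.fuzzy_match(Utf32Str::new(haystack, &mut hbuf), Utf32Str::new(needle, &mut nbuf))
}

fn best_score(matcher: &mut Matcher, haystack: &str, needle: &str) -> Option<u16> {
    let full = score(matcher, haystack, needle);
    // Only run the acronym phase when the first characters line up
    // (case handling is ignored here for brevity).
    let acro = if needle.chars().next() == haystack.chars().next() {
        score(matcher, &acronym(haystack), needle)
    } else {
        None
    };
    // None sorts below any Some(score), so this returns the better of the two.
    full.max(acro)
}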

I'm handling this outside of nucleo right now, but it might be useful to build in an "acronym" matching mode. That would handle extracting the acronym, and then mapping the indexes that match the acronym back into the full haystack.

If you think that would make sense as an addition I can try porting my Swift code to Rust and make a pull request. Also, I'm happy to just continue doing this outside of nucleo.

I don't think I would want to add an acronym matcher. It sounds like your out-of-tree implementation works well and I don't think the overhead matters, so I would just stick with that. I don't really want to expand the scope of nucleo too much.

Regarding the prefix gap bonus/penalty, you can give #16 a spin. The prefer_prefix behavior enables the behavior I described above. It's not acronym matching, but it should give a stronger preference to shorter and prefix matches.
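
A minimal sketch of enabling it (assuming #16 exposes this as a public prefer_prefix field on the config):

use nucleo_matcher::{Config, Matcher};

fn prefix_preferring_matcher() -> Matcher {
    // With prefer_prefix, matches that start earlier in the haystack score
    // higher, which indirectly prefers shorter and prefix matches when the
    // rest of the score is equal.
    let mut config = Config::DEFAULT;
    config.prefer_prefix = true;
    Matcher::new(config)
}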

I have, and it helps a lot. I think the results make a lot more sense for unstructured text. The crate is really working well for me. I also appreciate that you've moved pattern and utf32_str into the matcher package. You'll be pleased to know that I've run out of requests. Thanks again.