Inaccurate Results in Search Functionality

Question

knbr13 opened this issue 6 months ago · 4 comments

Hello Link-
Happy New Year!
I hope you are well.

I have been enjoying using this useful tool, but after multiple uses, I have identified an opportunity for improvement.
I will first illustrate the problem with an example.

Example:

My GitHub username is "knbr13", and the value passed to the --find flag is "git". The command I ran in dev mode is:

./gh-stars --user knbr13 --find git --debug --limit 50

Currently, I have 47 starred repos on my profile, which is why I set the limit to 50 to see all matched repos.

The results show 45 repos, but out of all my starred repos, only 2 are related to Git.
Therefore, the expected output is 2 repos, while the actual result is 45 repos.

Problem:

When the search term is short (as in the case above, "git" with 3 letters), it is likely to match a lot of repos. This is because the code uses the fuzzy.LevenshteinDistance function for string comparison, and the distance between two short, unrelated strings is small, so it almost always falls within the match range rank >= 0 && rank <= MAX_FUZZY_DISTANCE.

For example:
fuzzy.LevenshteinDistance("git", "is")
The return value (distance) is 2, so it is considered a match. The word "is" appears in many GitHub repo descriptions.
This is just one example, and there are many other similar cases.
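
To make the numbers above easy to reproduce, here is a minimal sketch. It uses a small hand-written levenshtein function as a stand-in for fuzzy.LevenshteinDistance so that it has no external dependencies, and maxFuzzyDistance = 2 is an assumption (the example above only implies that the real MAX_FUZZY_DISTANCE is at least 2). The isMatch helper is hypothetical and only mirrors the match range described above.

```go
package main

import "fmt"

// maxFuzzyDistance is an assumed threshold, not the tool's actual constant.
const maxFuzzyDistance = 2

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, where each
// insertion, deletion, or substitution costs 1. It is a stand-in for
// fuzzy.LevenshteinDistance.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr := make([]int, len(rb)+1)
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev = curr
	}
	return prev[len(rb)]
}

// isMatch mirrors the match range: rank >= 0 && rank <= threshold.
func isMatch(needle, word string) bool {
	rank := levenshtein(needle, word)
	return rank >= 0 && rank <= maxFuzzyDistance
}

func main() {
	for _, word := range []string{"is", "go", "git", "github"} {
		fmt.Printf("%q vs %q: distance=%d match=%v\n",
			"git", word, levenshtein("git", word), isMatch("git", word))
	}
}
```

Under these assumptions, "is" and "go" both count as matches for "git", while "github" (distance 3) does not, which is roughly the opposite of what a user would expect.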

Note: This problem is most pronounced when the --find value is short, but it also occurs with longer values.

The use of the priority queue partially solved this problem.

Why Partially?

Even though repos matched by name have higher priority, some repos match by name only because the name contains '-' or '_', even though the repo name is significantly different from the searched value.

Example:

One of my starred repos is "go-cleanhttp" by hashicorp. The strings.FieldsFunc function splits the repo name into 2 strings, ["go", "cleanhttp"].

While comparing repo names (in the loops), this function will be called:
fuzzy.LevenshteinDistance("git", "go") // "git" is the needle, "go" is the word
The return value (distance) is 2, which satisfies the match condition, so this item gets priority 1000 in the priority queue because it matched by repo name. But "git" and "go-cleanhttp" are very different: a user who searches for "git" is certainly not expecting something like "go-cleanhttp".
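
The splitting step can be sketched as follows. The splitRepoName helper is a hypothetical name, and the separator set ('-' and '_') is assumed from the description above rather than copied from the tool's code.

```go
package main

import (
	"fmt"
	"strings"
)

// splitRepoName breaks a repo name into words on '-' and '_'.
// The separator set is an assumption based on the issue text.
func splitRepoName(name string) []string {
	return strings.FieldsFunc(name, func(r rune) bool {
		return r == '-' || r == '_'
	})
}

func main() {
	// Each word is then compared against the needle on its own, so a
	// short word like "go" is enough to make the whole repo name match.
	for _, word := range splitRepoName("go-cleanhttp") {
		fmt.Println(word)
	}
}
```

Because each word is compared separately, the short prefix "go" decides the match and the much longer "cleanhttp" part is effectively ignored.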

Note

I have some ideas to potentially solve this issue.
However, I would first like to confirm whether you are interested in addressing this problem.
If so, I'm willing to contribute and create a pull request once a solution is developed.

Answer 1 · 2024-01-03T10:01:23.000Z

Great issue, thank you! The search function is very naive and needs a major overhaul. For now I recommend you use the more reliable https://github.com/Link-/starred_search

We need to improve the search functionality not by adding more conditions to the function, but by using something more robust. The other extension I referenced uses https://github.com/lucaong/minisearch which is super good and does very lightweight and fast indexing. It also has robust implementations of different search strategies.

If you can find a Go package that offers the same capabilities, let's explore using that instead of reinventing the wheel here.

Answer 2 · 2024-01-05T01:12:27.000Z

Okay, I'll check out the suggested alternative.

I do have a quick question regarding the search functionality.
Why not consider using the strings.Contains function for matching the search value?

For example:
- find "go", strings.Contains(repo.Name, "go") ==> ["go-github", "go-cleanhttp", "go-retryablehttp"].
- find "http", strings.Contains(repo.Name, "http") ==> ["go-cleanhttp", "go-retryablehttp", "httprate"].
- find "git", strings.Contains(repo.Name, "git") ==> ["go-github"].
The same approach works for repo topics and descriptions.

It seems more direct and addresses the issue of unintended matches with minimal complexity.
For this tool, I think it's more than enough.
What are your thoughts on this?
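
A rough sketch of what I have in mind, not the final code: the Repo struct, its fields, and the matches and filter helpers are hypothetical stand-ins for whatever the tool actually uses, and the case-insensitive comparison is my own assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Repo is a hypothetical, simplified stand-in for the tool's repo type.
type Repo struct {
	Name        string
	Description string
	Topics      []string
}

// matches reports whether needle occurs as a substring of the repo's
// name, description, or any topic. Lowercasing both sides makes the
// comparison case-insensitive (an assumption, not existing behavior).
func matches(r Repo, needle string) bool {
	n := strings.ToLower(needle)
	if strings.Contains(strings.ToLower(r.Name), n) ||
		strings.Contains(strings.ToLower(r.Description), n) {
		return true
	}
	for _, topic := range r.Topics {
		if strings.Contains(strings.ToLower(topic), n) {
			return true
		}
	}
	return false
}

// filter returns the repos that match needle, preserving input order.
func filter(repos []Repo, needle string) []Repo {
	var out []Repo
	for _, r := range repos {
		if matches(r, needle) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	repos := []Repo{
		{Name: "go-github"},
		{Name: "go-cleanhttp"},
		{Name: "go-retryablehttp"},
		{Name: "httprate"},
	}
	for _, needle := range []string{"go", "http", "git"} {
		fmt.Printf("find %q:", needle)
		for _, r := range filter(repos, needle) {
			fmt.Printf(" %s", r.Name)
		}
		fmt.Println()
	}
}
```

The trade-off is that substring matching does not tolerate typos in the search value, but for this tool I think that is acceptable.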

Answer 3 · 2024-01-05T19:18:16.000Z

@knbr13 - works for me, wanna create a PR so that we can test it out?

Answer 4 · 2024-01-05T23:10:07.000Z

yeah for sure,
I'll update the code, update the tests, then create a PR.