nfdi4plants/Swate

[BUG] Improve template search functionality

Closed this issue · 4 comments

When searching templates in ARCitect, the search results are sometimes suboptimal.

OS and framework information:

  • OS: Ubuntu 22.04
  • ARCitect version: v0.0.40

Describe the bug
Example:

  1. Searching by template name;
    • There are multiple templates in the primary list titled ENA - XXXX
    • Typing ENA in search bar turns up 0 results
    • Typing ENA - in search bar now turns up just one of the results

Screenshots example 1, template search

The list of templates contains several entries named ENA - ...:

image

However when we enter ENA in the search box we get no results:

image

And when we type ENA - in the search bar we get one of them as a result:

image

@Freymaurer can you move this to Swate?

Hey! Could you please open two issues for this? The two problems you describe are not related to each other. Feel free to keep this one for template search and open another one for term search.

done 👍

The reason behind this behavior is our search algorithm. We use the Sørensen-Dice coefficient on string bigrams. That is a lot of fancy words for "we look for similarity, and the more similar the two strings we compare, the higher the score". To filter out unfit results we apply a threshold. In your example, "ENA - " actually has more similarity to "SRA - Sequencing" than to the longer ENA names. For example, "ENA - Gene promoter annotated sequence" has ~30 mismatched characters, while "SRA - Sequencing" has only 11. Because the score is normalized by the combined length of both strings, a short query scores poorly against a long name even when the name contains the query in full. This very flexible calculation makes it possible to find results that are only partially similar to the query. To avoid the issues you describe, we now adjust the score as follows:

  • Increase score drastically if it starts with query (+0.5)
  • Increase score if contains query (+0.3)

Note

Threshold is 0.3
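
To make the numbers concrete, here is a minimal Python sketch of the scoring described above. It is not Swate's actual implementation: the function names, the lowercasing, the multiset bigram counting, and the choice to apply only the larger of the two bonuses are all assumptions made for illustration. Only the bonus values (+0.5, +0.3) and the threshold (0.3) come from this thread.

```python
from collections import Counter

THRESHOLD = 0.3  # value stated in this thread


def bigrams(text):
    """Return the overlapping character pairs of text (lowercased; an assumption)."""
    s = text.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a, b):
    """Sørensen-Dice coefficient on bigram multisets: 2 * |A ∩ B| / (|A| + |B|)."""
    x, y = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2 * overlap / total


def adjusted_score(query, name):
    """Dice score plus the bonuses from this thread.

    Assumption: the bonuses are mutually exclusive (prefix match wins).
    """
    score = dice(query, name)
    q, n = query.lower(), name.lower()
    if n.startswith(q):
        score += 0.5
    elif q in n:
        score += 0.3
    return score


if __name__ == "__main__":
    query = "ENA - "
    for name in ("SRA - Sequencing", "ENA - Gene promoter annotated sequence"):
        raw, adj = dice(query, name), adjusted_score(query, name)
        print(f"{name!r}: raw={raw:.3f} adjusted={adj:.3f} shown={adj >= THRESHOLD}")
```

Under these assumptions the raw score of "ENA - " against "SRA - Sequencing" is 0.4 (above the threshold), while against "ENA - Gene promoter annotated sequence" it is about 0.24 (below it), which matches the behavior reported in the issue. With the prefix bonus, the ENA template scores well above the threshold.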

Image