Figure out how to reduce boosts on repeated terms

Question

Opened this issue 7 months ago · 0 comments

I'm trying to solve a problem where, for example, https://name-resolution-sri-dev.apps.renci.org/lookup?string=brigatinib&autocomplete=true&offset=0&limit=10 returns UMLS:C4550665 as the top result, possibly simply (?) because its preferred name contains "brigatinib" twice. That gives it a score of 135.28891, versus 98.12439 for the second result.
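To illustrate why a repeated term can raise a score, here is a minimal sketch of the term-frequency part of the textbook BM25 formula. It assumes the index uses a BM25-style similarity with the common defaults k1=1.2 and b=0.75, which I have not verified for this deployment; the function name and the document lengths are made up for illustration, and IDF, field boosts, and everything else that goes into the real score are left out.

```python
def bm25_tf_factor(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency part of the textbook BM25 formula (IDF omitted).

    k1 and b are the commonly quoted defaults; they are an assumption
    about the deployment, not something read from its configuration.
    """
    # Length normalisation: longer-than-average fields are penalised.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + norm)


# Hypothetical name fields of average length, with the term once vs. twice.
once = bm25_tf_factor(tf=1, doc_len=4, avg_doc_len=4)
twice = bm25_tf_factor(tf=2, doc_len=4, avg_doc_len=4)
print(once, twice)
```

Under these assumptions, the second occurrence raises the factor from about 1.0 to about 1.375, so a repeated term helps noticeably but less than doubling would suggest. A longer name would also be penalised by the length normalisation, which this toy comparison deliberately holds constant.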

One powerful tool we have is the boost query (bq), which allows us to say e.g. bq=names:human^2 to boost documents that have human in the names field. I'm not sure how to use this here, but I'm looking into it. It may also let us say things like clique_identifier_count:[5 TO *]^10 to strongly boost cliques with 5 or more identifiers.
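As a rough sketch of what such a request could look like, the snippet below builds a Solr query string that passes two boost queries through Solr's `bq` parameter, which may be repeated. The `names` and `clique_identifier_count` fields come from this issue; using `qf=names`, `defType=edismax`, and the helper name `build_boosted_query` are my assumptions, and the boost values are just the examples above, not tuned numbers.

```python
from urllib.parse import urlencode, parse_qs


def build_boosted_query(user_query):
    """Build a Solr query string with additive boost queries (bq).

    Hypothetical helper: the field names are from the issue, the rest
    (qf, defType, boost values) is assumed for illustration.
    """
    params = [
        ("defType", "edismax"),
        ("q", user_query),
        ("qf", "names"),  # assumption: search the names field
        # bq may be given several times; each clause adds to the score
        # of matching documents without changing which documents match.
        ("bq", "names:human^2"),
        ("bq", "clique_identifier_count:[5 TO *]^10"),
    ]
    return urlencode(params)


query_string = build_boosted_query("brigatinib")
print(query_string)
```

Note that the range `[5 TO *]` is inclusive, so it matches cliques with exactly 5 identifiers as well; `{5 TO *]` would be the exclusive form.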

Chatting with ChatGPT about this raised two possibilities:

Phrase slop (ps) controls how many token positions apart the search terms may be and still count as a phrase match, e.g. a search for brown fox with ps=2 will match "brown token1 token2 fox". Setting ps=0 will restrict phrase matches to the adjacent "brown fox" only (ps=1 would still allow one intervening token) and (apparently) mean that "brown fox brown fox"
According to ChatGPT, "The DisMax parser doesn't automatically boost documents based on the number of times a term appears within the query." So if we can map everything we want to do to DisMax instead of eDisMax, we might end up in an overall better place.
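To make the slop idea from the first possibility concrete, here is a deliberately simplified model of it for a two-term phrase. It only handles terms that appear in query order and counts the tokens between them; as far as I understand, Lucene's real sloppy phrase matching also allows reordered terms at an extra slop cost, which this sketch ignores. The function name is made up for illustration.

```python
def matches_with_slop(tokens, first, second, slop):
    """Return True if `first` is followed by `second` with at most
    `slop` other tokens in between (in-order matches only).

    Simplified model of phrase slop, not Lucene's actual algorithm.
    """
    for i, token in enumerate(tokens):
        if token != first:
            continue
        for j in range(i + 1, len(tokens)):
            # j - i - 1 is the number of tokens between the two terms.
            if tokens[j] == second and j - i - 1 <= slop:
                return True
    return False


gapped = "brown token1 token2 fox".split()
adjacent = "brown fox".split()

print(matches_with_slop(gapped, "brown", "fox", slop=2))    # two tokens in between
print(matches_with_slop(gapped, "brown", "fox", slop=1))    # gap is too wide
print(matches_with_slop(adjacent, "brown", "fox", slop=0))  # adjacent terms
```

In this model a slop of 0 accepts only adjacent terms, and each extra unit of slop tolerates one more intervening token.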