elixir-haystack/haystack

Custom stop words


Is it possible to customize the stop words used, so I can provide a different list other than the default one or disable stop words?

Context: I'm setting up Haystack for the search in https://rocketvalidator.com/html-validation - currently it just uses a simple search by substring but I want to use Haystack instead. So far it's going great!

During the integration, I found that many searches did not return the results I expected. It looks like that was because most of the titles include characters like double quotes:

https://rocketvalidator.com/html-validation/a-link-element-must-not-appear-as-a-descendant-of-a-body-element-unless-the-link-element-has-an-itemprop-attribute-or-has-a-rel-attribute-whose-value-contains-dns-prefetch-modulepreload-pingback-preconnect-prefetch-preload-prerender-or-stylesheet

So when I searched for something containing double quotes, these guides would appear first as they scored higher because they have many double quotes.

I guess this could be solved by adding the double quotes (and other characters like parentheses, brackets, < and >, etc.) to the stop words. My workaround was to clean up the strings, both when loading the documents and when searching:

defp cleanup(str) do
  str
  |> String.replace(["“", "”", "<", ">", "(", ")", ".", ",", ";", ":"], "")
  |> String.trim()
end

After that, I found that a search for "must not appear", like this one: https://rocketvalidator.com/html-validation?search=must+not+appear, returned no results with Haystack. That's because all three words are stop words.
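
To illustrate why that query comes back empty, here is a minimal sketch of stop word filtering in plain Elixir. The `StopWordDemo` module and its short stop word list are assumptions made up for this example; Haystack's real default list is longer and its internals may differ.

```elixir
defmodule StopWordDemo do
  # Hypothetical, abbreviated stop word list (not Haystack's actual list).
  @stop_words ~w(a an and appear must not of the)

  # Split a query into lowercase tokens and drop every token that is a stop word.
  def tokens(query) do
    query
    |> String.downcase()
    |> String.split(~r/\s+/, trim: true)
    |> Enum.reject(&(&1 in @stop_words))
  end
end

# Every word is a stop word, so no terms are left to match against the index.
IO.inspect(StopWordDemo.tokens("must not appear"))
# prints []

# A query with some non-stop words keeps those terms.
IO.inspect(StopWordDemo.tokens("link element must not appear"))
# prints ["link", "element"]
```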

Finally, for non-English content it would be great to be able to customize the stop words.

Hey @jaimeiniesta,

Yeah, you can pass a custom list of transformer modules when adding a field:

|> Keyword.put_new(:transformers, Transformer.default())

So you could either pass your own implementation of stop words, or remove it completely. And you can do that on a per-field basis.
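
As a rough sketch of the idea, here is what a custom stop words step and a pipeline without one could look like. It is written in plain Elixir so it runs without Haystack, and everything in it is an assumption for illustration: the `Downcase`, `SpanishStopWords` and `Pipeline` modules, the `transform/1` function name, and the use of plain strings as tokens. Haystack's actual transformer behaviour and token type may differ, so check the library source before adapting it.

```elixir
defmodule Downcase do
  # Hypothetical transformer: lowercases every token.
  def transform(tokens), do: Enum.map(tokens, &String.downcase/1)
end

defmodule SpanishStopWords do
  # Hypothetical custom stop word list for Spanish content.
  @stop_words ~w(de el en la los no un una y)

  # Hypothetical transformer: drops tokens found in the custom list.
  def transform(tokens), do: Enum.reject(tokens, &(&1 in @stop_words))
end

defmodule Pipeline do
  # Runs each transformer module, in order, over the token list.
  def run(tokens, transformers) do
    Enum.reduce(transformers, tokens, fn mod, acc -> mod.transform(acc) end)
  end
end

tokens = ["El", "elemento", "no", "debe", "aparecer"]

# Custom stop words in place of a default English list.
IO.inspect(Pipeline.run(tokens, [Downcase, SpanishStopWords]))
# prints ["elemento", "debe", "aparecer"]

# Stop words left out of the pipeline entirely: every token is kept.
IO.inspect(Pipeline.run(tokens, [Downcase]))
# prints ["el", "elemento", "no", "debe", "aparecer"]
```

If the `:transformers` option works as described, a list of modules like these is what would be passed for a field in place of `Transformer.default()`.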

Again this needs to be added to the documentation 😅

Ah, that's cool then. I'll wait for that documentation. Thanks! 😎